Suchada Manowon and Pruet Boonma

Published in Data Science and Engineering (DSE) Record 2023 Vol. 4 No. 1 pp. 108-124

Abstract

Flight delays remain a persistent challenge, affecting airline and airport productivity, passenger experience, and financial resources. At present, air transportation data rely predominantly on administrative records from various institutions. This study aims to design and implement an effective data pipeline system capable of capturing high-frequency data from diverse sources through batch processing. The pipeline covers every stage of an end-to-end data pipeline: data sourcing, ingestion, processing, storage, and analysis. It extracts data from several datasets, including flight data, airport information, airline details, airplane specifications, and routes, and ingests them through web scraping, APIs, and database loading. It then consolidates the flight information, transforms and cleans the data, and loads the result into a designated destination database. In addition, this study establishes an automated batch processing platform using Apache Airflow. The platform is evaluated comprehensively across three essential aspects: (1) system metrics, including memory and disk usage; (2) job metrics extracted from Airflow metrics, which are used to monitor processes and ensure smooth execution; and (3) data quality metrics that assess six dimensions (accuracy, validation, completeness, consistency, uniqueness, and timeliness) to ensure the usability of the defined data. The resulting flight dataset is used for data analysis and data visualization, including a comparison of various base regression models for flight delay prediction. Flight data dashboards additionally offer data insights. The implications of this multifaceted approach extend to enhancing air transportation statistics, improving predictive modeling capabilities, and facilitating data-driven decision-making processes.
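To make the data quality dimensions named in the abstract more concrete, the following is a minimal sketch of how three of them (completeness, uniqueness, and timeliness) might be scored on a batch of flight records. The abstract does not give the study's metric definitions, so the formulas here are common textbook ratios chosen as assumptions, and the record fields (`flight_id`, `dep_delay_min`, `fetched_at`), the sample values, and the six-hour freshness window are all hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical flight records; field names and values are illustrative only.
RECORDS = [
    {"flight_id": "TG101", "dep_delay_min": 12, "fetched_at": datetime(2023, 1, 1, 10, 0)},
    {"flight_id": "TG102", "dep_delay_min": None, "fetched_at": datetime(2023, 1, 1, 10, 0)},
    {"flight_id": "TG101", "dep_delay_min": 12, "fetched_at": datetime(2023, 1, 1, 10, 0)},
    {"flight_id": "FD300", "dep_delay_min": None, "fetched_at": datetime(2023, 1, 1, 4, 0)},
]


def completeness(records, field):
    """Fraction of records with a non-missing value for `field` (assumed definition)."""
    if not records:
        return 0.0
    return sum(r.get(field) is not None for r in records) / len(records)


def uniqueness(records, key):
    """Number of distinct `key` values divided by the record count (assumed definition)."""
    if not records:
        return 0.0
    return len({r[key] for r in records}) / len(records)


def timeliness(records, field, now, max_age):
    """Fraction of records whose `field` timestamp is at most `max_age` old (assumed definition)."""
    if not records:
        return 0.0
    return sum(now - r[field] <= max_age for r in records) / len(records)


if __name__ == "__main__":
    now = datetime(2023, 1, 1, 12, 0)
    print("completeness:", completeness(RECORDS, "dep_delay_min"))
    print("uniqueness:", uniqueness(RECORDS, "flight_id"))
    print("timeliness:", timeliness(RECORDS, "fetched_at", now, timedelta(hours=6)))
```

Each score is a ratio between 0 and 1, which would make it straightforward to log per batch run and compare against a threshold. Accuracy, validation, and consistency are left out because scoring them depends on reference data and rules that the abstract does not specify.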