Piriya Boonchot
Published in Data Science and Engineering (DSE) Record 2025 Vol. 6 No. 1 pp. 566-581
Abstract
Air pollution from fine particulate matter (PM2.5) is a significant environmental concern in Northeastern Thailand. This study aims to identify the relationship between satellite-derived variables and PM2.5 concentrations and to establish an effective machine learning model for PM2.5 estimation. Sentinel-5P satellite data comprising seven atmospheric variables, namely Carbon Monoxide (CO), Formaldehyde (HCHO), Nitrogen Dioxide (NO2), Ozone (O3), Sulfur Dioxide (SO2), Methane (CH4), and Aerosol Index (AI), were analyzed alongside ground-based PM2.5 measurements from 2018 to 2023. Pearson correlation analysis showed that CO (r = 0.72) and NO2 (r = 0.51) had the strongest linear relationships with PM2.5 levels. Based on statistical significance and regional source characteristics, five variables (CO, NO2, HCHO, AI, and O3) were selected as input features. Several predictive algorithms were developed and evaluated, including Decision Tree Regression (DTR), Support Vector Regression (SVR), Polynomial Regression (PR), Multilayer Perceptron (MLP), and Convolutional Neural Network (CNN). The CNN model performed best, with the lowest Mean Absolute Error (MAE) of 7.87 μg/m³ and the highest Coefficient of Determination (R²) of 0.63. Although the model was limited in estimating peak concentrations during extreme haze episodes because of signal saturation, it proved capable of monitoring seasonal trends and regional distribution. These findings highlight the usefulness of deep learning models and remote sensing data as supporting tools for air quality monitoring in regions with limited ground-based observations.
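For readers unfamiliar with the three statistics quoted in the abstract (Pearson r, MAE, and the coefficient of determination), the following minimal Python sketch shows how each is conventionally computed. It is not the study's code: the function names and the toy arrays are illustrative assumptions, and the numbers are made up purely to exercise the formulas.

```python
import numpy as np


def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between observed and estimated values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = ((y_true - y_pred) ** 2).sum()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    return float(1.0 - ss_res / ss_tot)


# Toy values (hypothetical, NOT data from the study).
co = [1.0, 2.0, 3.0, 4.0, 5.0]                # stand-in for a satellite CO column
pm25_obs = [10.0, 20.0, 30.0, 40.0, 50.0]     # stand-in for ground PM2.5
pm25_pred = [12.0, 18.0, 33.0, 39.0, 48.0]    # stand-in for model estimates

print("r   =", pearson_r(co, pm25_obs))
print("MAE =", mean_absolute_error(pm25_obs, pm25_pred))
print("R2  =", r_squared(pm25_obs, pm25_pred))
```

Note that r describes the linear association between one input variable and PM2.5, whereas MAE and R² compare model estimates against observations, so the two kinds of figures in the abstract answer different questions.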