Issue: October 2021 Vol. 2 No. 1


Wachiranun Sirikul and Trasapong Thaiupathump

Published in Data Science and Engineering (DSE) Record 2021 Vol. 2 No. 1 pp. 14-18

Thailand is a middle-income country where road traffic injuries are among the most serious public health concerns. Machine learning (ML) algorithms are now widely used for predictive analytics in public health. We therefore developed a multi-layer perceptron (MLP) classifier from road traffic accident driver data in Thailand, aiming to identify high-risk drivers who sustained severe injuries in road traffic accidents. However, imbalanced data is a typical problem in public health data and causes an “accuracy paradox”, in which the model tends to predict the majority class. Accurately detecting the minority class is especially important in public health data because that class is associated with high-impact events and serious adverse outcomes. Because imbalance is unavoidable given the nature of public health data, rebalancing strategies and other data-level approaches were applied to counter the problem. Over-sampling techniques significantly improved the discrimination performance of the models compared with under-sampling or no rebalancing.
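As an illustration of the rebalancing idea, the following is a minimal sketch of random over-sampling in plain Python. The abstract does not say which over-sampling technique was used, so this is only one simple option; the function name `random_oversample` and the toy data are hypothetical.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until every class
    has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # All original rows belonging to this class.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels

# Hypothetical toy data: label 1 = severe injury (minority), 0 = non-severe.
rows = [[25, 1], [40, 0], [33, 0], [51, 1], [19, 0], [62, 0]]
labels = [0, 0, 0, 0, 0, 1]

balanced_rows, balanced_labels = random_oversample(rows, labels)
print(Counter(balanced_labels))
```

In practice the resampling would normally be applied to the training split only, so that the evaluation data keeps its original class distribution.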

Thanapon Chaijunla and Phimphaka Thaiupathump

Published in Data Science and Engineering (DSE) Record 2021 Vol. 2 No. 1 pp. 1-13

This study aims to develop models for predicting the retail price of jewelry using data from online retail diamond ring stores. The data comprise 2,206 records of rings and 187,821 records of loose diamonds. The study develops and compares the performance of three models: Multiple Linear Regression (MLR), Random Forest (RF), and Deep Neural Network (DNN). Prediction accuracy is evaluated with the mean absolute error (MAE) and the mean absolute percentage error (MAPE). The results show that the MAE for ring price prediction with MLR, RF, and DNN is $688.36, $235.33, and $273.00, respectively, and that the MAE for diamond price prediction with MLR, RF, and DNN is $3254.03, $450.44, and $445.94, respectively. RF and DNN are therefore more accurate than MLR, while the accuracy of RF and DNN differs only slightly.
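For reference, the two evaluation metrics named above can be computed as in the sketch below. The prices are made-up illustrative values, not data from the study, and MAPE as written assumes that no actual price is zero.

```python
def mae(actual, predicted):
    """Mean absolute error, in the same unit as the prices."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical prices in dollars, for illustration only.
actual = [1000.0, 2000.0, 4000.0]
predicted = [1100.0, 1900.0, 4400.0]

mae_value = mae(actual, predicted)
mape_value = mape(actual, predicted)
print(f"MAE = {mae_value:.2f}, MAPE = {mape_value:.2f}%")
```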

Panithan Intharawicha and Phisanu Chiawkhun

Published in Data Science and Engineering (DSE) Record 2021 Vol. 2 No. 1 pp. 19-23

This independent study analyzes an online social network on the well-known website “Pantip” to present trends in customer interest in the active ingredients of facial moisturizer products. Customers’ opinions on active ingredients were collected from threads in the beauty forum of the Pantip website using Python. The collected data were analyzed using tokenization and word counts. The three most popular active ingredients were collagen, retinol, and niacinamide, in that order. Based on the keywords associated with these ingredients, customers mentioned collagen in terms of moisture, brightness, and resilience; retinol in relation to anti-aging, pore firming, and moisture; and niacinamide in relation to moisture, resilience, and pore firming. When the data were plotted as word clouds, the five most popular keywords for each ingredient were similar in type but differed in order. However, a Friedman test showed no considerable difference in the ranks of the keywords across the active ingredients. It can therefore be concluded that customers commented on the top three active ingredients with no significant difference.
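The tokenization and word-count step can be sketched as follows. This is an assumption-laden illustration: the sample posts are invented English text, and the regular-expression tokenizer stands in for whatever tokenizer the study used (Thai-language forum text would normally need a dedicated word segmenter).

```python
import re
from collections import Counter

# Ingredient names taken from the abstract.
INGREDIENTS = {"collagen", "retinol", "niacinamide"}

def count_ingredients(posts):
    """Tokenize each post into lowercase words and count ingredient mentions."""
    counts = Counter()
    for post in posts:
        tokens = re.findall(r"[a-z]+", post.lower())
        counts.update(t for t in tokens if t in INGREDIENTS)
    return counts

# Hypothetical forum posts, for illustration only.
posts = [
    "Collagen cream keeps moisture all day",
    "Retinol or collagen for anti-aging?",
    "niacinamide and collagen serum",
]

counts = count_ingredients(posts)
print(counts.most_common())
```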

Ekkarach Sunanta and Pruet Boonma

Published in Data Science and Engineering (DSE) Record 2021 Vol. 2 No. 1 pp. 24-28

The objective of this independent study was to develop a system that summarizes the health status of multiple relational databases using ETL (Extract, Transform, Load) techniques, which extract data from various sources, transform it, and load it into a master database. The health status data is then analyzed and presented using a business intelligence tool. The system provides a convenient and time-saving platform for observing the availability of each database and identifying anomalies. It also allows administrators to detect potential problems, such as a full disk, which can stop the system and consequently damage its reliability. The proposed system is evaluated in two respects. First, the time spent preparing a database health summary to check for abnormalities is compared between the existing system and the proposed system. Second, the accuracy and completeness of the information provided by the proposed approach are compared with those of the current system. The evaluation found that the proposed system achieves the expected time saving, contains complete information as expected, and presents data that was verified as consistent with the actual data in each source database.
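The extract-and-load flow described above can be sketched with in-memory SQLite databases. Everything here is an assumption made for illustration: the abstract does not name the database engines or the health metrics, so the source names, the `db_health` table, and the chosen metrics (file size and table count) are hypothetical.

```python
import sqlite3

def extract_health(name, conn):
    """Extract: collect simple health metrics from one source database."""
    page_count = conn.execute("PRAGMA page_count").fetchone()[0]
    page_size = conn.execute("PRAGMA page_size").fetchone()[0]
    tables = conn.execute(
        "SELECT COUNT(*) FROM sqlite_master WHERE type = 'table'"
    ).fetchone()[0]
    # Transform: derive the size in bytes from page count and page size.
    return {"source": name, "size_bytes": page_count * page_size, "tables": tables}

def load_health(master, rows):
    """Load: write one summary row per source into the master database."""
    master.execute(
        "CREATE TABLE IF NOT EXISTS db_health "
        "(source TEXT, size_bytes INTEGER, tables INTEGER)"
    )
    master.executemany(
        "INSERT INTO db_health VALUES (:source, :size_bytes, :tables)", rows
    )
    master.commit()

# Two hypothetical source databases.
sales = sqlite3.connect(":memory:")
sales.execute("CREATE TABLE orders (id INTEGER)")
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INTEGER)")
crm.execute("CREATE TABLE contacts (id INTEGER)")

master = sqlite3.connect(":memory:")
load_health(master, [extract_health("sales", sales), extract_health("crm", crm)])
summary = master.execute(
    "SELECT source, tables FROM db_health ORDER BY source"
).fetchall()
print(summary)
```

A business intelligence tool would then read from the master table rather than querying each source directly.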

Thanawat Kaewwiroon and Juggapong Natwichai

Published in Data Science and Engineering (DSE) Record 2021 Vol. 2 No. 1 pp. 29-35

This independent study aims to develop a data pipeline system that preserves privacy during data ingestion. The system applies the k-anonymity method with generalization and suppression. Loss of data precision is taken into account and minimized using the Preferred Minimal Generalization Algorithm (MinGen). The system first calculates the data precision for every possible combination of levels in the domain generalization hierarchies. It then determines which pattern of generalization levels satisfies the required k value for the data set. Finally, the data in the database system is transformed into its privacy-preserving form. To evaluate the data pipeline system, a synthetic demographic dataset was generated, and the domain categories and generalization levels of its quasi-identifier attributes were defined. The success indicator in this study is the processing time needed to satisfy the k value for different numbers of data records. The results show that larger data sets required less processing time to satisfy the k value, because more records increase the likelihood that at least k records share the same values. Thus, the Privacy Preservation for Data Ingression in Data Pipeline system developed in this study can process data to satisfy k-anonymity while minimizing the loss of data precision for demographic data in a data pipeline.
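The idea of choosing the least general level that still satisfies k can be sketched as below. This is a simplified illustration, not the MinGen algorithm itself: it searches a single hypothetical generalization hierarchy for age (exact value, 10-year band, suppressed) and leaves the second quasi-identifier unchanged. The hierarchy, function names, and records are all assumptions.

```python
from collections import Counter

def generalize_age(age, level):
    """Level 0: exact age; level 1: 10-year band; level 2: suppressed."""
    if level == 0:
        return str(age)
    if level == 1:
        low = (age // 10) * 10
        return f"{low}-{low + 9}"
    return "*"

def is_k_anonymous(records, k):
    """True if every quasi-identifier combination occurs at least k times."""
    return all(n >= k for n in Counter(records).values())

def minimal_level(ages, zips, k, max_level=2):
    """Return the lowest age generalization level that satisfies k-anonymity.

    Lower levels keep more precision, so the first level that works is preferred.
    """
    for level in range(max_level + 1):
        records = [(generalize_age(a, level), z) for a, z in zip(ages, zips)]
        if is_k_anonymous(records, k):
            return level, records
    return None, []

# Hypothetical synthetic demographic records (age, postal code).
ages = [23, 27, 25, 41, 45, 48]
zips = ["50200"] * 6

level, records = minimal_level(ages, zips, k=2)
print(level, records[0])
```

With these toy records the exact ages are all distinct, so level 0 fails, while the 10-year bands group the records into two classes of three and satisfy k = 2.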

Sutthinun Peauut and Pruet Boonma

Published in Data Science and Engineering (DSE) Record 2021 Vol. 2 No. 1 pp. 36-44

This study aims to develop a data warehouse system that integrates data from the many sources of the Provincial Waterworks Authority (PWA). The system performs Extract-Transform-Load (ETL) and displays a map showing water meters whose coordinates in the monthly meter readings disagree with the installation points recorded in the PWA’s Geographic Information System (GIS). PWA administrators can therefore inspect these meters and plan surveys of the installation sites to keep improving the information in the GIS. The system was evaluated by checking the accuracy of the computed differences between the monthly reading coordinates and the GIS coordinates, together with the resulting abstract visualizations, a plan and a graph showing the meter-reading routes with many incorrect coordinates. Users can use this information when planning surveys of water meter installation sites. The evaluation also compares the working time of the existing process with that of the newly developed system, and found that the new system takes less time than the users’ current manual operation.
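One way to flag mismatched coordinates is to compute the great-circle distance between the reading position and the GIS position and compare it with a threshold. The abstract does not state how the difference was measured, so the haversine formula, the 50 m threshold, and the sample meters below are assumptions for illustration.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    r = 6371000.0  # mean Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def flag_mismatches(meters, threshold_m=50.0):
    """Return the ids of meters whose reading position is far from the GIS position."""
    return [
        m["id"]
        for m in meters
        if haversine_m(*m["reading"], *m["gis"]) > threshold_m
    ]

# Hypothetical meters: (latitude, longitude) from the monthly reading and from the GIS.
meters = [
    {"id": "A", "reading": (18.7883, 98.9853), "gis": (18.7883, 98.9853)},
    {"id": "B", "reading": (18.7883, 98.9853), "gis": (18.7983, 98.9853)},
]

flagged = flag_mismatches(meters)
print(flagged)
```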

Kritchanut Chaimongkon and Juggapong Natwichai

Published in Data Science and Engineering (DSE) Record 2021 Vol. 2 No. 1 pp. 45-48

This independent study analyzes the factors affecting customer prioritization for the International Accreditation Services business using multiple data sources, and presents a conceptual view of why such prioritization matters. Because the current offering model does not prioritize customers or take a number of factors into account, this work can lead to more effective service offerings. The factors are analyzed through information visualization to give an overall picture of the data along each dimension. The model is implemented in Microsoft Power BI, which makes it possible to draw observations from various types of data. The main insights are as follows. Food factory businesses made the highest number of requests for certification, followed by electronic parts manufacturing and the energy sector. In terms of financial viability, the ability to spend on investments among those interested in purchasing the certification service was approximately four times higher than that of the non-interested group. These factors, together with registered capital, income, and asset value, can further help prioritize service offerings in the future.

Zihao Zhao and Phisanu Chiawkhun

Published in Data Science and Engineering (DSE) Record 2021 Vol. 2 No. 1 pp. 49-54

This research aims to measure the quality of Chiang Mai University graduates, construct a model that can predict income accurately, and identify the variables that affect the quality of Chiang Mai University graduates. Through data collection and integration, we obtained data on Chiang Mai University students from academic years 2012-2014. We then used three machine learning models (artificial neural networks, logistic regression, and support vector machines) to perform multi-class classification. All three give relatively good prediction results with good accuracy. The results show that the income of graduates of Chiang Mai University is normally distributed: most graduates have a medium income, and a small number earn high or low incomes. The best way to increase income for students who have just entered university is to improve their English scores and choose medical-related majors. For senior students, choosing to study for a higher degree and maintaining a high GPA is a very effective way to increase their income.

Chartchai Doungsa-ard and Patcharaprapa Khamkhiaw

Published in Data Science and Engineering (DSE) Record 2021 Vol. 2 No. 1

Software Maintenance Fixed Time Classification from Code Smell