Volume: Vol. 5 2024


Issue: No. 1 March

Nathakit Keawtoomla, Arinya Pongwat, and Jakramate Bootkrajang

Published in Data Science and Engineering (DSE) Record 2024 Vol. 5 No. 1 pp. 1-28

With the rapid growth of the food delivery industry, there is an urgent need to manage software effectively for sharing economy applications. One way to evaluate the effectiveness of these applications is by examining user concerns and feedback. We propose to use a Bi-LSTM-CNN model in a pipeline for automatic classification of user concerns. The performance of other machine learning and deep learning models was studied and compared. The results showed that the proposed Bi-LSTM-CNN model achieved the highest accuracy, 84.6%, outperforming the single deep learning models and the traditional machine learning models. Moreover, because the collected data are imbalanced, the impact of a data over-sampling technique on the data imbalance problem was also evaluated. Interestingly, the complex representation induced by the proposed Bi-LSTM-CNN model renders the selected oversampling scheme, e.g., SMOTE, unnecessary in our setting.

Sukanya Sawanoi and Pree Thiengburanathum

Published in Data Science and Engineering (DSE) Record 2024 Vol. 5 No. 1 pp. 29-46

In recent times, there has been a significant increase in global and Thai electricity consumption. This surge has led people to seek ways to save on electricity costs, such as installing solar panels. However, this addresses the symptom rather than the root cause, as people continue to consume electricity at similar levels. Understanding the factors influencing electricity usage is crucial for tackling the root cause, as it enables the reduction of activities or behaviors leading to excessive energy consumption. This research aims to investigate the factors influencing electricity consumption in university dormitories, specifically focusing on electricity bills. Data were collected through a total of 243 surveys, rigorously verified and prepared for analysis. The survey yielded a total of 35 factors, which were then analyzed to identify their relationships with electricity consumption. Sixteen factors were found to correlate with electricity usage based on Spearman's correlation, while 19 factors were identified through mutual information (MI). To simplify the data and reduce complexity, exploratory factor analysis (EFA) was employed, resulting in only 7 common factors from both the Spearman's correlation and MI analyses. Each dataset was used to build predictive models for electricity consumption using five algorithms: SVM, MLP, KNN, DT, and LR. The best-performing baseline model, an SVM trained on the dataset of 19 factors selected by MI, achieved a testing accuracy score of 0.5762. To improve the baseline model, the SVM parameters were tuned, with C set to 1.5, gamma set to 4.699, and the “rbf” kernel. After training and evaluation, the tuned model achieved a testing accuracy score of 0.7353, indicating that parameter tuning improved the predictive performance of the model for real-world scenarios. From the information gathered, it can be concluded that factors influencing electricity consumption include the number of notebooks running the Windows operating system, the duration of computer usage for both learning and gaming, activities such as ironing or using a hair dryer combined with turning on the air conditioner for heat dissipation, knowledge about electricity usage (e.g., choosing electrical appliances labeled with the number 5), and finally, attitudes towards electricity usage.
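As an illustration of the Spearman's correlation screening mentioned in this abstract, the following is a minimal sketch in plain Python. It is not the authors' code: the function names and the sample values are hypothetical, and the abstract does not state which implementation or significance threshold was used.

```python
from math import sqrt

def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Hypothetical survey answers: daily computer hours vs. monthly bill (baht).
computer_hours = [2, 5, 3, 8, 6]
monthly_bill = [300, 650, 420, 900, 700]
rho = spearman(computer_hours, monthly_bill)
```

Because the measure depends only on ranks, it suits ordinal survey answers such as Likert-scale attitudes, which may be one reason it fits this kind of data.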

Pinyawat Rattanayanyong and Karn Patanukhom

Published in Data Science and Engineering (DSE) Record 2024 Vol. 5 No. 1 pp. 47-61

This research primarily aims to develop a system that uses machine learning technology to predict human height, weight, and body mass index (BMI) from a single full-body image. We propose a novel method that utilizes the PiFuHD model to transform 2D images into 3D models, along with processes for feature extraction, feature selection, noise reduction of 3D point clouds, and training and testing of machine learning models. Data were collected from a survey of male and female Thai volunteers aged 18 to 65, without physical disabilities, to evaluate the ability of the models to predict height, weight, and BMI. The effectiveness and accuracy of the machine learning methods were assessed using performance metrics such as mean absolute error (MAE). The results obtained from the testing set showed an MAE of 4.38 centimeters for height prediction, 8.56 kilograms for weight prediction, and 3.03 for body mass index. This research opens avenues for researchers and interested parties to utilize the developed concepts and methods in creating applications or systems capable of efficiently predicting human height, weight, and body mass index from images.
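For readers unfamiliar with the quantities reported above, here is a small sketch of the standard BMI formula (weight in kilograms divided by squared height in meters) and of mean absolute error. The function names and sample numbers are illustrative assumptions and do not come from the study.

```python
def bmi(weight_kg, height_cm):
    """Body mass index: weight (kg) divided by height (m) squared."""
    height_m = height_cm / 100
    return weight_kg / height_m ** 2

def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical heights in centimeters: measured vs. predicted from images.
measured = [170.0, 160.0, 182.0]
estimated = [174.0, 158.0, 179.0]
height_mae = mean_absolute_error(measured, estimated)  # (4 + 2 + 3) / 3
```

MAE is expressed in the unit of the predicted quantity, which is why the abstract reports it in centimeters for height and kilograms for weight.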

Chanwit Chanton, Pairach Piboonrungroj, and Juggapong Natwichai

Published in Data Science and Engineering (DSE) Record 2024 Vol. 5 No. 1 pp. 62-79

This study applies a data quality assessment framework to rapid economic indicators. The outbreak of the COVID-19 pandemic abruptly halted economic activities worldwide, and assessing its economic impact using traditional economic indicators has proven insufficient for urgent analytical and decision-making needs. The advent of Big Data, characterized by its diverse sources and frequent reporting, offers a basis for real-time monitoring. However, a critical challenge is the absence of standardized data quality assessment frameworks. Neglecting data quality assessment while employing Big Data for decision-making may lead to erroneous decisions. This study evaluates three rapid economic indicators: the Apple Mobility Index, the Global Normalcy Index, and Google search trends. Using an existing data quality assessment framework, three data quality dimensions (accuracy, timeliness, and validity) are assessed with Talend Open Studio for Data Quality. Findings reveal the Global Normalcy Index as a promising rapid economic indicator in terms of timeliness and validity. However, accuracy testing yielded inconclusive results due to the index's fluctuations. This highlights the need for a nuanced approach that considers data characteristics. Future work should diversify the data quality dimensions and refine the assessment framework to enhance data quality assessment efficiency.

Jukkrit Mengkaw and Pree Thiengburanathum

Published in Data Science and Engineering (DSE) Record 2024 Vol. 5 No. 1 pp. 80-96

Spelling correction (SC) is used to detect and correct misspelled words. SC is considered a fundamental task in various Natural Language Processing (NLP) applications, such as machine translation, chatbots, and Optical Character Recognition (OCR) systems. Currently, only a limited number of research works address spelling correction in low-resource languages, and fewer still evaluate the efficacy of Thai OCR in processing PDF documents. The issues pertaining to spelling correction in the Thai language involve not only the availability of benchmarking data, but also text extraction from PDFs and the performance of models. In this study, we propose a two-step spelling correction framework, consisting of a detection step and a correction step, applied to an image dataset. Experimental results show that, for error detection, Bi-LSTM achieved the highest performance, with an F1-score of 93.20%. For error correction, Bi-LSTM with an attention mechanism achieved an F1-score of 86.31% and WangchanBERTa achieved an F1-score of 81.36%. However, WangchanBERTa's inference is 40 times faster than that of the attention-based model, and it reduces the word error rate (WER) from 11.99% to 4.51%. The experimental results show that our proposed method effectively detects and corrects spelling errors in Thai text.
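The WER figures quoted above can be read with the usual definition: the token-level edit distance between the reference and the system output, divided by the number of reference tokens. The sketch below assumes the text has already been segmented into word tokens (Thai is written without spaces, so a separate tokenizer would be needed); the function names and example tokens are hypothetical, and the abstract does not specify how the authors computed WER.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def wer(ref_tokens, hyp_tokens):
    """Word error rate: edit distance normalized by reference length."""
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)

# Hypothetical pre-tokenized sentences: one of four words is wrong.
reference = ["w1", "w2", "w3", "w4"]
ocr_output = ["w1", "wX", "w3", "w4"]
rate = wer(reference, ocr_output)
```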

Meena Thanaklung, Waranya Mahanan, Sumalee Sangamuang, and Prompong Sungunnasil

Published in Data Science and Engineering (DSE) Record 2024 Vol. 5 No. 1 pp. 97-110

The layout of a retail establishment plays an essential part in attracting and retaining customers, thereby influencing sales. This study focuses on the issue of customers in hardware stores needing assistance because of poor store organization. A data-driven method based on purchase history and market basket analysis is proposed to improve the store layout. Market basket analysis is a technique that derives association rules from consumer purchasing data, facilitating the identification of products that are frequently bought together. The objective is to minimize the distance customers must travel to collect their items in the hardware store, which should increase customer satisfaction, by reorganizing the store layout using consumer transaction data.
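To make the idea of association rules concrete, the sketch below computes support, confidence, and lift for item pairs from a list of baskets. It is an illustrative simplification (pairs only, no Apriori pruning); the function name, the rule dictionary keys, and the sample hardware items are assumptions, not details from the study.

```python
from collections import Counter
from itertools import combinations

def pair_rules(transactions, min_support=0.0):
    """Return rules x -> y for every item pair meeting the minimum support."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))  # ignore duplicate items within a basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        if support < min_support:
            continue
        for x, y in ((a, b), (b, a)):
            confidence = count / item_counts[x]      # P(y | x)
            lift = confidence / (item_counts[y] / n)  # P(y | x) / P(y)
            rules.append({"if": x, "then": y, "support": support,
                          "confidence": confidence, "lift": lift})
    return rules

# Hypothetical hardware-store baskets.
baskets = [["hammer", "nails"], ["hammer", "nails", "glue"], ["glue"], ["hammer"]]
rules = pair_rules(baskets, min_support=0.5)
```

A lift above 1 indicates that two items appear together more often than independence would predict, which makes them candidates for nearby shelf placement.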

Supanut Thiengburanatam, Waranya Mahanan, Sumalee Sangamuang, and Prompong Sungunnasil

Published in Data Science and Engineering (DSE) Record 2024 Vol. 5 No. 1 pp. 111-122

Through voice data analysis, this research presents a novel deep-learning approach to predict customer age ranges in telesales. Using the rich dataset from Mozilla's 'Common Voice' project, the study extracts vocal features with Librosa and builds a model with TensorFlow and Keras. Based on LSTM layers, the model is trained to recognize patterns correlating vocal attributes with customer age. The model is evaluated with several performance metrics and reaches a validation accuracy of 54.25%, which underlines the potential for enhancing personalized customer service in telesales through voice data analytics. This methodological innovation represents a significant step toward practical applications in customer relationship management with advanced machine learning techniques.

Pantaree Pitivaranun, Dussadee Praserttitipong, and Wijak Srisujjalertwaja

Published in Data Science and Engineering (DSE) Record 2024 Vol. 5 No. 1 pp. 123-140

This study emphasizes the critical role of course learning outcomes in assessing student capability, particularly within the cognitive learning framework provided by Bloom's Taxonomy. In computer science education, aligning these outcomes with curriculum guidelines is important for program quality and relevance. The study introduces machine learning models, including Multinomial Naive Bayes, Logistic Regression, Random Forest, and Extreme Gradient Boosting (XGBoost), to predict the classification of course learning outcomes and visualize it using radar charts. The primary aim is to establish a classification model aligned with the ACM/IEEE undergraduate computer science program curriculum guidelines. Additionally, the study addresses the ambiguity inherent in Bloom's Taxonomy, where the same action verb may span multiple cognitive levels, potentially causing confusion when defining learning objectives across the Familiarity, Usage, and Assessment domains. Through a semi-automated prototype, the study showcases a scalable and adaptable framework for visualizing learning outcome classification results with radar charts. This framework is intended to benefit educators, curriculum developers, and accreditation bodies, enhancing the coherence and effectiveness of computer science undergraduate programs.

Supawit Ongkariyapong, Dussadee Praserttitipong, and Wijak Srisujjalertwaja

Published in Data Science and Engineering (DSE) Record 2024 Vol. 5 No. 1 pp. 141-155

This independent study emphasizes how integrating semantic search and curriculum analysis into course recommendation systems facilitates the alignment of academic education with users' needs. Big data has become increasingly prominent, and its volume continues to grow. Consequently, it is becoming more difficult to accurately deliver data that match user preferences. Implementing recommendation systems that filter data before delivering it to users, through advanced natural language processing and semantic analysis techniques, can help meet their needs effectively. The objective of this independent study is to enhance the recommendation system by using semantic search rather than traditional search. Moreover, course recommendations based on semantic search guide users toward better decision making.

Thanawat Lukuan, Sumalee Sangamuang, Prompong Sungunnasil, and Waranya Mahanan

Published in Data Science and Engineering (DSE) Record 2024 Vol. 5 No. 1 pp. 156-168

Typically, businesses can benefit considerably from understanding user behavior when developing their advertising tactics. Click-through rate (CTR) is one of the most efficient metrics for gaining insight into advertising effectiveness. Moreover, CTR analysis is also used to develop advertising tactics for online marketing. Since people's lifestyles shifted from offline to online during the COVID-19 pandemic, online-to-offline (O2O) commerce has emerged. O2O commerce is an efficient business model that links offline business activities with online platforms, e.g., Facebook ads. In online settings, CTR analysis can predict the probability that an online review or website advertisement will be clicked. First, this paper considers the problem of customer response in online advertising based on CTR prediction. It then proposes a research framework for CTR prediction based on customer response in online advertising using regression models, i.e., linear regression, support vector regression, multi-layer perceptron regression, and random forest regression. Such methods only use certain parameters for learning and ignore temporal variance and changes in user behavior. The experiments evaluate the regression models' accuracy using R-squared. The experimental results are visualized on scatter plots to describe the relationship between the number of predicted likes and actual likes. The R-squared of the random forest regression model is higher than that of the others, so the random forest regression model outperforms the other models in analyzing customer response in a tech company's Facebook ads.
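The R-squared criterion used to compare the regression models can be sketched as follows. This is the standard coefficient of determination written in plain Python; the function name and the like counts are made up for illustration and are not taken from the paper's experiments.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical like counts for four ads.
actual_likes = [10, 20, 30, 40]
predicted_likes = [12, 18, 33, 37]
score = r_squared(actual_likes, predicted_likes)
```

A value of 1 means the predictions match the actual values exactly, while a value of 0 means the model does no better than always predicting the mean.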

Suttawee Lukuan, Prompong Sugunnasil, Sumalee Sangamuang, and Waranya Mahanan

Published in Data Science and Engineering (DSE) Record 2024 Vol. 5 No. 1 pp. 169-206

This work presents the application of complex-valued deep learning to the classification of microbial organisms, highlighting its significance for the rapid pathogen identification that is crucial in healthcare. It explores how complex-valued neural networks compare with traditional real-valued networks, focusing on efficiency, computational resource usage, and accuracy in genome sequencing classification. The research employs theoretical analysis and empirical testing, comparing the performance of complex-valued and real-valued models. Findings indicate that complex-valued CNNs offer advantages in encoding genomic sequences and in processing efficiency. The study's significance lies in its potential to advance pathogen classification methods, offering insights into the practical trade-offs between model complexity and computational efficiency, and contributing to the development of more effective tools for epidemic prevention and control.

Sumana Ganne and Chartchai Doungsa-ard

Published in Data Science and Engineering (DSE) Record 2024 Vol. 5 No. 1 pp. 207-213

The phenomenon of student attrition is a pressing issue for higher education institutions globally. Universities aim to maximize their graduation rates, but maintaining a balance between enrollment and graduation has been challenging for decades. It is critical for universities to understand the rates and reasons behind student attrition, as well as when students are most at risk of dropping out, in order to implement effective strategies to address this issue. Most dropouts occur early in university life, often due to poor academic performance. This independent study aims to use data to identify factors affecting student performance and create a predictive model for their performance in advanced courses. The results will inform institutional policies and strategies to improve faculty-student interactions and increase retention rates. Identifying at-risk students early and creating support pathways are crucial steps toward reducing student attrition.

Onthana Khrueabunma and Arinya Pongwat

Published in Data Science and Engineering (DSE) Record 2024 Vol. 5 No. 1 pp. 214-225

According to the Global Wellness Institute, wellness tourism is expected to grow substantially, with an estimated 21% increase by 2025. Online reviews play a vital role in shaping consumer behavior and impacting purchasing choices. The primary objective of this study is to examine consumer satisfaction in wellness accommodations in Thailand. To achieve this, a novel approach integrating zero-shot modeling is employed. The analysis includes five crucial aspects: hotel service, location, comfort, food, and cleanliness. The study findings indicate that customers prioritize service as the most important characteristic of wellness accommodation. The implications of wellness accommodations are considerable, as they directly impact the outcome of wellness tourism and the features of hotels. By utilizing this knowledge, establishments can enhance their products to more effectively align with guest preferences, resulting in increased customer satisfaction and sustained loyalty.

Siriyaporn Rattana and Pree Thiengburanathum

Published in Data Science and Engineering (DSE) Record 2024 Vol. 5 No. 1 pp. 226-241

Goldfish are popular ornamental fish, especially in Thailand, which is one of the top importers of goldfish globally. At the same time, its fish export industry is also among the largest in the world. However, the exchange of fish from various sources makes them susceptible to diseases such as Cyprinid Herpesvirus 2, a significant disease in goldfish with a mortality rate of 50% to 100%. The purpose of this research is to design a predictive model to identify Cyprinid Herpesvirus 2 infections in goldfish, utilizing data gathered from 13 ornamental fish shops across 5 districts in Chiang Mai province. The dataset was imbalanced, with the number of non-infected samples (PCR=0) being higher than the number of infected samples (PCR=1). Therefore, bootstrapping was used to increase the number of infected samples by 54 to balance the dataset. Subsequently, the Mutual Information method was employed to determine the relationship scores between the features and the infection variable (MI Score). The Fixed Threshold method was employed to select the features most relevant to the infection variable from a total of 46 features, based on MI Scores ranging from 0 to 0.24. Using the K-Nearest Neighbors model with n_neighbors=2, it was found that an MI Score threshold of 0.19 was most suitable for this dataset. The features with an MI Score greater than or equal to 0.19 were the pH value of the water, the water temperature, the total length of the fish (cm), and the length of the fish tank (cm). These variables were then used to train several models, including a Decision Tree classification model, a Random Forest classification model, a Logistic Regression model, and a K-Nearest Neighbors model. The Random Forest Classifier emerged as the most effective model, with training data results of {Accuracy: 99.992757%, Recall: 99.993054%, Precision: 99.992757%, F1 Score: 99.992748%}, and test data results of {Accuracy: 93.333333%, Recall: 91.666667%, Precision: 93.333333%, F1 Score: 93.055556%}.
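The fixed-threshold feature selection described above can be illustrated with a simplified sketch. It assumes every feature has been discretized into categories, whereas the study's features (pH, temperature, lengths) are continuous and would normally need binning or a dedicated estimator; the abstract does not say which estimator was used. The function names and toy columns are hypothetical; only the 0.19 threshold comes from the text.

```python
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # Sum of p(x, y) * log(p(x, y) / (p(x) * p(y))) over observed pairs.
    return sum((c / n) * log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def select_by_threshold(features, target, threshold):
    """Keep the features whose MI score with the target meets the threshold."""
    scores = {name: mutual_information(col, target)
              for name, col in features.items()}
    selected = [name for name, s in scores.items() if s >= threshold]
    return selected, scores

# Toy data: binned pH tracks the infection label, tank color does not.
infected = [0, 0, 1, 1]
toy_features = {
    "ph_bin": ["low", "low", "high", "high"],
    "tank_color": ["blue", "clear", "blue", "clear"],
}
selected, scores = select_by_threshold(toy_features, infected, threshold=0.19)
```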

Arreeya Suwanmosi and Jeerayut Chaijaruwanich

Published in Data Science and Engineering (DSE) Record 2024 Vol. 5 No. 1 pp. 243-260

Online Consumer Reviews (OCR) significantly influence consumers' purchasing decisions for new products. Consequently, companies seek effective ways to analyze consumer opinions. The pet market has seen remarkable growth and is of particular interest, especially within the context of "Pet Humanization", where pets are cared for with love and attention akin to family members. This study aims to analyze customer reviews and discover consumer insights related to high-tech pet products, specifically automatic pet feeders. Using text mining techniques and Natural Language Processing (NLP), 15,558 customer reviews from the e-commerce market were collected to uncover consumer trends and preferences. This research utilizes Latent Dirichlet Allocation (LDA) for topic modeling to analyze customer opinions, with the results visualized using tools such as WordCloud and pyLDAvis. The findings reveal three main topics: Functionality, Performance, and Value and Quality Assessment. The study indicates that consumers prioritize product integration into daily life, reliability and ease of maintenance, and the overall quality and value of the product.

Keywords: Online Customer Review, Text Mining, Topic Modeling, Pet Product, Consumer Insights