Current issue: Vol. 6 2025
Nuttawut Thuayhanruksa and Pree Thiengburanathun
Published in Data Science and Engineering (DSE) Record 2025 Vol. 6 No. 1 pp. 1-30
This paper explores the application of various natural language processing (NLP) models for sentiment analysis of Thai-language financial news articles sourced from Thai financial news websites. The study evaluates machine learning and deep learning models, including Logistic Regression, Bidirectional Long Short-Term Memory (Bi-LSTM), Convolutional Neural Networks (CNN), WangChanBERTa, OpenAI’s GPT-3.5, and OpenThaiGPT. The models' performance is assessed using accuracy, precision, recall, and F1-score. The findings reveal that the fine-tuned WangChanBERTa model achieved the highest accuracy of 0.84 on the testing set, demonstrating its superior ability in classifying sentiment in Thai financial news. The Bi-LSTM and CNN models also performed well, with testing accuracies of 0.781 and 0.791, respectively. In contrast, OpenAI’s GPT-3.5 and OpenThaiGPT, which lacked fine-tuning and optimized prompts due to computational constraints, exhibited practical limitations in resource-constrained settings.
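The four evaluation metrics named in this abstract can be illustrated with a minimal sketch. It assumes macro-averaging over the sentiment classes, which the abstract does not specify; the function name `classification_report` and the label strings are hypothetical.

```python
from collections import Counter


def classification_report(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1.

    Macro-averaging is an assumption; the abstract does not say how
    the per-class scores were aggregated.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this class, but it was wrong
            fn[truth] += 1  # missed the true class

    precisions, recalls, f1s = [], [], []
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        prec = tp[label] / predicted if predicted else 0.0
        rec = tp[label] / actual if actual else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


report = classification_report(
    ["pos", "pos", "neg", "neg"],
    ["pos", "neg", "neg", "neg"],
)
print(report)
```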
Kitichart Nukaew and Arinya Pongwat
Published in Data Science and Engineering (DSE) Record 2025 Vol. 6 No. 1 pp. 31-55
Massive Open Online Courses (MOOCs) have seen continuous growth in popularity and rapid expansion. In the instructional design process, receiving feedback from learners is crucial, as it helps tailor the content to better meet learners' needs. The application of NLP models in analyzing learners' feedback is an effective approach for extracting insights from a large volume of comments related to the courses. These models can categorize feedback into three distinct categories: course, instructors, and assessments. Additionally, the models can predict the sentiment of the feedback, determining whether it is positive or negative. In developing these models, semi-supervised learning techniques have been employed to address the challenge of limited data availability. Experimental results indicate that, for feedback categorization, a GRU model combined with tri-training with disagreement yields the highest prediction accuracy. Conversely, for sentiment analysis, a GRU model combined with tri-training produces the best outcomes.
Kamonwit Makkaphan, Prompong Sungunnasil, Waranya Mahanan, and Sumalee Sangamuang
Published in Data Science and Engineering (DSE) Record 2025 Vol. 6 No. 1 pp. 56-90
Online customer reviews represent a valuable source of information for businesses seeking to understand consumer perceptions and preferences. This paper introduces a framework for competitive positioning analysis by leveraging these online reviews and sentiment analysis. The framework employs Natural Language Processing (NLP) techniques in three phases: 1) identifying key themes and topics from reviews using Latent Dirichlet Allocation (LDA); 2) extracting product features through zero-shot text classification; and 3) visualizing competitive positioning via Net Promoter Score (NPS) and sentiment analysis plots. A case study on Amazon’s laptop market revealed a moderate correlation (58.8%) between NPS and sentiment analysis, suggesting potential limitations in feature classification accuracy. While the study demonstrates the value of NLP for analyzing online reviews, it also emphasizes the need for improved feature recognition methods and more robust datasets to enhance the precision of competitive positioning analysis.
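For readers unfamiliar with the Net Promoter Score used in the third phase, the following is a minimal sketch of the standard definition: the percentage of promoters minus the percentage of detractors. It assumes ratings on the conventional 0 to 10 scale; how the study mapped Amazon review ratings onto promoters and detractors is not stated in the abstract, so the thresholds here are the textbook ones, not necessarily the authors'.

```python
def net_promoter_score(ratings):
    """Standard NPS on a 0-10 scale: % promoters (9-10) minus % detractors (0-6).

    The 0-10 scale and thresholds are the conventional definition and
    are an assumption with respect to the study's review data.
    """
    if not ratings:
        raise ValueError("ratings must not be empty")
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100.0 * (promoters - detractors) / len(ratings)


print(net_promoter_score([10, 10, 9, 8]))
```

The score ranges from -100 (all detractors) to 100 (all promoters); passives (7 to 8) count only in the denominator.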
Manaschai Aonon and Phasit Charoenkwan
Published in Data Science and Engineering (DSE) Record 2025 Vol. 6 No. 1 pp. 91-130
This research presents a comprehensive framework for analyzing customer behavior in walking street markets using advanced person re-identification techniques. We deployed dual CCTV cameras at strategic points along a 200-meter section of a walking street market in Chiang Mai, Thailand, to track customer movements and analyze behavioral patterns. Our methodology comprises three main components: (1) a novel segmentation-enhanced multi-region feature extraction framework combining YOLOv11 segmentation with Swin Transformer, (2) a robust person re-identification approach with PCA-enhanced feature matching, and (3) detailed customer behavior analysis based on movement patterns, speeds, and interactions. Our feature extraction method achieves 92.31% Rank-1 accuracy and 59.62% mAP, significantly outperforming traditional approaches. Using the re-identification results, we identify five distinct customer behavior types (Goal-Oriented, Browsing, Lingering, Focused, and Brief Visitors) with actionable insights for market management. This research contributes both methodological advances in person re-identification and practical applications for retail analytics in dynamic public spaces.
Noratap Muangudom and Karn Patanukhom
Published in Data Science and Engineering (DSE) Record 2025 Vol. 6 No. 1 pp. 131-167
In recent years, Large Language Models (LLMs) have demonstrated significant potential in various applications, including healthcare, education, and customer support. This study investigates the integration of LLMs into group chat environments to facilitate medical counseling between doctors and heart disease patients. Traditional chatbot systems primarily operate in one-on-one interactions, which can lead to redundant queries and inefficiencies in medical consultations. This research introduces a novel chatbot system designed for group chat settings, allowing multiple users and medical professionals to interact seamlessly within the same conversation. The chatbot system retrieves medical knowledge from a predefined document database using an information retrieval model to ensure responses are relevant and accurate. A verification mechanism is integrated, enabling doctors to review and validate chatbot-generated responses before they are presented to patients. The study employs hypothesis testing and real-world evaluations to measure chatbot performance across three key dimensions: response accuracy, response speed, and user satisfaction. Experimental results indicate that group chat environments improve communication efficiency, reduce repetitive queries, and enhance patient engagement compared to traditional one-on-one chatbot interactions. Furthermore, user feedback highlights the strengths and limitations of the proposed system. While the chatbot successfully provides relevant medical information, challenges remain in ensuring response accuracy, reducing response time, and improving contextual understanding in group conversations. Future work will focus on refining chatbot algorithms, enhancing natural language processing capabilities, and expanding the medical knowledge base to support a wider range of healthcare scenarios.
This research underscores the potential of LLMs in transforming digital healthcare support, making medical consultations more efficient, accessible, and collaborative.
Xiaofan Zhou and Jakramate Bootkrajang
Published in Data Science and Engineering (DSE) Record 2025 Vol. 6 No. 1 pp. 168-184
This study presents a robust framework for automated extraction and performance evaluation of video interaction metrics across major Chinese social media platforms (Bilibili, Douyin, Xiaohongshu) characterized by heterogeneous interface designs. Leveraging a synergistic combination of YOLOv8 object detection and Optical Character Recognition (OCR), the proposed system addresses platform-specific challenges in identifying engagement indicators (likes, comments, shares, views, etc.) through icon localization and numerical extraction. A dataset of 250 annotated screenshots encompassing diverse interface variations was utilized to train and validate the deep learning model, achieving a mean average precision (mAP@50) of 99.5% across all interaction categories. The extracted metrics were standardized and validated against third-party Key Performance Indicators (KPIs) from commercial analytics platforms (Pugongying, Huahuo, and Xingtu), demonstrating 98% alignment in performance classification. Hyperparameter optimization and spatial pyramid pooling enhancements enabled cross-platform generalization, with error analysis revealing OCR misinterpretations (e.g., omission of the unit "万" (10k)) as the primary accuracy limitation. The framework advances social media analytics by enabling scalable, platform-agnostic performance benchmarking, offering practical value for content optimization, advertising compliance verification, and engagement trend analysis in the evolving short video ecosystem.
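The unit-omission error described in this abstract is easier to see with a small example. The sketch below converts an OCR-extracted count string into an integer and shows how dropping "万" shrinks the value by a factor of 10,000. The function name `parse_count`, the regular expression, and the inclusion of "亿" are illustrative assumptions, not the authors' implementation.

```python
import re

# Multipliers for Chinese magnitude suffixes (万 = 10^4, 亿 = 10^8).
UNITS = {"": 1, "万": 10_000, "亿": 100_000_000}

# A number with an optional decimal part, followed by an optional unit.
COUNT_PATTERN = re.compile(r"\s*(\d+(?:\.\d+)?)\s*(万|亿)?\s*")


def parse_count(text):
    """Convert a string such as '1.2万' to an integer, or None if unparseable."""
    match = COUNT_PATTERN.fullmatch(text)
    if match is None:
        return None
    number, unit = match.group(1), match.group(2) or ""
    return round(float(number) * UNITS[unit])


# If OCR drops the unit, "1.2万" is read as "1.2" and the metric collapses.
print(parse_count("1.2万"), parse_count("1.2"))
```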
Natchayar Saosuwan and Karn Patanukhom
Published in Data Science and Engineering (DSE) Record 2025 Vol. 6 No. 1 pp. 184-193
The healthcare sector is becoming more competitive, requiring businesses to understand consumer needs through sentiment analysis of feedback. This study analyzed feedback from Sriphat Medical Center to assess satisfaction (satisfied/dissatisfied) across eight aspects, including service process, staff behavior, and medical expertise. Using Natural Language Processing (NLP) and machine learning with Bag-of-Words and the Term Frequency-Inverse Document Frequency (TF-IDF) techniques, the best-performing model was a linear SVM with 95.8% accuracy in satisfaction classification and 77.4% in aspect classification.
Pusit Seephueng and Prompong Sugunnasil
Published in Data Science and Engineering (DSE) Record 2025 Vol. 6 No. 1 pp. 194-210
Hallucination in large language models (LLMs) presents a significant challenge in medical applications, where accuracy and reliability are paramount. This study investigates reasoning hallucinations in LLMs and proposes ensemble methods to mitigate their occurrence. Using the False Confidence Test (FCT) dataset from Med-HALT, we evaluate six individual medical LLMs and introduce two ensemble techniques: Weighted Voting and Cascade Ensemble. Our findings indicate that individual models exhibit varied accuracy, with some prone to generating hallucinations. The ensemble methods significantly improve performance, with Cascade Ensemble achieving the highest accuracy (30.23%) and pointwise score (24.12), effectively reducing hallucination-induced errors. While Weighted Voting provides a balance between efficiency and accuracy, it initially suffers from unreliable model contributions. These results highlight the potential of structured ensemble techniques to enhance the robustness of medical LLMs, offering a viable approach for mitigating reasoning hallucinations in clinical decision support systems.
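The two ensemble ideas named in this abstract can be sketched generically as follows. The abstract does not give the weighting scheme, model ordering, or confidence threshold, so everything here (the function names, the per-model weights, and the 0.7 threshold) is a hypothetical illustration of the general techniques rather than the authors' configuration.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Pick the answer with the largest total weight.

    predictions: {model_name: answer}; weights: {model_name: weight}.
    Weights could, for example, come from validation accuracy (an assumption).
    """
    totals = defaultdict(float)
    for model, answer in predictions.items():
        totals[answer] += weights.get(model, 0.0)
    return max(totals, key=totals.get)


def cascade(stages, threshold=0.7):
    """Return the first answer whose confidence clears the threshold.

    stages: ordered list of (answer, confidence) pairs, one per model.
    Falls back to the last stage if no model is confident enough.
    """
    for answer, confidence in stages:
        if confidence >= threshold:
            return answer
    return stages[-1][0]


votes = {"model_a": "B", "model_b": "C", "model_c": "C"}
weights = {"model_a": 0.9, "model_b": 0.3, "model_c": 0.4}
print(weighted_vote(votes, weights))
print(cascade([("A", 0.4), ("B", 0.8), ("C", 0.9)]))
```

In this toy case a single heavily weighted model outvotes two weaker ones, which shows why poorly calibrated weights can let unreliable contributions dominate a weighted vote.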
Published in Data Science and Engineering (DSE) Record 2025 Vol. 6 No. 1 pp. 211-224
Currently, there are over 921 listed companies on the Stock Exchange of Thailand, with a total market capitalization of 17,430,644.71 billion THB as of the end of 2023. These listed companies can issue bonds (debt securities) for public sale, providing Thai investors with diverse financial investment options. In 2023, more than 4,753,851 billion THB was raised through initial bond offerings. Despite stringent oversight by the Securities and Exchange Commission of Thailand (SEC), some companies have faced financial failures, leading to delisting and defaults on bond payments, which have significantly harmed numerous investors. Most companies that defaulted on bond payments lacked credit ratings from credit rating agencies, which are crucial for investors to assess the risk of financial failure.
As of August 2024, only 175 listed companies on the Stock Exchange of Thailand had received credit ratings from Tris Rating Co., Ltd. This highlights the importance of analyzing and estimating credit ratings for listed companies based on their financial statements to support Thai investors in evaluating financial investments. The findings of this research aim to provide a valuable tool for investors in analyzing investments in financial instruments issued by listed companies. The results of the study are presented in tabular form, including machine learning model performance and training parameters.
Pimchanok Promwang and Pruet Boonma
Published in Data Science and Engineering (DSE) Record 2025 Vol. 6 No. 1 pp. 225-243
This study presents a Retrieval-Augmented Generation (RAG) framework tailored for Thai legal question answering. The system integrates sparse retrieval (BM25), dense retrieval (SentenceTransformer), and a hybrid approach combining both methods with dynamic weighting. To enhance contextual relevance, a BGE-based re-ranking model was employed. Experiments were conducted on a Thai legal dataset (WangchanX-Legal-ThaiCCL-RAG), and performance was evaluated using Recall@K, Precision@K, MAP, and ROUGE-L. Results showed that while dense retrieval outperformed sparse retrieval in most metrics, the hybrid method, augmented by re-ranking, yielded the highest retrieval accuracy at low K values, with Recall@1 reaching 73.3%. Although this approach introduced additional processing time, the system remained near real-time in response. In the answer generation phase, the model achieved an average ROUGE-L score of 0.4742 (0.6067 when excluding zero-score cases), indicating moderate alignment between generated and reference answers. The findings suggest that hybrid retrieval with re-ranking improves legal information access in Thai, providing a reproducible baseline for future research in legal question answering for low-resource languages.
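A hybrid of sparse and dense retrieval is commonly implemented by normalizing the two score sets and taking a weighted sum. The sketch below shows that idea together with Recall@K. It assumes min-max normalization and a fixed weight `alpha`; the study's "dynamic weighting" rule is not described in the abstract, and the document IDs and scores are made up for illustration.

```python
def min_max(scores):
    """Rescale a non-empty {doc_id: score} mapping to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(sparse, dense, alpha=0.5):
    """Rank documents by alpha * dense + (1 - alpha) * sparse.

    alpha is fixed here for simplicity (an assumption); ties are
    broken by document ID so the ordering is deterministic.
    """
    s, d = min_max(sparse), min_max(dense)
    docs = set(s) | set(d)
    fused = {
        doc: alpha * d.get(doc, 0.0) + (1 - alpha) * s.get(doc, 0.0)
        for doc in docs
    }
    return sorted(docs, key=lambda doc: (-fused[doc], doc))


def recall_at_k(ranked, relevant, k):
    """Fraction of relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


bm25_scores = {"d1": 12.0, "d2": 3.0, "d3": 7.5}   # hypothetical BM25 scores
dense_scores = {"d1": 0.2, "d2": 0.9, "d3": 0.8}   # hypothetical cosine scores
ranking = hybrid_rank(bm25_scores, dense_scores)
print(ranking, recall_at_k(ranking, {"d3"}, k=1))
```

A re-ranking model would then rescore only the top few fused candidates, which is where the extra processing time mentioned in the abstract would arise.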
Chattrapat Poonsin and Pruet Boonma
Published in Data Science and Engineering (DSE) Record 2025 Vol. 6 No. 1 pp. 246-272
Customer segmentation is a vital component of data-driven marketing, enabling businesses to understand customer behavior and enhance strategic decision-making. This study explores an efficient segmentation approach using Recency, Frequency, and Monetary (RFM) analysis, combined with multiple clustering techniques, to identify optimal customer groups. Four clustering approaches were implemented and compared: centroid-based (K-Means), density-based (DBSCAN), distribution-based (Gaussian Mixture Model, GMM), and hierarchical (Agglomerative) clustering. Each algorithm was evaluated on its ability to form well-separated and meaningful clusters, with silhouette score as the primary performance metric. The dataset was standardized before applying the clustering models to ensure comparability. The results reveal that different algorithms exhibit varying strengths depending on the underlying data structure. K-Means demonstrated efficiency in partitioning customers into distinct groups but struggled with non-spherical clusters. DBSCAN effectively identified outliers but was sensitive to parameter tuning. GMM provided flexibility by modeling cluster probability distributions, making it suitable for overlapping customer behaviors. Hierarchical clustering offered an interpretable structure but required significant computational resources for large datasets. Overall, the findings highlight the importance of selecting an appropriate clustering technique for customer segmentation based on data characteristics. This study provides valuable insights for businesses aiming to develop marketing strategies through data-driven segmentation.
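The RFM features and the standardization step mentioned in this abstract can be sketched as follows. The transaction layout `(customer_id, purchase_date, amount)`, the reference date, and the function names are assumptions made for illustration; z-score standardization with the population standard deviation is one common choice and may differ from the study's.

```python
from datetime import date
from statistics import mean, pstdev


def rfm_table(transactions, today):
    """Build {customer: (recency_days, frequency, monetary)}.

    transactions: iterable of (customer_id, purchase_date, amount).
    """
    table = {}
    for customer, day, amount in transactions:
        last, freq, money = table.get(customer, (None, 0, 0.0))
        if last is None or day > last:
            last = day  # keep the most recent purchase date
        table[customer] = (last, freq + 1, money + amount)
    return {
        customer: ((today - last).days, freq, money)
        for customer, (last, freq, money) in table.items()
    }


def standardize(rfm):
    """Z-score each RFM column so no feature dominates distance-based clustering."""
    customers = list(rfm)
    columns = list(zip(*(rfm[c] for c in customers)))
    stats = [(mean(col), pstdev(col)) for col in columns]
    return {
        c: tuple((v - mu) / sd if sd else 0.0 for v, (mu, sd) in zip(rfm[c], stats))
        for c in customers
    }


sales = [
    ("a", date(2025, 1, 1), 100.0),
    ("a", date(2025, 1, 10), 50.0),
    ("b", date(2025, 1, 5), 20.0),
]
rfm = rfm_table(sales, today=date(2025, 1, 11))
print(rfm)
print(standardize(rfm))
```

The standardized tuples are what a clustering algorithm such as K-Means would consume; without this step the monetary column, with its larger numeric range, would dominate Euclidean distances.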
Poompatai Muennamnor and Pruet Boonma
Published in Data Science and Engineering (DSE) Record 2025 Vol. 6 No. 1 pp. 273-295
Refrigerant leaks from cooling systems can harm the environment and cost businesses money. Current ways to find leaks can be slow, expensive, and not always accurate. This project uses machine learning to create a better way to detect refrigerant leaks by listening to the sounds they make. The goal is to develop a system that can automatically and cheaply detect leaks early on, reducing environmental damage and saving businesses money. The system uses a microphone to record sounds, then a computer program analyzes the sounds to identify leaks. By using sound analysis, the system can tell the difference between normal sounds and the sounds of a refrigerant leak. This helps catch leaks early, lowers maintenance costs, and reduces greenhouse gas emissions.
Natchar Pongsri and Nasi Tantitharanukul
Published in Data Science and Engineering (DSE) Record 2025 Vol. 6 No. 1 pp. 295-315
This research presents a predictive model for determining the next-day price direction of EUR/USD in the Binary Options market. The study utilizes technical indicators and price data over a 10,000-day span, collected from TradingView, and applies machine learning techniques, particularly an ensemble classification framework combining CNN, LSTM, SVM, and XGBoost models. A total of 23 features were engineered from candlestick data and popular indicators such as RSI, MACD, ATR, and EMA. Statistical analysis ensured data quality and distribution symmetry. Model performance was evaluated using accuracy, F1 score, and ROC-AUC metrics. The resulting ensemble model outperformed individual models in predictive accuracy and stability. This research contributes to the development of automated trading systems and serves as a foundation for further work in financial time series forecasting using machine learning.
Kridsanaphon Suksan and Paskorn Champrasert
Published in Data Science and Engineering (DSE) Record 2025 Vol. 6 No. 1 pp. 316-338
This independent study presents a system for object detection and localization using aerial imagery captured by drones in search and rescue operations. Generally, a higher drone altitude gives greater area coverage but reduces detection accuracy, while a lower altitude improves accuracy but requires more search time. Because guidance on the optimal altitude is lacking, this study examines detection performance at different flight altitudes to enhance operational efficiency. Since altitude impacts both image quality and detection accuracy, image resolution is also examined as a key factor in system performance. The study evaluates the YOLOv11 algorithm for detection in aerial images, using clothing as a human proxy to address ethical and data collection constraints. Performance was assessed using Mean Average Precision, Precision, Recall, and Time, along with derived metrics such as Efficiency Score and Missing Rate. The geolocation deviation is also measured. Findings indicate that increasing altitude reduces model performance, but that this can be compensated for by using a higher-resolution image. For missions requiring high detection accuracy, the lowest-altitude flights yield the best results. In contrast, more time-constrained operations can benefit from a higher altitude but need more computational resources. Overall, the study suggests a flight altitude of 40 meters with 1080×720 resolution as the most efficient configuration. At 40 meters, detection accuracy decreases slightly, but area coverage and computation speed improve roughly threefold, with the top Efficiency Score and the lowest Missing Rate.
Pattadon Thepkan and Jakarin Chawachat
Published in Data Science and Engineering (DSE) Record 2025 Vol. 6 No. 1 pp. 339-356
This study aims to develop a system for estimating the portion size and energy of Thai food from images using deep learning techniques. The proposed system supports dietitians and health-conscious individuals by enabling automated and accurate food intake assessment. The system consists of two main components: (1) object detection using YOLOv11 to simultaneously identify food items and reference coins in an image, and (2) food weight estimation using ResNet101, with coin objects serving as physical references for real-world scaling. The estimated food weight is then used to calculate nutritional values based on a Thai food database. Experimental results demonstrate that annotating object boundaries with Smart Polygon significantly improves model accuracy and stability compared to the traditional Bounding Box method, yielding higher Precision, Recall, F1-score, and mAP. Among the tested models, ResNet101 with coin references achieved the best weight estimation performance, with a Mean Absolute Error (MAE) of 71.12 grams and Root Mean Squared Error (RMSE) of 91.56 grams. This system is suitable for real-world applications in hospitals, restaurants, and personal nutrition tracking.
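The two error measures reported for weight estimation are defined as MAE = mean(|y - ŷ|) and RMSE = sqrt(mean((y - ŷ)²)). A minimal sketch follows; the gram values in the example are invented for illustration and are not from the study.

```python
import math


def mae(y_true, y_pred):
    """Mean Absolute Error: average magnitude of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Squared Error: penalizes large errors more than MAE does."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


actual_grams = [100.0, 200.0, 300.0]      # hypothetical ground-truth weights
estimated_grams = [110.0, 190.0, 330.0]   # hypothetical model estimates
print(mae(actual_grams, estimated_grams), rmse(actual_grams, estimated_grams))
```

RMSE is never smaller than MAE on the same data, so a gap between the two, as in the reported 71.12 g versus 91.56 g, indicates that some estimates are off by considerably more than the typical error.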
Poosana Thassanavisut and Sakgasit Ramingwong
Published in Data Science and Engineering (DSE) Record 2025 Vol. 6 No. 1 pp. 357-364
This research develops an English learning platform that utilizes a combination of three machine learning models for grammatical topic classification in language assessment. English serves as one of the most crucial languages in today's world, particularly in education, work, and communication. However, learning and developing English language skills remains a significant challenge for many learners, especially in countries where English is not the primary language. Firstly, this independent study aims to address these challenges by developing a comprehensive English learning platform that incorporates advanced machine learning techniques. Secondly, the platform employs three distinct machine learning approaches: facebook/bart-large-mnli, Logistic Regression, and DeBERTa for automated grammatical topic assignment to examination questions. Finally, the empirical results demonstrate that the developed platform effectively enables users to assess their English proficiency according to the CEFR (Common European Framework of Reference for Languages) standards, while providing appropriate skill evaluation across various grammatical topics.
Patiphon Ongartittichai and Phasit Charoenkwan
Published in Data Science and Engineering (DSE) Record 2025 Vol. 6 No. 1 pp. 365-380
This research aims to compare the efficiency of algorithms for detecting and correcting typos in Thai, considering accuracy and processing time, and in particular the combination of word segmentation methods and typo detection algorithms, to find the most suitable approach for developing Thai natural language processing (Thai NLP) tools. The data used in the experiment consisted of three Thai datasets: Thai Toxicity Tweet, Wisesight Sentiment, and ThaiSum, which are human-generated texts from both social media and news articles. The data was prepared, and word segmentation was performed using the newmm, deepcut, and attacut tokenizers. Typos were then checked using the Levenshtein Distance, Hunspell, Peter Norvig, and Word2Vec algorithms. The experimental results showed that the combination of attacut and Peter Norvig gave the best results in terms of accuracy, while newmm and Hunspell gave the best results in terms of speed. Each method has its own advantages and disadvantages, so the choice should depend on the objective, such as accuracy or speed. In addition, the research presents a reusable experimental framework, which is useful for developers and researchers who want to evaluate or develop Thai typo detection systems in the future.
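Of the four typo detection approaches listed, Levenshtein distance is the simplest to show in a few lines. The sketch below computes the edit distance with the standard dynamic-programming recurrence and suggests the closest vocabulary word. The `suggest` helper, its `max_distance` cutoff, and the English sample words are illustrative assumptions; the function compares Unicode code points, so for Thai text it would treat each vowel or tone mark as a separate character.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def suggest(word, vocabulary, max_distance=2):
    """Return the closest vocabulary word, or None if nothing is close enough.

    Ties are broken alphabetically; a real corrector would usually
    prefer the more frequent word instead (as Norvig's approach does).
    """
    best = min(vocabulary, key=lambda v: (levenshtein(word, v), v))
    return best if levenshtein(word, best) <= max_distance else None


print(levenshtein("kitten", "sitting"))
print(suggest("helo", ["hello", "help", "world"]))
```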