Patiphon Ongartittichai and Phasit Charoenkwan

Published in Data Science and Engineering (DSE) Record 2025 Vol. 6 No. 1 pp. 365-380

Abstract

This research compares the efficiency of algorithms for detecting and correcting typos in Thai text, in terms of accuracy and processing time. It focuses on combinations of word segmentation methods and typo detection algorithms, with the goal of finding the most suitable approach for developing Thai natural language processing (Thai NLP) tools. The experiments used three Thai datasets: Thai Toxicity Tweet, Wisesight Sentiment, and ThaiSum, which contain human-written text from social media and news articles. After preprocessing, the text was segmented into words using the newmm, deepcut, and attacut tokenizers. Typos were then checked using the Levenshtein distance, Hunspell, Peter Norvig, and Word2Vec algorithms. The experimental results showed that, among the combinations tested, attacut with the Peter Norvig algorithm gave the best accuracy, while newmm with Hunspell gave the best speed. Each method has its own advantages and disadvantages, so the choice should depend on the objective, such as accuracy or speed. The research also presents a reusable experimental framework, which is useful for developers and researchers who want to evaluate or develop Thai typo detection systems in the future.
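
As a rough illustration of the second stage of such a pipeline, the sketch below flags tokens that are missing from a word list and suggests the closest entry by Levenshtein distance. It is not the authors' implementation: the function names (`levenshtein`, `suggest`, `check_tokens`), the three-word vocabulary, and the `max_distance` threshold are all hypothetical, and the input is assumed to be already segmented into words (for example by one of the tokenizers named above). Distance is counted over Unicode code points, so a Thai vowel or tone mark counts as one character.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) between two strings."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def suggest(token: str, vocabulary: set[str], max_distance: int = 1):
    """Return the closest vocabulary word, or None if nothing is close enough.

    Tokens already in the vocabulary are treated as correct and yield None.
    """
    if token in vocabulary:
        return None
    # Ties are broken alphabetically so the result is deterministic.
    best = min(vocabulary, key=lambda w: (levenshtein(token, w), w))
    return best if levenshtein(token, best) <= max_distance else None


def check_tokens(tokens: list[str], vocabulary: set[str], max_distance: int = 1):
    """Return (token, suggestion) pairs for tokens not found in the vocabulary."""
    return [(t, suggest(t, vocabulary, max_distance))
            for t in tokens if t not in vocabulary]


# Toy vocabulary and pre-segmented input (illustrative assumptions only).
VOCAB = {"สวัสดี", "ขอบคุณ", "ครับ"}
TOKENS = ["สวัสดิ", "ครับ"]  # first token has a wrong final vowel

if __name__ == "__main__":
    print(check_tokens(TOKENS, VOCAB))
```

A linear scan over the vocabulary is only practical for small word lists; this cost per unknown token is one reason the choice of detection algorithm affects processing time as well as accuracy.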