Published in Data Science and Engineering (DSE) Record 2024 Vol. 5 No. 1 pp. 80-96
Abstract
Spelling correction (SC) detects and corrects misspelled words. It is a fundamental task in many Natural Language Processing (NLP) applications, such as machine translation, chatbots, and Optical Character Recognition (OCR) systems. Few studies currently address spelling correction in low-resource languages, particularly studies that evaluate the efficacy of Thai OCR on PDF documents. The challenges of spelling correction in Thai involve not only the availability of benchmarking data, but also text extraction from PDFs and model performance. In this study, we propose a two-step spelling correction framework, consisting of an error detection step and an error correction step, applied to text obtained from an image dataset. Experimental results show that, for error detection, Bi-LSTM achieved the highest performance, with an F1-score of 93.20%. For error correction, Bi-LSTM with an attention mechanism achieved an F1-score of 86.31%, and WangchanBERTa achieved an F1-score of 81.36%. However, WangchanBERTa's inference is 40 times faster than that of the attention-based model, and it reduces the word error rate (WER) from 11.99% to 4.51%. These results indicate that the proposed method effectively detects and corrects spelling errors in Thai text.
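For readers unfamiliar with the WER metric reported above, the following is a minimal sketch of how it is commonly computed: the word-level edit distance between a reference and a system output, divided by the reference length. It assumes the text has already been segmented into word tokens (Thai is written without spaces between words, so a tokenizer is needed and is not shown here). The function names `edit_distance` and `wer` are illustrative and are not taken from the study, whose exact evaluation code may differ.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance: the minimum number of
    substitutions, insertions, and deletions turning ref into hyp."""
    # prev[j] holds the distance between the first i-1 reference tokens
    # and the first j hypothesis tokens.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference tokens
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Word error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must contain at least one token")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical, pre-tokenized example (not data from the study):
reference = ["ฉัน", "กิน", "ข้าว"]
ocr_output = ["ฉัน", "กน", "ข้าว"]  # one misspelled token
print(f"WER = {wer(reference, ocr_output):.2%}")
```

In this hypothetical example one of the three reference tokens is wrong, so the WER is 1/3. Because the result depends on how the text is segmented into words, the same tokenizer should be applied to both the reference and the system output.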