Published in Data Science and Engineering (DSE) Record 2025 Vol. 6 No. 1 pp. 225-243
Abstract
This study presents a Retrieval-Augmented Generation (RAG) framework tailored for Thai legal question answering. The system integrates sparse retrieval (BM25), dense retrieval (SentenceTransformer), and a hybrid approach combining both methods with dynamic weighting. To enhance contextual relevance, a BGE-based re-ranking model was employed. Experiments were conducted on a Thai legal dataset (WangchanX-Legal-ThaiCCL-RAG), and performance was evaluated using Recall@K, Precision@K, MAP, and ROUGE-L. Results showed that while dense retrieval outperformed sparse retrieval in most metrics, the hybrid method, augmented by re-ranking, yielded the highest retrieval accuracy at low K values, with Recall@1 reaching 73.3%. Although this approach introduced additional processing time, the system remained near real-time in response. In the answer generation phase, the model achieved an average ROUGE-L score of 0.4742 (0.6067 when excluding zero-score cases), indicating moderate alignment between generated and reference answers. The findings suggest that hybrid retrieval with re-ranking improves legal information access in Thai, providing a reproducible baseline for future research in legal question answering for low-resource languages.
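As an illustration of the hybrid retrieval and the rank-based metrics named in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes min-max normalization of each retriever's scores and a fixed interpolation weight `alpha`; the abstract does not specify how the dynamic weighting is computed, so a constant weight stands in for it here. The function names (`min_max`, `hybrid_rank`, `recall_at_k`, `precision_at_k`) and the toy scores are hypothetical.

```python
from typing import Dict, List, Set


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; a constant score set maps to 0.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_rank(sparse: Dict[str, float], dense: Dict[str, float],
                alpha: float) -> List[str]:
    """Fuse sparse (e.g. BM25) and dense scores by weighted sum.

    ASSUMPTION: alpha is a fixed weight on the sparse score; the paper's
    dynamic weighting scheme is not described in the abstract.
    """
    s, d = min_max(sparse), min_max(dense)
    fused = {
        doc: alpha * s.get(doc, 0.0) + (1.0 - alpha) * d.get(doc, 0.0)
        for doc in set(s) | set(d)
    }
    # Sort by descending fused score, breaking ties by document id.
    return sorted(fused, key=lambda doc: (-fused[doc], doc))


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant documents found in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k results that are relevant."""
    if k <= 0:
        return 0.0
    return len(set(ranked[:k]) & relevant) / k


# Toy scores for four hypothetical passages (not data from the paper).
sparse_scores = {"a": 10.0, "b": 5.0, "c": 0.0}
dense_scores = {"a": 0.2, "b": 0.9, "d": 0.4}
ranking = hybrid_rank(sparse_scores, dense_scores, alpha=0.5)
```

In a pipeline of the kind the abstract describes, the top candidates from such a fused ranking would then be passed to the re-ranking model before the metrics are computed.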