Pusit Seephueng and Prompong Sugunnasil

Published in Data Science and Engineering (DSE) Record, 2025, Vol. 6, No. 1, pp. 194-210


Abstract

Hallucination in large language models (LLMs) presents a significant challenge in medical applications, where accuracy and reliability are paramount. This study investigates reasoning hallucinations in LLMs and proposes ensemble methods to mitigate them. Using the False Confidence Test (FCT) dataset from Med-HALT, we evaluate six individual medical LLMs and introduce two ensemble techniques: Weighted Voting and Cascade Ensemble. Our findings indicate that the individual models vary widely in accuracy, with some particularly prone to generating hallucinations. The ensemble methods significantly improve performance: Cascade Ensemble achieves the highest accuracy (30.23%) and pointwise score (24.12), effectively reducing hallucination-induced errors, while Weighted Voting balances efficiency and accuracy but is initially hampered by contributions from unreliable models. These results highlight the potential of structured ensemble techniques to strengthen the robustness of medical LLMs, offering a viable approach to mitigating reasoning hallucinations in clinical decision support systems.
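As a rough illustration of the two ensemble strategies named above, the Python sketch below shows a generic weighted-voting combiner and a simple confidence-gated cascade over multiple-choice answers. The model names, weights, ordering, and confidence threshold are illustrative assumptions for this sketch, not the configuration used in the paper.

```python
from collections import defaultdict
from typing import Callable

def weighted_vote(predictions: dict[str, str], weights: dict[str, float]) -> str:
    """Pick the answer option with the largest total model weight.

    predictions: model name -> chosen option, e.g. "A", "B", "C", "D"
    weights:     model name -> vote weight (e.g. accuracy on a held-out split;
                 an assumed weighting scheme for this sketch)
    """
    totals: dict[str, float] = defaultdict(float)
    for model, option in predictions.items():
        totals[option] += weights.get(model, 0.0)
    return max(totals, key=totals.get)

def cascade(models: list[Callable[[str], tuple[str, float]]],
            question: str,
            threshold: float = 0.7) -> str:
    """Query models in a fixed order and stop at the first sufficiently confident answer.

    Each callable returns (option, confidence); the ordering and the 0.7 threshold
    are hypothetical choices, not values from the paper.
    """
    answer = "None of the above"
    for ask in models:
        answer, confidence = ask(question)
        if confidence >= threshold:
            return answer
    return answer  # fall back to the last model's answer if none is confident

# Hypothetical usage: three of six models choose "C", which wins the weighted vote.
preds = {"m1": "C", "m2": "C", "m3": "A", "m4": "C", "m5": "B", "m6": "A"}
w = {"m1": 0.31, "m2": 0.28, "m3": 0.22, "m4": 0.25, "m5": 0.18, "m6": 0.20}
print(weighted_vote(preds, w))  # -> "C"
```

The design intuition is that weighted voting aggregates all models at once (cheap but sensitive to unreliable voters), whereas a cascade consults models sequentially and can stop early when a trusted model answers confidently.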