Da Sun and Jakramate Bootkrajang
Published in Data Science and Engineering (DSE) Record, 2025, Vol. 6, No. 1, pp. 582–606
Abstract
This study empirically tests a counter-intuitive hypothesis: injecting random label noise (Noise Completely at Random, NCAR) into the training data can act as a robust implicit regularizer for classification models. The conventional view holds that label noise degrades model performance; we propose instead that, for high-capacity models trained on limited data, a controlled amount of label noise can prevent overfitting to the training set. We test this hypothesis on 10 binary classification datasets (UCI/OpenML), benchmarking logistic regression, decision trees, and multilayer perceptrons (MLPs) against standard explicit regularizers (Dropout, L1, L2) at noise levels of {0%, 1%, 5%, 10%, 15%}. Our results, validated over 10 random seeds with stratified train/test splits, show that label noise often yields better generalization, especially under low signal-to-noise ratio (SNR) and severe class imbalance. We provide evidence that noise injection drives optimization toward flatter minima of the loss landscape, thereby improving test-set accuracy and F1-score.
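The NCAR noise model described above flips a fixed fraction of training labels uniformly at random, independently of both the features and the true labels. A minimal sketch of such an injection step is shown below; the function name `inject_ncar_noise` and its signature are hypothetical illustrations, not the authors' released code.

```python
import numpy as np

def inject_ncar_noise(y, flip_rate, rng=None):
    """Flip a fraction `flip_rate` of binary labels chosen uniformly at random.

    NCAR: the corruption is independent of the features x and the label y.
    Hypothetical helper for illustration -- not the paper's implementation.
    """
    rng = np.random.default_rng(rng)
    y = np.asarray(y).copy()
    n_flip = int(round(flip_rate * len(y)))          # number of labels to corrupt
    idx = rng.choice(len(y), size=n_flip, replace=False)
    y[idx] = 1 - y[idx]                              # flip 0 <-> 1
    return y

# Example: 10% NCAR noise on 1000 binary labels
y_train = np.zeros(1000, dtype=int)
y_noisy = inject_ncar_noise(y_train, 0.10, rng=0)
print((y_noisy != y_train).mean())  # fraction of flipped labels, here 0.1
```

In an experiment like the one described, this step would be applied only to the training labels (the test labels stay clean), once per noise level and per random seed, before fitting each model.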