Published in Data Science and Engineering (DSE) Record 2024 Vol. 5 No. 1 pp. 226-241
Abstract
Goldfish are popular ornamental fish, especially in Thailand, which is one of the world's top importers of goldfish. At the same time, its fish export industry is also among the top in the world. However, the exchange of fish from various sources makes them susceptible to diseases such as Cyprinid Herpesvirus 2, a significant disease in goldfish with a mortality rate of 50% to 100%. The purpose of this research is to design a predictive model to identify Cyprinid Herpesvirus 2 infections in goldfish, using data gathered from 13 ornamental fish shops across 5 districts in Chiang Mai province. The dataset was imbalanced, with more non-infected samples (PCR=0) than infected samples (PCR=1). Therefore, bootstrapping was used to increase the number of infected samples by 54 to balance the dataset. Subsequently, the Mutual Information method was employed to score the relationship between each feature and the infection variable (MI Score). The Fixed Threshold method was then used to select the features most relevant to the infection variable from a total of 46 features, based on MI Scores ranging from 0 to 0.24. Using the K-Nearest Neighbors model with n_neighbors=2, an MI Score threshold of 0.19 was found to be most suitable for this dataset. The features with an MI Score greater than or equal to 0.19 were the pH value of the water, the water temperature, the total length of the fish (cm), and the length of the fish tank (cm). These variables were then used to train several models: a Decision Tree classifier, a Random Forest classifier, a Logistic Regression model, and a K-Nearest Neighbors model. The Random Forest classifier emerged as the most effective model, with training data results of {Accuracy: 99.992757%, Recall: 99.993054%, Precision: 99.992757%, F1 Score: 99.992748%} and test data results of {Accuracy: 93.333333%, Recall: 91.666667%, Precision: 93.333333%, F1 Score: 93.055556%}.
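The pipeline summarized above (bootstrap oversampling, Mutual Information scoring, a fixed-threshold sweep judged by a 2-nearest-neighbor model, then a comparison of four classifiers) can be illustrated with a minimal scikit-learn sketch. This is not the authors' code. The data are synthetic, the column names, the 0.01 threshold step, the use of accuracy to pick the threshold, the train/held-out split, and the choice to oversample only the training split are all assumptions made for illustration, and the number of bootstrapped rows is computed from the synthetic data rather than fixed at 54.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

SEED = 0


def oversample_infected(df, label, n_extra):
    """Bootstrap: draw n_extra infected rows with replacement and append them."""
    infected = df[df[label] == 1]
    extra = resample(infected, replace=True, n_samples=n_extra, random_state=SEED)
    return pd.concat([df, extra], ignore_index=True)


def mi_scores(X, y):
    """Mutual information between each feature and the infection label."""
    return pd.Series(mutual_info_classif(X, y, random_state=SEED), index=X.columns)


def pick_threshold(scores, X_tr, y_tr, X_ho, y_ho, thresholds):
    """Keep features with MI >= t, fit a 2-NN model, return the best-scoring t."""
    best_t, best_acc = None, -1.0
    for t in thresholds:
        cols = scores.index[scores >= t]
        if len(cols) == 0:
            continue  # no feature survives this threshold
        knn = KNeighborsClassifier(n_neighbors=2).fit(X_tr[cols], y_tr)
        acc = accuracy_score(y_ho, knn.predict(X_ho[cols]))
        if acc > best_acc:  # ties keep the lowest threshold
            best_t, best_acc = t, acc
    return best_t


# Synthetic stand-in data (hypothetical columns; not the study's dataset).
rng = np.random.default_rng(SEED)
n = 300
data = pd.DataFrame({
    "ph": rng.normal(7.5, 0.5, n),
    "water_temp_c": rng.normal(27.0, 2.0, n),
    "total_length_cm": rng.normal(8.0, 2.0, n),
    "tank_length_cm": rng.normal(60.0, 15.0, n),
    "noise": rng.normal(0.0, 1.0, n),
})
# Arbitrary labeling rule, chosen only so that PCR=1 is the minority class.
data["PCR"] = ((data["ph"] < 7.2) & (data["water_temp_c"] < 27.0)).astype(int)

train, held_out = train_test_split(
    data, test_size=0.2, stratify=data["PCR"], random_state=SEED
)

# Balance the training split by bootstrapping the infected class.
n_extra = int((train["PCR"] == 0).sum() - (train["PCR"] == 1).sum())
train_bal = oversample_infected(train, "PCR", n_extra)

features = [c for c in data.columns if c != "PCR"]
X_tr, y_tr = train_bal[features], train_bal["PCR"]
X_ho, y_ho = held_out[features], held_out["PCR"]

# Score features, sweep thresholds 0.00 .. 0.24, keep the selected columns.
scores = mi_scores(X_tr, y_tr)
thresholds = np.arange(0, 25) / 100
best_t = pick_threshold(scores, X_tr, y_tr, X_ho, y_ho, thresholds)
selected = scores.index[scores >= best_t].tolist()

# Compare the four model families named in the abstract on the selected features.
models = {
    "decision_tree": DecisionTreeClassifier(random_state=SEED),
    "random_forest": RandomForestClassifier(random_state=SEED),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=2),
}
results = {}
for name, model in models.items():
    pred = model.fit(X_tr[selected], y_tr).predict(X_ho[selected])
    results[name] = {
        "accuracy": accuracy_score(y_ho, pred),
        "recall": recall_score(y_ho, pred, zero_division=0),
        "precision": precision_score(y_ho, pred, zero_division=0),
        "f1": f1_score(y_ho, pred, zero_division=0),
    }

print(pd.DataFrame(results).T.round(3))
```

In this sketch the held-out split serves both to pick the threshold and to report the final metrics, which keeps the example short but tends to give optimistic scores. A separate validation split for the threshold sweep would be the more cautious design.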