## JongHyuk Lee*, Mihye Kim**, Daehak Kim*, and Joon-Min Gil**

Table 2.

Variable | Data cleaning | Cases used |
---|---|---|
Dependent variable | | |
Registration | Enrolled and expelled students | All cases |
Independent variables | | |
GPA | Grade point average calculated for each student | All cases |
Age | | Cases #2–#5 |
Semester | | Cases #2–#5 |
Sex | | Cases #2–#5 |
Engagement in club activities | | Cases #3–#5 |
Nationality | Koreans and foreigners | Case #4 |
Parental address | Nearby and distant | Case #4 |
Number of consultations | | Case #4 |
Number of volunteer activities engaged in | | Case #4 |
Number of surveys completed evaluating satisfaction with extracurricular activities | | Cases #4, #5 |
Number of surveys completed evaluating satisfaction with the department | | Cases #4, #5 |
Extracurricular activities score | | Cases #4, #5 |
Engaged in freshman camp activities | | Cases #4, #5 |

As shown in Table 2, we used independent variables from five cases to explore how the evaluation results changed with the characteristics and number of independent variables employed for dropout prediction.

Case #1: GPA

Case #2: Case #1 + {age, semester, sex}

Case #3: Case #2 + {engagement in club activities}

Case #4: Case #3 + {nationality, parental address, extracurricular activity score, number of volunteer activities, number of surveys completed evaluating satisfaction with extracurricular activities, number of surveys completed evaluating satisfaction with the department, number of consultations, and engagement in freshman camp activities}

Case #5: Case #3 + {extracurricular activity score, number of surveys completed evaluating satisfaction with extracurricular activities, number of surveys completed evaluating satisfaction with the department, number of consultations, and engagement in freshman camp activities}
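The incremental construction of the five feature sets can be expressed as a small configuration. The following sketch uses our own shorthand variable names, not identifiers from the authors' code:

```python
# Illustrative encoding of the five feature sets (Cases #1-#5).
# Variable names are our own shorthand, not the authors' identifiers.
CASE_1 = ["gpa"]
CASE_2 = CASE_1 + ["age", "semester", "sex"]
CASE_3 = CASE_2 + ["club_activities"]
CASE_4 = CASE_3 + [
    "nationality", "parental_address", "extracurricular_score",
    "volunteer_count", "extracurricular_survey_count",
    "department_survey_count", "consultation_count", "freshman_camp",
]
# Case #5 drops the three variables that ANOVA found non-significant.
CASE_5 = [v for v in CASE_4
          if v not in {"nationality", "parental_address", "volunteer_count"}]

print(len(CASE_4), len(CASE_5))
```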

Case #5 was derived from Case #4 (i.e., the full feature set) by excluding the independent variables (i.e., nationality, parental address, and number of volunteer activities) that did not significantly affect dropout, as revealed by ANOVA (Table 3). In this analysis of deviance, each independent variable is judged by how much its inclusion reduces the residual deviance relative to the null model; a significant reduction indicates that the variable improves model fit. We also found that engagement in extracurricular activities significantly reduced dropout.
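The deviance-based test behind Table 3 can be reproduced in outline: fit nested logistic regression models and check whether the drop in residual deviance from adding a variable exceeds the chi-square threshold. The sketch below uses synthetic data and our own variable names, not the study's records:

```python
import numpy as np

# Toy stand-in for the student data (illustrative, not the authors' dataset).
rng = np.random.default_rng(0)
n = 2000
gpa = rng.normal(3.0, 0.5, n)
age = rng.integers(18, 30, n).astype(float)
logit = 2.0 - 1.5 * gpa          # dropout depends on GPA only in this toy
dropout = rng.binomial(1, 1 / (1 + np.exp(-logit)))

def fit_deviance(X, y, iters=25):
    """Newton-Raphson fit of a logistic regression; returns residual deviance."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-X @ w))
        H = X.T @ (X * (p * (1 - p))[:, None])
        w += np.linalg.solve(H, X.T @ (y - p))
    p = np.clip(1 / (1 + np.exp(-X @ w)), 1e-12, 1 - 1e-12)
    return -2 * np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

ones = np.ones((n, 1))
dev_null = fit_deviance(ones, dropout)
dev_gpa = fit_deviance(np.column_stack([ones, gpa]), dropout)
dev_age = fit_deviance(np.column_stack([ones, gpa, age]), dropout)

# Adding one variable: the deviance drop is chi-square with 1 df,
# so a drop above 3.84 is significant at the 0.05 level.
drop_gpa = dev_null - dev_gpa
drop_age = dev_gpa - dev_age
print(f"GPA: deviance drop {drop_gpa:.1f};  age: deviance drop {drop_age:.2f}")
```

GPA produces a large, significant deviance drop because it drives the simulated outcome; age, which does not, produces only a small drop, mirroring how the non-significant variables in Table 3 were identified.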

Table 3.

 | Df | Deviance | Resid. Df | Resid. Dev | Pr(>Chi) |
---|---|---|---|---|---|
NULL | | | 10,052 | 10,204.7 | |
Age | 1 | 1,367.89 | 10,051 | 8,836.8 | <2.2e-16 |
Semester | 1 | 2,568.04 | 10,050 | 6,268.7 | <2.2e-16 |
Sex | 1 | 25.08 | 10,049 | 6,243.7 | 5.503e-07 |
GPA | 1 | 456.07 | 10,048 | 5,787.6 | <2.2e-16 |
Engagement in club activities | 1 | 567.49 | 10,046 | 5,189.4 | <2.2e-16 |
Nationality | 1 | 0.12 | 10,045 | 5,189.3 | 0.72881 |
Parental address | 1 | 3.41 | 10,044 | 5,185.9 | 0.06478 |
Extracurricular activities score | 1 | 643.46 | 10,043 | 4,542.4 | <2.2e-16 |
Number of volunteer activities engaged in | 1 | 3.40 | 10,042 | 4,539.0 | 0.06518 |
Number of surveys completed evaluating satisfaction with extracurricular activities | 1 | 249.53 | 10,041 | 4,289.5 | <2.2e-16 |
Number of surveys completed evaluating satisfaction with the department | 1 | 72.42 | 10,040 | 4,217.1 | <2.2e-16 |
Number of consultations | 1 | 3.42 | 10,039 | 4,213.6 | 0.06447 |
Engaged in freshman camp activities | 1 | 38.62 | 10,038 | 4,175.0 | 5.148e-10 |

We used the accuracy, precision, recall, F-score, and area under the ROC curve (AUC) metrics to evaluate the four models. In particular, we selected the F-score and AUC because the class ratio of the experimental data is imbalanced.
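The first four measures have standard definitions in terms of the confusion-matrix counts (TP, TN, FP, FN); the AUC is the area under the ROC curve and has no closed form in these counts:

[TeX:] $$\text{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN}, \quad \text{Precision}=\frac{TP}{TP+FP}, \quad \text{Recall}=\frac{TP}{TP+FN}, \quad F\text{-score}=\frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision}+\text{Recall}}$$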

To facilitate comprehension of these measures, Table 4 shows the confusion matrix. The true positive (TP), true negative (TN), false positive (FP), and false negative (FN) counts are defined as follows.

TP: The model predicted that students dropped out, and they did in fact drop out.

TN: The model predicted that students did not drop out, and this was in fact the case.

FP: The model predicted that students dropped out, but they did not drop out (type I error).

FN: The model predicted that students would not drop out, but they did drop out (type II error).

Table 4.

Prediction result | Actual: True | Actual: False |
---|---|---|
True | TP | FP (type I error) |
False | FN (type II error) | TN |

A predictive model matches every actual outcome only when all of its predictions fall into the TP and TN cells, which corresponds to an accuracy of one. In this sense, precision measures the extent of type I error, and recall measures the extent of type II error.
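These relationships can be made concrete by computing the four measures directly from confusion-matrix counts. The counts below are made up for illustration, not taken from the study:

```python
# Metrics from illustrative confusion-matrix counts.
tp, fp, fn, tn = 80, 10, 20, 890

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # low precision -> many type I errors (FP)
recall = tp / (tp + fn)      # low recall    -> many type II errors (FN)
f_score = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f_score)
```

Note how accuracy (0.97 here) can look strong on an imbalanced class ratio even while recall is noticeably lower, which is why the F-score and AUC were also used.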

Fig. 6 compares the four methods in terms of accuracy. The NB model was less accurate than the other models; the application of the MP method to Case #5 yielded the greatest accuracy (i.e., 0.95). The accuracies of the LR and MP methods increased as more independent variables were added, but the accuracies of the DT and NB methods did not.

Fig. 7 compares the four methods in terms of precision. The NB method was less precise than the other methods; the application of the DT method to Case #5 yielded the greatest precision (i.e., 0.91). Thus, the use of the NB method resulted in more type I errors than did the use of the other methods. The precisions of the LR, DT, and MP methods increased as more independent variables were added, whereas the precision of the NB method did not. A known weakness of the NB method is that its predictive performance degrades when the independent variables are mutually dependent, because the method assumes conditional independence. In this experiment, dependencies between variable pairs such as (age, semester) and (engagement in club activities, extracurricular activities score) appear to have degraded the performance of the NB method.

Fig. 8 compares the four methods in terms of recall. The degree of recall was lower for the NB method than for the other methods for Cases #1, #2, #3, and #4; however, the degree of recall for the NB method for Case #5 was the highest (i.e., 0.92). Thus, type II errors created using the NB method were significantly reduced by optimizing variable selection (i.e., from Case #4 to Case #5). The degrees of recall of the LR, NB, and MP methods increased as more independent variables were added, whereas the degree of recall of the DT method did not.

Fig. 9 compares the four methods in terms of the F-score. The F-score was lower for the NB method than for the other methods. As shown in Fig. 9, the highest F-score was 0.87, obtained by using the MP method to analyze Case #5. Fig. 10 compares the four methods in terms of AUCs. As shown in Fig. 10, the highest AUC was 0.98, obtained when the MP method was used to analyze Case #5. Thus, the predictive model generated by analyzing the independent variables of Case #5 via the MP method showed the best performance. However, the results of the MP method do not differ much from those of the LR method. The LR method is similar to a single-layer neural network, which divides the pattern space linearly into two regions, whereas the MP method used in this paper, a two-layer neural network, divides the pattern space into convex regions and is therefore theoretically more expressive than the LR method. We leave it to future studies to investigate whether the MP method can yield substantially better results than the LR method.
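The geometric distinction between the two methods can be illustrated with the classic XOR pattern: no single linear boundary (as in LR) separates it, but a two-layer network can, by intersecting half-planes into a convex region. This is a toy sketch with hand-chosen weights, not a model trained on the study's data:

```python
import numpy as np

# XOR: the classic pattern a single linear decision boundary cannot
# separate, but a two-layer network can, by intersecting half-planes
# into a convex region. Weights below are hand-chosen, not learned.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])

step = lambda z: (z > 0).astype(int)

# Hidden layer: h1 = x1 OR x2, h2 = x1 AND x2 (two half-plane cuts).
W1 = np.array([[1.0, 1.0], [1.0, 1.0]])
b1 = np.array([-0.5, -1.5])
# Output: h1 AND (NOT h2) carves out the convex region between the cuts.
w2 = np.array([1.0, -2.0])
b2 = -0.5

h = step(X @ W1 + b1)
pred = step(h @ w2 + b2)
print(pred)  # [0 1 1 0] -- matches y exactly

# Exhaustive check over a small grid: no single linear threshold on
# (x1, x2) classifies all four XOR points correctly.
best = 0
for a in (-1, 0, 1):
    for b in (-1, 0, 1):
        for c in (-1.5, -0.5, 0.5, 1.5):
            best = max(best, np.mean(step(X @ np.array([a, b]) + c) == y))
print(best)  # at most 0.75 over this grid
```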

Here, we used LR, a DT, an NB model, and an MP to create predictive models that might provide information for the prevention of student dropout. The predictive model built with the MP method, using the independent variables selected with the aid of the analysis of variance (Case #5), showed the best performance (the F-score and AUC were 0.87 and 0.98, respectively).

We will improve the performance of the MP model and apply the optimized model to our school management system to better prevent dropout. We will counsel students who are at risk (as revealed by data analysis), and establish a data-driven campus management plan embracing student guidance, the living environment, and campus activities.

He received his Ph.D. in Computer Science Education from Korea University, where he did research in distributed systems. He was previously a research professor at Korea University, a research scientist at the University of Houston, and a senior engineer at Samsung Electronics. He is currently an assistant professor in the Department of Artificial Intelligence and Big Data Engineering at Daegu Catholic University. He has authored and co-authored publications covering research problems in distributed systems, computer architecture, mobile computing, P2P computing, grid computing, cloud computing, computer security, and computer science education.

She received her Ph.D. degree in Computer Science and Engineering from New South Wales University, Sydney, Australia in 2003. She is currently a professor in the School of Computer Software at Daegu Catholic University, South Korea. Her research interests include knowledge acquisition, machine learning, knowledge management and retrieval, computer science education, and Internet of Things.

He received his Ph.D. degree in Statistics from Korea University, Korea, in 1989, where he did research in kernel estimation of density and regression functions. He has been a professor in the Department of Artificial Intelligence and Big Data Engineering at Daegu Catholic University since 1994. He has authored and co-authored publications covering research problems in statistical computing and computational work based on statistical simulation, computer science, and big data. His recent research interests include artificial intelligence, big data computing, and data science.

He received his Ph.D. degree in Computer Science and Engineering from Korea University, Korea, in 2000. Before joining the School of Computer Software at Daegu Catholic University, he was a senior researcher in the Supercomputing Center, Korea Institute of Science and Technology Information (KISTI), Daejeon, Korea, from October 2002 to February 2006. From June 2001 to May 2002, he was a visiting research associate in the Department of Computer Science at the University of Illinois at Chicago, USA. His recent research interests include artificial intelligence, big data computing, cloud computing, distributed and parallel computing, and wireless sensor networks.
