Taiwan Credit Card Default Prediction
A little over a decade ago, Taiwan suffered a massive credit card debt crisis that led to serious societal problems. Although the problem has largely subsided since, being from Taiwan and studying finance made me want to look into what happened and whether defaulting clients could have been predicted, or the crisis avoided altogether. The dataset I dove into contains client information from 2005, the year before the massive debt and its fallout hit.
Explanation of the columns of the dataset (ID column dropped because of redundancy):
LIMIT_BAL: Amount of credit given, in NTD (includes both individual consumer credit and family credit)
PAY_1: Repayment status in September 2005
PAY_2: Repayment status in August 2005
Measurement scale of repayment status: -1 = paid duly; 1 = payment delayed one month; 2 = payment delayed two months; and so on.
dpnm: Default payment next month (No = 0; Yes = 1)
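To make the setup concrete, here is a minimal sketch of loading the data and dropping the redundant ID column. The values below are made up for illustration; only the column names follow the dataset described above.

```python
import pandas as pd

# Toy stand-in with the same column layout as the credit card dataset
# (the numbers here are invented, not real client records).
df = pd.DataFrame({
    "ID": [1, 2, 3],
    "LIMIT_BAL": [20000, 120000, 90000],
    "PAY_1": [2, -1, 0],   # repayment status, September 2005
    "PAY_2": [2, 2, 0],    # repayment status, August 2005
    "dpnm": [1, 1, 0],     # default payment next month
})

# ID is just a row index, so it carries no predictive signal.
df = df.drop(columns=["ID"])
print(df.columns.tolist())
```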
Here is a simple bar plot showing how many clients in our dataset defaulted.
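A plot like this can be produced directly from the target column. The series below is a small invented stand-in for the real "dpnm" column, just to show the plotting pattern.

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Invented stand-in for the dpnm column (the real data has ~22% defaults).
dpnm = pd.Series([0, 0, 0, 1, 0, 1, 0, 0])
counts = dpnm.value_counts().sort_index()

ax = counts.plot(kind="bar")
ax.set_xticklabels(["No default (0)", "Default (1)"], rotation=0)
ax.set_ylabel("Number of clients")
plt.savefig("default_counts.png")
```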
The heat map of column correlations below shows high correlation among the repayment-status, bill-amount, and payment-amount columns. The key thing to notice, though, is that the strongest correlation with "dpnm" (our dependent variable) comes from the repayment status.
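A correlation heat map of this kind can be built from the correlation matrix itself. The sketch below uses a tiny synthetic frame (three columns deliberately constructed to be correlated) rather than the real 24-column dataset.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Synthetic stand-in columns, constructed so PAY_1, PAY_2, and dpnm correlate.
df = pd.DataFrame({"PAY_1": rng.integers(-1, 3, 200)})
df["PAY_2"] = df["PAY_1"] + rng.integers(0, 2, 200)
df["dpnm"] = (df["PAY_1"] > 0).astype(int)

corr = df.corr()

fig, ax = plt.subplots()
im = ax.imshow(corr, vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr)))
ax.set_xticklabels(corr.columns, rotation=90)
ax.set_yticks(range(len(corr)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im)
fig.savefig("corr_heatmap.png")
```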
The first model I tried for predicting whether clients would default was a random forest. It reached an accuracy of about 0.809, and the repayment-status columns were its most important features. To my surprise (and this held for the rest of the models as well), gender, education, marital status, and age did not account for much, which I discovered by printing the model's feature importances. Below is the confusion matrix of the random forest model.
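The random forest step looks roughly like the sketch below. It trains on synthetic data from make_classification (not the real dataset), and the feature names are illustrative stand-ins for the credit columns.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the credit data; names are illustrative only.
X, y = make_classification(n_samples=2000, n_features=8, n_informative=3,
                           random_state=42)
feature_names = ["LIMIT_BAL", "SEX", "EDUCATION", "MARRIAGE",
                 "AGE", "PAY_1", "PAY_2", "PAY_3"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
pred = rf.predict(X_test)

acc = accuracy_score(y_test, pred)
cm = confusion_matrix(y_test, pred)

# Rank features by importance, as in the write-up.
for name, imp in sorted(zip(feature_names, rf.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name:12s} {imp:.3f}")
```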
The second model I tried was a simple decision tree. I used both "entropy" and "gini" as the splitting criterion, but the two gave very similar accuracy. Below are the decision trees and confusion matrices each produced. Again, repayment status had the highest feature importance, and although LIMIT_BAL did not crack the top three, it carried more weight in the decision trees than in the random forest or AdaBoost models.
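Comparing the two criteria can be done with a small loop like this. Again the data is a synthetic placeholder, not the actual credit records.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic placeholder data for illustration.
X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0)

results = {}
for criterion in ("entropy", "gini"):
    tree = DecisionTreeClassifier(criterion=criterion, max_depth=3,
                                  random_state=0)
    tree.fit(X_tr, y_tr)
    results[criterion] = tree.score(X_te, y_te)
    print(criterion, results[criterion])
    print(confusion_matrix(y_te, tree.predict(X_te)))
```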
After tuning my decision trees with RandomizedSearchCV, I found that a max_depth of 3 with the entropy criterion gave the best accuracy score. I then pruned the tree, and the plot below shows that the entropy criterion gives slightly better accuracy overall, though the difference does not seem significant. One consistent finding is that a max_depth of around 3 yields the best accuracy (for the other models as well).
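A RandomizedSearchCV over depth and criterion can be sketched as follows; the search space and synthetic data here are assumptions for illustration, not the exact grid used in the project.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic placeholder data for illustration.
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)

# Sample max_depth from 1-9 and try both split criteria.
param_dist = {
    "max_depth": randint(1, 10),
    "criterion": ["gini", "entropy"],
}
search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions=param_dist,
    n_iter=10, cv=5, random_state=0)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```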
I tried several different models: the ones mentioned above, two dummy baseline models, and an AdaBoost model. After ranking their accuracies, I decided that the decision tree model was the most suitable for credit card default prediction.
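The ranking step can be sketched as fitting each model and sorting by test accuracy. The model dictionary and the class-imbalanced synthetic data below are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Imbalanced synthetic data (~78% non-defaults, roughly like the real set).
X, y = make_classification(n_samples=1500, weights=[0.78], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

models = {
    "dummy_most_frequent": DummyClassifier(strategy="most_frequent"),
    "dummy_stratified": DummyClassifier(strategy="stratified",
                                        random_state=1),
    "decision_tree": DecisionTreeClassifier(max_depth=3, random_state=1),
    "adaboost": AdaBoostClassifier(random_state=1),
}
scores = {name: m.fit(X_tr, y_tr).score(X_te, y_te)
          for name, m in models.items()}

# Rank models by held-out accuracy.
for name, s in sorted(scores.items(), key=lambda t: -t[1]):
    print(f"{name:20s} {s:.3f}")
```

Dummy baselines matter here because with roughly 78% non-defaults, always predicting "no default" already scores about 0.78, so a real model must clear that bar to be useful.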
I also wanted to see how a boosting model, specifically XGBoost, would perform. Below are a plot of the second decision tree in the XGBoost ensemble and a chart of its feature importances.
Throughout the process of searching for the best-performing model for this dataset, I observed that several methods and models landed at roughly the same accuracy (0.80 to 0.82). Perhaps with more pruning, or with other ML models, the accuracy of default prediction could be pushed higher, hopefully helping reveal how Taiwan and other countries can avoid future financial crises like the one Taiwan suffered in 2006.