
Loan Approval


In this project, I will be using a decision tree algorithm and data on people applying for a loan to classify whether or not they get approved.

October 12th, 2023


The Problem

When applying for a loan, there is no clear way to find out whether you will be approved before you actually go through the application process. You can look at your credit score to get a rough idea of your chances, but that's still not entirely reliable. The purpose of this project is to use a classification machine learning algorithm to predict the outcome of a loan application from an applicant's information, such as annual income and the loan amount. The questions below will be answered by the end of this project:

  • What are the top 3 most important variables in determining the outcome of a loan application?

  • Does being self-employed help or hurt your chances of getting approved?

  • What is a good credit score to have to be considered for approval?

The Data

The dataset I am using for this project is from Kaggle; I have linked the URL below. The Kaggle user who posted this dataset is from India, and it is important to mention this for a reason I will get to later. The dataset is a CSV file with 13 columns and 4,269 samples. The columns include whether or not the applicant has graduated college, whether or not they are self-employed, and their asset values. There is also a column named CIBIL score, which is simply a credit score used primarily in India.

Preprocessing the Data

This dataset has no empty values, which is nice, but that's not the only thing we care about when preprocessing the data. If we look at a full description of the data, we can see that the average annual income is 15 million. On the Kaggle page, there was no mention of which currency the assets and income are measured in. Considering that the data uses the CIBIL score, the Kaggle user is from India, and the average annual income would be very large in US dollars, it is reasonable to assume that these features are in Indian Rupees. To make this data more understandable to me, I will be converting the income and assets to USD. This is done simply by multiplying these columns by 0.012, roughly how much one Indian Rupee was worth in USD at the time of writing. I'd also like to point out that the smallest value under residential assets is a negative number. This might seem weird at first, but if an applicant already has a loan for a house and no other residential assets, then their net residential asset value can be negative. The last step is to take the graduate, self-employed, and loan status columns and turn their string values of "Yes" or "No" into true or false values.
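Here is a minimal sketch of those two preprocessing steps using pandas. The column names and the tiny sample rows are assumptions made up for illustration, not the real schema of the Kaggle file; only the 0.012 rate and the "Yes"/"No" values come from the description above.

```python
import pandas as pd

INR_TO_USD = 0.012  # conversion rate quoted in the text

# Made-up sample rows; the column names are hypothetical stand-ins.
df = pd.DataFrame({
    "income_annum": [9_600_000, 4_100_000],
    "residential_assets_value": [2_400_000, -100_000],
    "graduate": ["Yes", "No"],
    "self_employed": ["No", "Yes"],
    "loan_status": ["Yes", "No"],
})

# Convert every rupee-denominated column to USD.
for col in ["income_annum", "residential_assets_value"]:
    df[col] = df[col] * INR_TO_USD

# Turn the "Yes"/"No" strings into booleans.
for col in ["graduate", "self_employed", "loan_status"]:
    df[col] = df[col].str.strip().map({"Yes": True, "No": False})

print(df)
```

On the real file, the same loops would just need the actual column names in place of the hypothetical ones.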

Data Visualisation

Looking at the bar graph for this section, we can already answer one of our questions. It displays the relationship between an applicant's self-employment status and the outcome of their loan application. The graph makes it pretty clear that whether or not an applicant is self-employed bears little to no weight on the loan status. This is certainly an unexpected answer to the second question.
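The same comparison can be checked numerically by computing the approval rate for each group. This is only a sketch on made-up rows with hypothetical column names, so the numbers below are not results from the real dataset.

```python
import pandas as pd

# Made-up rows for illustration; both columns are assumed to be booleans already.
df = pd.DataFrame({
    "self_employed": [True, True, False, False, True, False],
    "loan_status":   [True, False, True, False, True, True],
})

# Share of approved applications within each group.
rate_self_employed = df.loc[df["self_employed"], "loan_status"].mean()
rate_not_self_employed = df.loc[~df["self_employed"], "loan_status"].mean()

print(f"Self-employed approval rate:     {rate_self_employed:.2f}")
print(f"Not self-employed approval rate: {rate_not_self_employed:.2f}")
```

If the two rates come out close to each other on the real data, that matches what the bar graph shows.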

Modeling the Data

When modeling the data, we split it into four dataframes: the features and target values of the training set, and the features and target values of the testing set. Since we have a large number of samples, I split the training and testing sets 50/50. I chose this because I will be using a decision tree as the classification algorithm. My reasoning was that training on half of the data keeps the tree from growing too many branches, whereas an 80/20 split could lead to overfitting.
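A minimal sketch of that 50/50 split, assuming scikit-learn's train_test_split. The data and column names are made up, and the random_state value is my own choice for reproducibility rather than something from the original project.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in data with hypothetical column names.
df = pd.DataFrame({
    "cibil_score": [320, 410, 480, 530, 545, 555, 610, 700, 780, 850],
    "loan_term":   [2, 4, 6, 8, 10, 2, 4, 6, 8, 10],
    "loan_status": [False, False, False, False, False,
                    True, True, True, True, True],
})

X = df.drop(columns="loan_status")  # features
y = df["loan_status"]               # target

# test_size=0.5 gives the 50/50 split described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42
)

print(len(X_train), len(X_test))
```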

I chose to use the decision tree algorithm because it handles both categorical and numerical data. This dataset has a mix of both, and since a tree splits on thresholds, there is no need to normalize the features. I also wanted something that is easy to read and understand; decision trees have the upper hand in this category compared to other classification methods.

Now let's take a look at that decision tree:


As you can see, this decision tree uses the Gini index to measure how mixed the classes are at each node and picks the splits that reduce that impurity. Now that we have this visual, we can answer the third question using the root node of the tree. The algorithm decided that the best place to split the CIBIL score was at 549.5. This suggests that an applicant has a pretty good chance of getting approved if their credit score is 550 or more. Of course, that doesn't guarantee approval, but it certainly helps your chances.
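To make the Gini index a bit more concrete: the impurity of a node is 1 minus the sum of the squared class proportions, and a split is scored by the size-weighted impurity of its two children. Below is a small sketch in plain Python; the function names are my own, and it is not how the library implements this internally.

```python
from collections import Counter


def gini_impurity(labels):
    """Return 1 - sum(p_k^2) over the classes present in labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


# A node with an even mix of approved and rejected is as impure as two classes get.
print(gini_impurity([True, True, False, False]))

# A split that separates the classes perfectly has zero impurity.
print(weighted_gini([False, False], [True, True]))
```

The tree compares candidate thresholds by this weighted score and keeps the one with the lowest value.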


When evaluating the predictive abilities of our model, we want to use the testing data that we already split off from the training data. To get an idea of how well our model works on unseen data, we will measure the accuracy of the model on the testing set. This code snippet computes that for us:
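A minimal sketch of such an accuracy check, assuming scikit-learn's DecisionTreeClassifier and accuracy_score. The data is made up (approval depends only on a hypothetical cibil_score column), so the number it prints is not the result reported for the real dataset.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Made-up stand-in data: approved whenever the hypothetical score is 550 or more.
scores = list(range(300, 900, 10))
X = pd.DataFrame({"cibil_score": scores})
y = pd.Series([score >= 550 for score in scores], name="loan_status")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# Fit on the training half only.
model = DecisionTreeClassifier(criterion="gini", random_state=0)
model.fit(X_train, y_train)

# Accuracy = share of test samples whose predicted label matches the true one.
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.3f}")
```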

Wow, this model has a 97.8% accuracy rate on the testing data. That is pretty good. Now let's answer the last remaining question by looking at a bar graph that displays the importance of each feature in order.

With this graph we can answer the first question. The top 3 most important variables in determining the loan outcome are, in order, the credit score, the loan term, and the applicant's annual income. This graph also reinforces the answer to the second question: self-employment is the least important feature in this dataset for determining the loan approval outcome.
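For reference, a ranking like this can be read from a fitted tree's feature_importances_ attribute in scikit-learn. The sketch below uses made-up data with hypothetical column names, where approval depends only on the score, so its output is illustrative rather than the ranking from the real dataset.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Made-up data: approval is decided entirely by the hypothetical cibil_score.
X = pd.DataFrame({
    "cibil_score": [300, 400, 500, 540, 560, 600, 700, 800],
    "loan_term":   [2, 4, 2, 4, 2, 4, 2, 4],
})
y = [False, False, False, False, True, True, True, True]

model = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)

# Importances are normalized to sum to 1; sort them from most to least important.
importances = (
    pd.Series(model.feature_importances_, index=X.columns)
    .sort_values(ascending=False)
)
print(importances)
```

With matplotlib installed, calling importances.plot.bar() on this Series would draw a bar graph of the same ranking.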


This project reveals that the most important feature when determining a loan approval status is the credit score of the applicant. With an importance of about 85%, the credit score towers over the other features when determining the outcome of the loan approval. This is most likely because credit scores are built from many other details of a person's credit history. The credit score essentially acts as an all-in-one for multiple other features not considered in this dataset, so it makes sense that it would have the greatest importance.

All of the questions posed at the beginning of this project have been answered as well. The answer to the second question was somewhat unexpected, with self-employment having practically no effect on the outcome. The third question was answered with the help of the decision tree, which showed that a credit score of 550 or above is helpful in securing a loan.

There is one problem with this dataset: the approval decisions probably all come from one bank or organization. Other banks may have different standards and policies when evaluating an applicant's request for a loan. They could be using different or additional data that was not even considered in this dataset.


This project could help people who are looking into procuring a loan by highlighting the important aspects of what a person would need to get approved. Navigating big financial decisions like a loan can be daunting and confusing, and hopefully this project can shed some light on the subject.

References & Code
