Joshua DePoy

# House Prices

In this project, I will be using linear regression to find which features of a house are the biggest determining factors of the sale price.

## Date

October 25th, 2023

# The Problem

When selling a house, you want to make sure that the price you are selling it at is reasonable compared to the features of the house itself. There are many features to consider, including but not limited to square footage, number of beds and baths, and the year it was built. Which of these many features will have the greatest effect on the house price? That is what we will be looking at in this project.

# The Data

The question above will be answered using a dataset from a Kaggle competition. The dataset has 81 columns describing 1,460 houses. These houses have already been sold, so we know the price each one sold for. There are also features such as the year the house was built, the size of the lot it sits on, and the above-ground square footage of the house. I expect to use features like these to determine the sale price, but which ones I actually use will depend on their correlation with sale price.

# Linear Regression

Linear regression is what I will be using in this project to predict the sale price based on various features. This method essentially plots the independent variable (the feature) against the dependent variable (the target value) on a scatter plot. Starting with a horizontal line at the mean of the target values, we measure the sum of the squared residuals, where a residual is the difference between the line and the actual y value at that point. Adjusting the line repeatedly, we find the optimal fit: the line that minimizes the sum of squared residuals, which is why this method is also known as "least squares".
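The least-squares idea above can be sketched on a few toy points (the numbers here are made up for illustration; `np.polyfit` solves for the optimal slope and intercept directly):

```python
import numpy as np

# Toy data: a feature x and a target y with a roughly linear relationship.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

# The least-squares line y = a*x + b minimizes the sum of squared residuals.
a, b = np.polyfit(x, y, deg=1)

# Residuals are the gaps between the fitted line and the actual y values.
residuals = y - (a * x + b)
sse = np.sum(residuals ** 2)  # sum of squared residuals at the optimum
print(a, b, sse)
```

Any other line through these points would give a larger sum of squared residuals than the one `polyfit` returns.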

# Experiment 1

### Data Understanding

Let’s start working with the data! The first thing we should check is a heatmap of the correlation data. This gives us an informative visual of how closely each feature correlates with the SalePrice target value.
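The correlation check can be sketched like this. A tiny made-up frame stands in for the real Kaggle data here, since the point is just the `corr()` call; the real heatmap comes from passing the full correlation matrix to seaborn:

```python
import pandas as pd

# Toy stand-in for the Kaggle frame (the real one has 81 columns).
df = pd.DataFrame({
    "OverallQual": [5, 6, 7, 8, 5, 9],
    "GrLivArea":   [1100, 1400, 1700, 2100, 1200, 2600],
    "YearBuilt":   [1960, 1975, 1990, 2005, 1965, 2010],
    "SalePrice":   [130000, 165000, 210000, 280000, 140000, 350000],
})

# Correlation of every numeric feature with the target, sorted high to low.
corr = df.corr()["SalePrice"].drop("SalePrice").sort_values(ascending=False)
print(corr)

# For the heatmap itself, something like:
#   import seaborn as sns
#   sns.heatmap(df.corr(), cmap="Oranges")
```

Sorting the SalePrice column of the correlation matrix is a quick text version of reading the bottom row of the heatmap.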

As you can see there are a lot of features. The only row we care about, though, is the bottom one, as it displays the correlation between SalePrice and every other feature. The feature with the highest correlation is the one with the highest number and the brightest shade of orange. OverallQual, the overall material and finish quality of the house, seems to be the feature with the highest correlation. This is a little surprising, so let's look into it a bit further by plotting the data points on a scatter plot.

With this graph it's clear there is a pattern of higher sale prices with higher material quality ratings. So let's model it.

### Data Preprocessing

The preprocessing on this dataset didn't require much effort, which is to be expected since it came from a Kaggle competition, so the data is naturally pretty clean. First I separated the columns we needed from the main data frame; there were no null values in the selected columns. The last step was to split the data into training and testing sets. For this project I stuck with the default 75/25 split.
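Those preprocessing steps can be sketched as follows. A toy frame stands in for the real data, which the project would load with `pd.read_csv("train.csv")`:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the Kaggle frame.
df = pd.DataFrame({
    "OverallQual": [4, 5, 6, 7, 8, 9, 5, 7],
    "SalePrice": [90000, 120000, 160000, 200000, 260000, 330000, 130000, 210000],
})

# Separate the feature column(s) and the target.
X = df[["OverallQual"]]
y = df["SalePrice"]

# Verify the selected columns have no missing values.
assert X.isnull().sum().sum() == 0 and y.isnull().sum() == 0

# train_test_split defaults to a 75/25 split.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
print(len(X_train), len(X_test))
```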

### Data Modeling

The modeling is the easy part here. I used sklearn's Linear Regression method to fit the training data and find the coefficient and the intercept of the line. The output is below.
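A minimal sketch of the fit, on made-up training data; the real notebook's fit reported a coefficient of about 44,774.94 and an intercept of about -92,743.83:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data lying exactly on the line y = 45,000*x - 90,000.
X_train = np.array([[4], [5], [6], [7], [8], [9]])
y_train = np.array([90000, 135000, 180000, 225000, 270000, 315000])

model = LinearRegression().fit(X_train, y_train)
print(model.coef_[0], model.intercept_)  # slope and intercept of the line

# Predicting from the fitted line, e.g. a quality rating of 8:
p = model.predict([[8]])[0]
print(p)
```

`model.predict` is doing the same plug-into-the-line arithmetic worked out by hand below.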

This tells us that for a house with a material quality rating of 8, the model would predict a sale price of (8 * 44,774.94) - 92,743.83 = $265,455.69.

I'd like to acknowledge here that if you were to input a 1 as the OverallQual, the model would predict a negative sale price. This is most likely because the OverallQual data mainly contains scores in the 3-10 range, with only 4 houses at the very bottom of the scale. The data is clearly biased toward higher scores, so the model was fit around those, leaving the lower scores in the dust. Later on we will find a better way to model this data.

### Data Evaluation

Now we evaluate how well the model predicts the sale price on the testing data. We will use the root mean squared error (RMSE) to compare how well the model works relative to other features. This number is meant to be as close to 0 as possible, but the variation in SalePrice is so high that the raw RMSE is mainly useful for comparing models against each other. The R squared score is a better standardized evaluation measure: it tells us what percentage of the variance of the dependent variable is explained by the variance of the independent variable(s). For example, if R squared is 1, then 100% of that variance is explained. For this model, 64.9% of the variance in SalePrice is explained by the variance in OverallQual, which tells us that OverallQual is a pretty good predictor of sale price.
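Both measures come straight from sklearn. A sketch on made-up predictions (the real model's R squared was about 0.649):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy actual vs. predicted prices, each prediction off by $10,000.
y_test = np.array([150000, 200000, 250000, 300000])
y_pred = np.array([160000, 190000, 240000, 310000])

# RMSE is in the same units as SalePrice (dollars).
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

# R squared: fraction of the target's variance the model explains.
r2 = r2_score(y_test, y_pred)
print(rmse, r2)
```

Here every prediction misses by $10,000, so the RMSE is exactly 10,000, while R squared stays high because the actual prices vary far more than that.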

# Experiment 2

For this experiment I want to use the second-best correlated feature, GrLivArea: the above-ground square footage of the house. I'll use the same linear regression class from sklearn to see how closely it really correlates with sale price. The preprocessing steps are pretty much the same as in Experiment 1: selecting the columns we need from the data frame, making sure there are no nulls, and splitting the data into training and testing sets. After modeling, we get the coefficient and the intercept.
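Since only the feature column changes between experiments, the shared pipeline can be wrapped in a small helper (a sketch; the `fit_and_eval` name and the synthetic data are my own, not from the notebook):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

def fit_and_eval(df, features, target="SalePrice"):
    """Fit a linear regression on `features` and report RMSE and R squared."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df[target], random_state=0)
    model = LinearRegression().fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, pred))
    return model, rmse, r2_score(y_test, pred)

# Synthetic frame standing in for the Kaggle data: price grows with area
# plus noise.
rng = np.random.default_rng(0)
area = rng.uniform(800, 3000, 200)
df = pd.DataFrame({
    "GrLivArea": area,
    "SalePrice": 50000 + 110 * area + rng.normal(0, 20000, 200),
})

model, rmse, r2 = fit_and_eval(df, ["GrLivArea"])
print(rmse, r2)
```

Swapping `["GrLivArea"]` for `["OverallQual"]`, or passing both, reruns the whole experiment with one line changed.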

This tells us that if we have a house that's 1,100 square feet, the model predicts that that house will sell for about $137,518.58. Let's check what the evaluation measures tell us about how accurate the model is.

So the RMSE is higher when using GrLivArea than when using OverallQual, which suggests OverallQual is the better feature for predicting sale price. We can also see that GrLivArea accounts for about 55.5% of the variance in SalePrice. That's not a bad R squared score, but it's definitely lower than OverallQual's.

# Experiment 3

Now that we have seen that GrLivArea and OverallQual are both decent predictors, let's put them both in the model at the same time. This kind of linear regression is called multiple linear regression, since more than one feature is used to predict the target value. The only thing we have to change when setting up this model is keeping the two feature columns together in one data frame while the target stays in its own. Everything else proceeds as before: splitting the data and fitting it. Let's take a look at the coefficients and intercept.

Look at this: two coefficients. This gives us the ability to put in different values for GrLivArea and OverallQual and get one value for SalePrice. For example, if there's a house that is 1,700 square feet and has a quality rating of 5, the model would predict a sale price of about $153,384.52 = (5 * 32,921.095 + 1,700 * 52.689 + (-100,792.258)). Checking the evaluation measures we get...
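Fitting and predicting with two features is the same sklearn call as before, just with a two-column input. A sketch on made-up data lying exactly on a plane (the real model's coefficients were about 32,921.095 and 52.689 with intercept -100,792.258):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy rows of [OverallQual, GrLivArea], with prices on the exact plane
# y = 30,000*qual + 50*area - 90,000.
X = np.array([[5, 1200], [6, 1500], [7, 1800], [8, 2200], [5, 1100], [9, 2600]])
y = 30000 * X[:, 0] + 50 * X[:, 1] - 90000

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)  # one coefficient per feature

# Predicting a 1,700 sq ft house with quality rating 5:
p = model.predict([[5, 1700]])[0]
print(p)
```

Each coefficient scales its own feature, and the prediction is the same sum-of-products arithmetic worked out by hand above.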

So this new model pays off pretty well. The RMSE is lower than in the other two experiments, so it's safe to say that combining these two features has made the model better. The R squared score of 74.9% also beats either feature on its own. Clearly this last experiment was a success!

# Impact

There can certainly be a negative impact here: the models presented in this project could underestimate what a house is worth. If a seller were to trust an underestimate from the model, they could miss out on tens of thousands of dollars in the sale. And if the model overestimates by too much, that could prevent the sale of the house entirely, wasting the seller's time. There are real problems that could occur if this model were fully trusted.

# Conclusion

I have certainly learned a lot in this project. I've found that adding new features to a model's input can improve it significantly, beyond what each feature achieves on its own. If I continued adding independent features from the dataset into the multiple linear regression model, I expect it would improve further, though I'd want to check that each new feature actually earns its place rather than just fitting noise.

# References & Code

Kaggle Dataset: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview

Jupyter Notebook: House_Prices.ipynb