FORECASTING RESIDENTIAL PROPERTY VALUES

- Motivation -

Accurately determining the price of houses is crucial both for listing agents at real estate companies and for individuals seeking to buy or sell properties. For listing agents, pricing a house accurately ensures that it attracts potential buyers while maximizing the seller’s profit. For individuals looking to sell their home or buy their dream house, knowing the accurate price is essential for making informed decisions and negotiating fair deals. Inaccurate pricing can lead to missed opportunities, financial losses, or prolonged periods on the market. Precise valuation is therefore paramount for all parties in the real estate market to facilitate smooth transactions and ensure optimal outcomes. I believed that, by utilizing various machine learning algorithms, I could create a model that real estate companies and individuals could use to accurately predict residential house prices.

House prices taken from https://www.redfin.com/state/Texas/housing-market#demand

- Application -

I start by sourcing a comprehensive dataset that encompasses a wide array of house attributes, including lot size, bedroom count, garage area, and more, each paired with its corresponding sale price. A particularly robust dataset available on Kaggle contains 80 features. Within this collection, 37 features capture numerical data points such as the year of construction, house quality, and yard dimensions. The remaining 43 features hold categorical values, comprising a total of 251 distinct descriptors. For example, the “Paved Driveway” feature offers three choices: Y for Paved, P for Partially Paved, and N for Dirt/Gravel.
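Loading the dataset is straightforward with pandas; the sketch below is one way to do it (the file name “train.csv” is an assumption based on the usual Kaggle download).

    import pandas as pd

    # Load the Kaggle housing dataset (file name is an assumption).
    df = pd.read_csv("train.csv")

    # Inspect how many columns hold numerical versus categorical data.
    numerical_cols = df.select_dtypes(include="number").columns
    categorical_cols = df.select_dtypes(include="object").columns
    print(len(numerical_cols), "numerical features,", len(categorical_cols), "categorical features")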

Firstly, I want to look at histograms of the dataset’s numerical features. A histogram shows the frequency distribution of each feature, giving a nice overview of the data we have to work with.
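A minimal sketch of this step, assuming df is the DataFrame loaded above (figure size and bin count are arbitrary choices):

    import matplotlib.pyplot as plt

    # Plot a histogram for every numerical column in the DataFrame.
    df.hist(figsize=(20, 15), bins=30)
    plt.tight_layout()
    plt.show()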

Next, we’ll examine the correlation among the numerical features using a heatmap. This visualization aids us in identifying features with either high or low correlation by assigning a distinct color to each feature combination. The correlation value x ranges from -1 to 1: a correlation of -1 ≤ x < 0 signifies an inverse relationship, meaning that as one feature increases, the other decreases; a correlation of 0 indicates no discernible relationship between the two features; and a correlation of 0 < x ≤ 1 indicates a positive correlation, signifying that as one feature increases, the other also increases.
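One way to produce such a heatmap with seaborn, again assuming df from above:

    import seaborn as sns
    import matplotlib.pyplot as plt

    # Compute pairwise correlations over the numerical columns only.
    corr = df.select_dtypes(include="number").corr()

    plt.figure(figsize=(16, 12))
    sns.heatmap(corr, cmap="coolwarm", vmin=-1, vmax=1)
    plt.show()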

We can also utilize a two-dimensional scatter plot with hue to visualize the relationship between two features and their corresponding sale price. For example, we can quickly see that the year built and 1st floor square feet are strongly related to sale price.
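A sketch of such a plot; the column names “YearBuilt”, “1stFlrSF”, and “SalePrice” follow the Kaggle dataset’s naming and are assumptions here:

    import seaborn as sns
    import matplotlib.pyplot as plt

    # Scatter two features against each other and color each point by its sale price.
    sns.scatterplot(data=df, x="YearBuilt", y="1stFlrSF", hue="SalePrice", palette="viridis")
    plt.show()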

Since the machine learning algorithms we will apply only operate on numerical data, we need to transform the categorical features before they can be used. There are a few ways to do this. The technique I have opted for uses the get_dummies function within the pandas library, which performs One-Hot encoding on the categorical features in the dataset. This method creates a new ‘sub’-feature for each unique description within each categorical feature, then assigns a 1 if that ‘sub’-feature is present in the record and a 0 otherwise. An example is provided below.
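In code, the encoding step might look like the following minimal sketch (the file name is again an assumption):

    import pandas as pd

    df = pd.read_csv("train.csv")

    # get_dummies one-hot encodes every categorical (object-typed) column,
    # creating one indicator column per unique description.
    encoded = pd.get_dummies(df)
    print(encoded.shape)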

One downside of One-Hot encoding is the substantial increase in dimensionality, since a separate column is created for each unique description. This can increase complexity and slow down the model. It also leads to sparse data, since the majority of elements will have a value of 0. Additionally, there is a possibility of overfitting if the sample size is not large enough. If the dimensionality does appear too large, creating a new ‘sub’-feature labeled “Other” can help by giving rarely occurring descriptions a shared place instead of a separate column each. After utilizing One-Hot encoding on our dataset, we receive a total of 288 features.
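One way to implement that “Other” bucket is to collapse rare categories before encoding; the helper below and its threshold of 10 occurrences are illustrative assumptions, not part of the original pipeline:

    import pandas as pd

    df = pd.read_csv("train.csv")

    def collapse_rare(series, min_count=10):
        # Replace descriptions that occur fewer than min_count times with "Other".
        counts = series.value_counts()
        rare = counts[counts < min_count].index
        return series.where(~series.isin(rare), "Other")

    for col in df.select_dtypes(include="object").columns:
        df[col] = collapse_rare(df[col])

    encoded = pd.get_dummies(df)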

Now that the categorical features are numerical, we can split the dataset into training and testing sets. I will be utilizing 20% of the total data for testing and the remaining 80% for training, using Scikit-Learn’s train_test_split function for convenience.
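A sketch of the split, assuming encoded is the one-hot encoded DataFrame from above and “SalePrice” is the target column (the random_state is an arbitrary choice for reproducibility):

    from sklearn.model_selection import train_test_split

    X = encoded.drop(columns=["SalePrice"])
    y = encoded["SalePrice"]

    # 80% of the data for training, 20% held out for testing.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)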

Since our dataset contains the desired dependent variable ‘Sale Price’, we will run our data through supervised learning algorithms. I will be using Linear Regression, Random Forest Regressor, and Random Forest Regressor with Hyperparameter Optimization.
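Fitting the three models might look like the sketch below; the hyperparameter grid is an illustrative assumption rather than the exact grid used in this project:

    from sklearn.linear_model import LinearRegression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV

    # Plain linear regression and a default random forest.
    lin_reg = LinearRegression().fit(X_train, y_train)
    rf = RandomForestRegressor(random_state=42).fit(X_train, y_train)

    # Random forest with hyperparameter optimization via grid search (assumed grid).
    param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10, 20]}
    rf_tuned = GridSearchCV(RandomForestRegressor(random_state=42), param_grid, cv=5)
    rf_tuned.fit(X_train, y_train)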

Finally, we score each model on the held-out test data to determine its final accuracy.
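Continuing from the snippets above, the score method of scikit-learn regressors returns the coefficient of determination (R²), which is what I assume is reported as “accuracy” below:

    # Evaluate each fitted model on the held-out test set.
    for name, model in [("Linear Regression", lin_reg),
                        ("Random Forest Regressor", rf),
                        ("Random Forest Regressor (tuned)", rf_tuned)]:
        print(f"{name}: {model.score(X_test, y_test):.1%}")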

 

Linear Regression received an accuracy of 86.4%.

Random Forest Regressor received an accuracy of 86.3%.

Random Forest Regressor with Hyperparameter Optimization received an accuracy of 86.6%.
