Bengaluru House Price Prediction Using ML

Introduction

In this tutorial, we will implement a Bengaluru house price prediction model using a machine learning algorithm. The model predicts the price of a house in Bengaluru with the help of a few parameters such as availability, size, total square feet, bath, and location.


During this Bengaluru house price prediction tutorial you will learn several things:

  1. Exploratory data analysis
  2. Dealing with missing values or noisy data
  3. Data preprocessing
  4. Creating new features from existing features
  5. Removing outliers
  6. Data visualisation
  7. Splitting data into training and testing sets
  8. Training and testing a linear regression model

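I trained the Bengaluru house price prediction model using the linear regression algorithm and got a score of about 86% on the testing data. For a regression model, the value reported by the score method is R² (the coefficient of determination) rather than a classification accuracy.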


About the dataset :-

What are the things that a potential home buyer considers before purchasing a house? The location, the size of the property, vicinity to offices, schools, parks, restaurants, hospitals or the stereotypical white picket fence? What about the most important factor — the price?


For example, for a potential homeowner, over 9,000 apartment projects and flats for sale are available in the range of ₹42-52 lakh, followed by over 7,100 apartments that are in the ₹52-62 lakh budget segment, says a report by property website Makaan. According to the study, there are over 5,000 projects in the ₹15-25 lakh budget segment followed by those in the ₹34-43 lakh budget category.


Buying a home, especially in a city like Bengaluru, is a tricky choice. While the major factors are usually the same for all metros, there are others to be considered for the Silicon Valley of India. With its millennial crowd, vibrant culture, great climate and a slew of job opportunities, it is difficult to ascertain the price of a house in Bengaluru.


Let’s start :-

The first step is to load all the required libraries, load the Bengaluru house data set using the Pandas function read_csv(), and display the top five rows of the data set using the head() method.
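A minimal sketch of this loading step. The column names follow the features mentioned in this tutorial, but the two sample rows and the file name in the comment are made up for illustration; pass the path of your own copy of the data set instead.

```python
import io

import pandas as pd

# Two made-up rows that mimic the layout of the data set (illustration only).
SAMPLE_CSV = io.StringIO(
    "area_type,availability,location,size,society,total_sqft,bath,balcony,price\n"
    "Super built-up Area,Ready To Move,Rajaji Nagar,2 BHK,,1200,2,1,85.0\n"
    "Plot Area,Ready To Move,Whitefield,4 Bedroom,,2600,5,3,120.0\n"
)


def load_house_data(source):
    """Read the house data from a file path or a file-like object."""
    return pd.read_csv(source)


# With the real file this would be something like
# load_house_data("bengaluru_house_data.csv"); that file name is an assumption.
df = load_house_data(SAMPLE_CSV)
print(df.head())  # head() shows the top five rows (only two exist here)
```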


Now perform an exploratory data analysis (EDA). First check the shape of the data set using the shape attribute, which gives the number of rows and the number of columns. Then display the percentage of null values in each column and check the value counts of the area_type column. Next, drop the features (columns) that are of no use for training our model: availability, area_type, society, and balcony. Finally, display the data set.
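The EDA steps above could look roughly like this. The small data frame is invented so the snippet runs on its own; only the column names come from the tutorial.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame with the columns named in this tutorial (illustration only).
df = pd.DataFrame({
    "area_type": ["Plot Area", "Built-up Area", "Plot Area", "Plot Area"],
    "availability": ["Ready To Move"] * 4,
    "location": ["Whitefield", "Rajaji Nagar", None, "Whitefield"],
    "size": ["2 BHK", "3 BHK", "2 BHK", "4 Bedroom"],
    "society": [None, None, "Xyz", None],
    "total_sqft": ["1000", "1500", "1100", "2600"],
    "bath": [2.0, 3.0, np.nan, 4.0],
    "balcony": [1.0, 2.0, 1.0, np.nan],
    "price": [50.0, 120.0, 55.0, 200.0],
})

print(df.shape)  # (number of rows, number of columns)

# Share of null values per column, expressed in percent.
null_percent = df.isnull().mean() * 100
print(null_percent)

# How often each area type occurs.
print(df["area_type"].value_counts())

# Drop the columns that are not used for training.
df2 = df.drop(columns=["availability", "area_type", "society", "balcony"])
print(df2.head())
```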


Then check again whether there are null values. You can see that some remain, so we drop all the rows that contain null values using the dropna() method.
Then check the shape of the data set and display the top 5 rows of the data set.
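A short sketch of the null check and the dropna() call, again on made-up rows:

```python
import numpy as np
import pandas as pd

# Made-up rows; two of the three contain a missing value.
df2 = pd.DataFrame({
    "location": ["Whitefield", None, "Rajaji Nagar"],
    "size": ["2 BHK", "3 BHK", "3 BHK"],
    "total_sqft": ["1000", "1500", "1600"],
    "bath": [2.0, 3.0, np.nan],
    "price": [50.0, 120.0, 130.0],
})

print(df2.isnull().sum())  # number of null values per column

df3 = df2.dropna()         # keep only the rows without any null value
print(df3.shape)
print(df3.head())
```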

Now check the unique values of the size feature. You can see that the values come in different forms, such as BHK, bedrooms, etc. So we write a function that extracts only the leading integer from the size feature and stores it in a new bhk feature. Display the data set to see the new feature next to size, then drop the size feature, which is of no use now.
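One possible way to write that extraction. The helper name extract_bhk is hypothetical, and the sample strings are only examples of the mixed formats described above.

```python
import pandas as pd


def extract_bhk(size):
    """Return the leading integer of a size string such as '2 BHK'."""
    return int(size.split(" ")[0])


# Example size strings (illustration only).
df3 = pd.DataFrame({"size": ["2 BHK", "4 Bedroom", "1 RK"]})

print(df3["size"].unique())

df3["bhk"] = df3["size"].apply(extract_bhk)  # new numeric feature
df4 = df3.drop(columns=["size"])              # size is no longer needed
print(df4)
```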


Now it's time to remove the outliers from bhk. First check the rows where bhk is greater than 22; a value that large is an outlier. Then check the unique values of total_sqft, which contain plain numbers (like 2000), range values (2000-3000), and mixed data type values (2000Sq Meter).

Now create a user-defined function is_float() that takes a total_sqft value as its argument and reports whether the value can be converted to a float (integer values convert to float as well). Then apply this function to the total_sqft feature, but with the tilde (~) operator, which inverts the result and returns all the values that are not floats. In other words, it returns the range and mixed data type values.
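A sketch of is_float() and the tilde filter, assuming the function returns True or False so that the result can be inverted. The sample values are invented.

```python
import pandas as pd


def is_float(value):
    """Return True if the value can be converted to a float."""
    try:
        float(value)
    except (TypeError, ValueError):
        return False
    return True


# Example values: plain numbers, a range, and a value with a unit.
total_sqft = pd.Series(["2000", "2000 - 3000", "2000Sq. Meter", "1450.5"])

# ~ inverts the boolean mask, so only the non-float values remain.
not_float = total_sqft[~total_sqft.apply(is_float)]
print(not_float)
```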


Now implement a convert_sqft_into_number() function that takes a total_sqft value as its argument. If the value is a plain number, it simply converts it to float and returns it. If the value is a range, it returns the average of both ends. If the value is of mixed data type, it returns None, because there is only one value of this type in the total_sqft feature. Then apply it to the total_sqft feature.
Then create a new feature price_per_sqft from the existing price and total_sqft features, and display the data.
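The conversion could be sketched as follows. Two things here are assumptions rather than facts from the tutorial: that price is given in lakh (so it is multiplied by 100000 to get rupees), and the made-up sample rows.

```python
import pandas as pd


def convert_sqft_into_number(value):
    """Convert a total_sqft entry to float; average ranges; None otherwise."""
    parts = str(value).split("-")
    if len(parts) == 2:
        try:
            return (float(parts[0]) + float(parts[1])) / 2
        except ValueError:
            return None
    try:
        return float(value)
    except (TypeError, ValueError):
        return None  # mixed values such as '2000Sq. Meter'


df = pd.DataFrame({
    "total_sqft": ["2000", "2000 - 3000", "2000Sq. Meter"],
    "price": [100.0, 125.0, 90.0],  # assumed to be in lakh
})

# to_numeric turns the None results into NaN so arithmetic works afterwards.
df["total_sqft"] = pd.to_numeric(df["total_sqft"].apply(convert_sqft_into_number))

# Assumption: price is in lakh, so multiply by 100000 to get rupees per sqft.
df["price_per_sqft"] = df["price"] * 100000 / df["total_sqft"]
print(df)
```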


Now display the value counts of the location feature and create an anonymous (lambda) function that removes the spaces from the left and right side of each location. After removing the spaces, check the count of locations again. Before removing the spaces the count of unique locations was 1304, and after removing them it is 1293.


Create a new variable loc_less_than_10. It contains the locations that occur fewer than 10 times in the data set.


Then create an anonymous function and apply it to location. This function returns the location itself when its count is greater than 10 and returns ‘other’ when its count is less than 10. The number of unique locations drops from 1293 to 242.
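The grouping of rare locations might be written like this. The text does not say how a location with exactly 10 rows is treated; this sketch keeps it, which is an assumption. The location counts are made up.

```python
import pandas as pd

# Made-up data: two frequent locations and two rare ones.
df = pd.DataFrame({
    "location": ["Whitefield"] * 12
    + ["Rajaji Nagar"] * 11
    + ["Yelahanka"] * 3
    + ["Hebbal"] * 2
})

location_counts = df["location"].value_counts()

# Locations that occur fewer than 10 times.
loc_less_than_10 = location_counts[location_counts < 10]

# Replace every rare location with 'other'.
df["location"] = df["location"].apply(
    lambda x: "other" if x in loc_less_than_10.index else x
)
print(df["location"].value_counts())
```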
Now remove outliers based on the bhk feature. Rows are removed where the area per bhk (total_sqft divided by bhk, as the threshold of 300 suggests) is less than 300.

Now describe the price_per_sqft feature, and you can see the outlier in it: a house price of 176470 lakh, which is not possible given the location and total square feet. So create a function remove_outlier_from_price_per_sqft(). It takes a data set and uses a standard deviation technique to remove outliers. After applying this function, describe the feature again to see the result.
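The tutorial does not spell out the exact rule, so the following is only one common form of the standard deviation technique: within each location, keep the rows whose price_per_sqft lies within one standard deviation of that location's mean. The grouping by location, the one-deviation band, and the sample numbers are all assumptions.

```python
import pandas as pd


def remove_outlier_from_price_per_sqft(df):
    """Keep rows within one standard deviation of the per-location mean."""
    kept = []
    for _, group in df.groupby("location"):
        mean = group["price_per_sqft"].mean()
        std = group["price_per_sqft"].std()
        in_range = (group["price_per_sqft"] > mean - std) & (
            group["price_per_sqft"] <= mean + std
        )
        kept.append(group[in_range])
    return pd.concat(kept, ignore_index=True)


# Made-up values with one extreme entry.
df = pd.DataFrame({
    "location": ["Whitefield"] * 5,
    "price_per_sqft": [5000, 5200, 4800, 5100, 176470],
})

print(df["price_per_sqft"].describe())
cleaned = remove_outlier_from_price_per_sqft(df)
print(cleaned["price_per_sqft"].describe())
```

Note that with this rule a location that has a single row is dropped entirely, because its standard deviation is undefined.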



Now visualize the “Rajaji Nagar” location for 2 bhk and 3 bhk houses. 2 bhk houses are shown in blue and 3 bhk houses in green. In the graph you can see that, for some houses, the 3 bhk price is less than the 2 bhk price.


Now again use a standard deviation technique to remove the outliers from price_per_sqft.



Now visualize the same graph again, and you can see that the 3 bhk house price is higher than the 2 bhk house price. Some 3 bhk house prices can still be less than the 2 bhk price because of the location.


Now check the unique values of bath. You can see that one house has 16 baths, which makes no sense. Now display the houses that have more than 10 baths.


Now visualize the number of baths using a histogram.



Keep only the houses that have fewer baths than bedrooms (at most bhk-1 baths). For example, a 4 bhk house is kept only if it has at most 3 baths (bhk-1). Now check the shape of the data: the data set contains 7325 rows and 6 columns.

Now drop price_per_sqft, which is of no further use, and display the final data. It still contains one categorical feature (location).



Now apply one-hot encoding to convert the categorical feature into numeric features, and store the result in a “dummies” data set.

Now concatenate the dummies data set with our final data set and remove the “other” column from the “dummies” data set. We can still identify the “other” location: if all the location columns are “0”, then the location is automatically “other”.



Then look at the final data set. It still contains the location feature, which is of no use now, so drop location and display the final preprocessed data set.
Then check the shape of the final data set.
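The encoding steps above can be sketched with pandas get_dummies(). The three rows are invented; the dtype=int argument is only there so the dummy columns print as 0 and 1.

```python
import pandas as pd

# Made-up preprocessed rows (illustration only).
df = pd.DataFrame({
    "location": ["Whitefield", "Rajaji Nagar", "other"],
    "total_sqft": [1000.0, 1500.0, 900.0],
    "bath": [2.0, 3.0, 1.0],
    "price": [50.0, 120.0, 40.0],
    "bhk": [2, 3, 2],
})

# One column per location, holding 0 or 1.
dummies = pd.get_dummies(df["location"], dtype=int)

# Join the dummy columns, leaving out 'other': a row whose location
# columns are all 0 is understood to be 'other'.
final = pd.concat([df, dummies.drop(columns=["other"])], axis=1)

# The original text column is no longer needed.
final = final.drop(columns=["location"])
print(final)
print(final.shape)
```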


Now it's time to prepare the data set. The data set is split into the independent and dependent features, which are stored in “x” and “y”. Then check the shape of “x” and “y”.

Then split the data set into training and testing sets using the train_test_split() function, which returns 4 data sets. Then check the shape of all four data sets.

Now define our linear regression model, train it using the training data set, and check the score of the model using the testing data set.
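A sketch of the split, training, and scoring steps with scikit-learn. The data here is synthetic, and the test_size and random_state values are assumptions, so the printed score says nothing about the real data set.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the final preprocessed data set (illustration only).
rng = np.random.default_rng(0)
n = 200
final = pd.DataFrame({
    "total_sqft": rng.uniform(500, 3000, n),
    "bath": rng.integers(1, 5, n),
    "bhk": rng.integers(1, 5, n),
})
final["price"] = 0.05 * final["total_sqft"] + 5 * final["bhk"] + rng.normal(0, 5, n)

# Independent features in x, dependent feature in y.
x = final.drop(columns=["price"])
y = final["price"]
print(x.shape, y.shape)

# test_size and random_state are assumed values, not taken from the tutorial.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=51
)
print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)

model = LinearRegression()
model.fit(x_train, y_train)

# score() returns R^2 on the given data.
score = model.score(x_test, y_test)
print(score)

# Predicted values next to the actual values.
print(model.predict(x_test)[:5])
print(y_test.values[:5])
```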



Now test the model using the testing data set. After testing, you can compare the values our model predicts with the actual values.


Create a function to test the model on custom data, which takes the location, sqft, bath, bhk, etc. I tested the model on 3 custom inputs.
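One way such a function could be written, assuming the model was trained on total_sqft, bath, bhk, and one dummy column per location. The function name predict_house_price, the training rows, and the three sample inputs are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up training data with two location dummy columns (illustration only).
x = pd.DataFrame({
    "total_sqft": [1000.0, 1500.0, 900.0, 2000.0, 1200.0, 1700.0],
    "bath": [2, 3, 1, 3, 2, 3],
    "bhk": [2, 3, 2, 4, 2, 3],
    "Rajaji Nagar": [0, 1, 0, 1, 0, 0],
    "Whitefield": [1, 0, 0, 0, 1, 0],
})
y = pd.Series([50.0, 150.0, 40.0, 210.0, 62.0, 95.0])
model = LinearRegression().fit(x, y)


def predict_house_price(model, columns, location, total_sqft, bath, bhk):
    """Build a one-row input in the training column order and predict."""
    row = pd.DataFrame(np.zeros((1, len(columns))), columns=columns)
    row.loc[0, ["total_sqft", "bath", "bhk"]] = [total_sqft, bath, bhk]
    if location in columns:
        row.loc[0, location] = 1  # unknown locations stay all zero ('other')
    return float(model.predict(row)[0])


# Three hypothetical custom inputs.
print(predict_house_price(model, x.columns, "Whitefield", 1100, 2, 2))
print(predict_house_price(model, x.columns, "Rajaji Nagar", 1600, 3, 3))
print(predict_house_price(model, x.columns, "other", 1000, 2, 2))
```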

Now save the model using the joblib library with the name “banglore house price prediction model.pkl”.
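Saving and reloading with joblib could look like this. The tiny model and the temporary directory are placeholders so the snippet runs on its own; only the file name comes from the tutorial.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholder model trained on three made-up points.
model = LinearRegression().fit(
    np.array([[1000.0], [1500.0], [2000.0]]), np.array([50.0, 75.0, 100.0])
)

# The temporary directory is an assumption; any writable folder works.
path = os.path.join(
    tempfile.gettempdir(), "banglore house price prediction model.pkl"
)

joblib.dump(model, path)    # save the trained model to disk
loaded = joblib.load(path)  # load it back later for predictions
print(loaded.predict(np.array([[1200.0]])))
```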



Source Code

  1. Go to my GitHub and fork or download the repo: Bangaluru House Price Prediction
  2. Open the .ipynb file in Jupyter Notebook.
  3. Now you can use it.

Thank you!

