Predictive Modeling: House Price Analytics in Bangalore, India
Housing situation in Bangalore, India.
Housing is increasingly becoming a basic need for every human being. This mini paper seeks to provide some insight into this need by exploring and analysing the housing property market in Bangalore. Bangalore, sometimes called the Silicon Valley of India, is the capital of the state of Karnataka. Located in the south of the country, the city is home to over 10 million people, making it the third most populous city in India [1]. It serves as the nation's chief technology hub and exporter, which means Bangalore provides a slew of employment opportunities to the Indian population. It is also reasonable to assume this could stress the housing situation, since large cities offer better jobs with higher incomes. According to the paper of Sheikh et al. [4], the property market in India differs drastically from the rest of the world, as it experiences rapid growth among other factors. Property value, on the other hand, can be difficult to determine: it is not apparent what drives the cost of residential property in a large, developed city. In addition, although budget is a determining factor in property buying, it is not clear which preferences influence the residential buyer or which ones they choose to prioritize. This mini paper puts several variables under study, namely the area type, availability, location, size, society, total square feet, number of bathrooms, balcony availability and price, to understand the scenario in Bangalore.
Problem statement
Buying a home in Bangalore is an especially tricky choice. Buyer choice can be influenced by many different aspects, which makes property prices difficult to ascertain. This leads to the question: what characteristics does a potential residence buyer consider before making a purchase? The answer to this question will give us insight into buyer dynamics in the Bangalore metro.
Hypothesis
Since Bangalore is the Silicon Valley of India, with a slew of opportunities for many Indians, it is reasonable to think that people will find it pleasant to live just about anywhere in the city. We hypothesize that the number of rooms has no effect on the sale price.
Data source
The data were curated by a specialized team in India over months of primary and secondary research. They are publicly available online and distributed under a Creative Commons license on the Kaggle platform [3]. The variables under scrutiny are either categorical or continuous: area type describes the type of development in an area; availability indicates whether a house is ready for possession or when it will be; location tells us where the residential property is situated in the metro (for example, along an airport highway); price gives the commercial value of the asset in lakhs of Indian rupees; size refers to the number of bedrooms in a particular residential property; bath tells us how many bathrooms a residence has; total square feet gives a hint at the area the property occupies; and the remaining variables, balcony and society, detail how many balconies are on a property and which housing society the property belongs to.
Method
After acquiring the data, the next steps are to clean it (remove outliers, convert categorical variables into one-hot encoded vectors), understand the variables and their distributions, perform some inferential statistics, and lastly perform predictive modeling using linear regression. The software environment for this project is Python.
Now let’s jump right into the code!
# import convenience functions
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
# statistics packages
import scipy.stats as stats
from statsmodels.stats.weightstats import ztest
import math
import warnings
warnings.filterwarnings(action="ignore") # turn off warnings
# import learners and other dependencies
# note: cross_val_score and ShuffleSplit now live in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split, cross_val_score, ShuffleSplit
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
np.random.seed(seed=123)
%matplotlib inline
# import data and read first 5 rows
df = pd.read_csv("Bengaluru_House_Data.csv")
df.head()
| | area_type | availability | location | size | society | total_sqft | bath | balcony | price |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Super built-up Area | 19-Dec | Electronic City Phase II | 2 BHK | Coomee | 1056 | 2.0 | 1.0 | 39.07 |
| 1 | Plot Area | Ready To Move | Chikka Tirupathi | 4 Bedroom | Theanmp | 2600 | 5.0 | 3.0 | 120.00 |
| 2 | Built-up Area | Ready To Move | Uttarahalli | 3 BHK | NaN | 1440 | 2.0 | 3.0 | 62.00 |
| 3 | Super built-up Area | Ready To Move | Lingadheeranahalli | 3 BHK | Soiewre | 1521 | 3.0 | 1.0 | 95.00 |
| 4 | Super built-up Area | Ready To Move | Kothanur | 2 BHK | NaN | 1200 | 2.0 | 1.0 | 51.00 |
print(f"The dataset has {df.shape[0]} observations")
The dataset has 13320 observations
- There are some ethical issues to consider, so we make a simple assumption: the availability date, the society a house belongs to, the number of balconies and the area type will not be used to determine the final price.
cols_to_drop = [
"area_type",
"availability",
"society",
"balcony"
]
df_final = df.drop(columns=cols_to_drop)
df_final.head()
| | location | size | total_sqft | bath | price |
|---|---|---|---|---|---|
| 0 | Electronic City Phase II | 2 BHK | 1056 | 2.0 | 39.07 |
| 1 | Chikka Tirupathi | 4 Bedroom | 2600 | 5.0 | 120.00 |
| 2 | Uttarahalli | 3 BHK | 1440 | 2.0 | 62.00 |
| 3 | Lingadheeranahalli | 3 BHK | 1521 | 3.0 | 95.00 |
| 4 | Kothanur | 2 BHK | 1200 | 2.0 | 51.00 |
Data cleaning
df_final.isna().sum() # check the number of missing values
location 1
size 16
total_sqft 0
bath 73
price 0
dtype: int64
- the observation here is that we have very few missing values compared to over 13,000 observations, so we can simply drop them to keep the analysis simple
df_final.dropna(inplace=True) # drop missing values
df_final.head()
| | location | size | total_sqft | bath | price |
|---|---|---|---|---|---|
| 0 | Electronic City Phase II | 2 BHK | 1056 | 2.0 | 39.07 |
| 1 | Chikka Tirupathi | 4 Bedroom | 2600 | 5.0 | 120.00 |
| 2 | Uttarahalli | 3 BHK | 1440 | 2.0 | 62.00 |
| 3 | Lingadheeranahalli | 3 BHK | 1521 | 3.0 | 95.00 |
| 4 | Kothanur | 2 BHK | 1200 | 2.0 | 51.00 |
- upon dropping missing values, we observe that size has a weird naming scheme
- we can inspect this and find a way to correct it
df_final["size"].unique()
array(['2 BHK', '4 Bedroom', '3 BHK', '4 BHK', '6 Bedroom', '3 Bedroom',
'1 BHK', '1 RK', '1 Bedroom', '8 Bedroom', '2 Bedroom',
'7 Bedroom', '5 BHK', '7 BHK', '6 BHK', '5 Bedroom', '11 BHK',
'9 BHK', '9 Bedroom', '27 BHK', '10 Bedroom', '11 Bedroom',
'10 BHK', '19 BHK', '16 BHK', '43 Bedroom', '14 BHK', '8 BHK',
'12 Bedroom', '13 BHK', '18 Bedroom'], dtype=object)
# we can use a simple function to extract the number of rooms from the column "size"
df_final["BHK"] = df_final["size"].apply(lambda x : int(x.split(" ")[0])) # convert no. of rooms to int
df_final = df_final[['location', 'size','BHK', 'total_sqft', 'bath', 'price']] # re-arrange cols
df_final.head()
| | location | size | BHK | total_sqft | bath | price |
|---|---|---|---|---|---|---|
| 0 | Electronic City Phase II | 2 BHK | 2 | 1056 | 2.0 | 39.07 |
| 1 | Chikka Tirupathi | 4 Bedroom | 4 | 2600 | 5.0 | 120.00 |
| 2 | Uttarahalli | 3 BHK | 3 | 1440 | 2.0 | 62.00 |
| 3 | Lingadheeranahalli | 3 BHK | 3 | 1521 | 3.0 | 95.00 |
| 4 | Kothanur | 2 BHK | 2 | 1200 | 2.0 | 51.00 |
# let's inspect the number of rooms
df_final["BHK"].unique()
array([ 2, 4, 3, 6, 1, 8, 7, 5, 11, 9, 27, 10, 19, 16, 43, 14, 12,
13, 18])
- some houses have 43 bedrooms. It would be interesting to look at the total sqft occupied by houses with more than 10 rooms and see if there is any relationship.
df_final[df_final["BHK"]>10]
| | location | size | BHK | total_sqft | bath | price |
|---|---|---|---|---|---|---|
| 459 | 1 Giri Nagar | 11 BHK | 11 | 5000 | 9.0 | 360.0 |
| 1718 | 2Electronic City Phase II | 27 BHK | 27 | 8000 | 27.0 | 230.0 |
| 1768 | 1 Ramamurthy Nagar | 11 Bedroom | 11 | 1200 | 11.0 | 170.0 |
| 3379 | 1Hanuman Nagar | 19 BHK | 19 | 2000 | 16.0 | 490.0 |
| 3609 | Koramangala Industrial Layout | 16 BHK | 16 | 10000 | 16.0 | 550.0 |
| 3853 | 1 Annasandrapalya | 11 Bedroom | 11 | 1200 | 6.0 | 150.0 |
| 4684 | Munnekollal | 43 Bedroom | 43 | 2400 | 40.0 | 660.0 |
| 4916 | 1Channasandra | 14 BHK | 14 | 1250 | 15.0 | 125.0 |
| 6533 | Mysore Road | 12 Bedroom | 12 | 2232 | 6.0 | 300.0 |
| 7979 | 1 Immadihalli | 11 BHK | 11 | 6000 | 12.0 | 150.0 |
| 9935 | 1Hoysalanagar | 13 BHK | 13 | 5425 | 13.0 | 275.0 |
| 11559 | 1Kasavanhalli | 18 Bedroom | 18 | 1200 | 18.0 | 200.0 |
- it is questionable whether a house with 43 bedrooms would have 40 bathrooms, and whether it occupying a total area of only 2,400 sqft makes much sense
df_final["total_sqft"].unique()
array(['1056', '2600', '1440', ..., '1133 - 1384', '774', '4689'],
dtype=object)
- some of the values in total_sqft are represented as a range; we need a single value, so let's take the average of such instances
df = df_final.copy()

def is_float(x):
    """Return True if x can be parsed as a float."""
    try:
        float(x)
    except (ValueError, TypeError):
        return False
    return True

df[~df["total_sqft"].apply(is_float)].head(10) # rows whose total_sqft is not a plain number
| | location | size | BHK | total_sqft | bath | price |
|---|---|---|---|---|---|---|
| 30 | Yelahanka | 4 BHK | 4 | 2100 - 2850 | 4.0 | 186.000 |
| 122 | Hebbal | 4 BHK | 4 | 3067 - 8156 | 4.0 | 477.000 |
| 137 | 8th Phase JP Nagar | 2 BHK | 2 | 1042 - 1105 | 2.0 | 54.005 |
| 165 | Sarjapur | 2 BHK | 2 | 1145 - 1340 | 2.0 | 43.490 |
| 188 | KR Puram | 2 BHK | 2 | 1015 - 1540 | 2.0 | 56.800 |
| 410 | Kengeri | 1 BHK | 1 | 34.46Sq. Meter | 1.0 | 18.500 |
| 549 | Hennur Road | 2 BHK | 2 | 1195 - 1440 | 2.0 | 63.770 |
| 648 | Arekere | 9 Bedroom | 9 | 4125Perch | 9.0 | 265.000 |
| 661 | Yelahanka | 2 BHK | 2 | 1120 - 1145 | 2.0 | 48.130 |
| 672 | Bettahalsoor | 4 Bedroom | 4 | 3090 - 5002 | 4.0 | 445.000 |
def convert_sqft_to_num(x):
    """Convert a total_sqft string to a float; ranges become their midpoint."""
    tokens = x.split("-")
    if len(tokens) == 2:  # e.g. "1133 - 1384" -> 1258.5
        return (float(tokens[0]) + float(tokens[1])) / 2
    try:
        return float(x)
    except ValueError:
        return None  # entries with other units, e.g. "34.46Sq. Meter", become missing
df1 = df.copy()
df1["total_sqft"] = df1["total_sqft"].apply(convert_sqft_to_num)
df1.head()
| | location | size | BHK | total_sqft | bath | price |
|---|---|---|---|---|---|---|
| 0 | Electronic City Phase II | 2 BHK | 2 | 1056.0 | 2.0 | 39.07 |
| 1 | Chikka Tirupathi | 4 Bedroom | 4 | 2600.0 | 5.0 | 120.00 |
| 2 | Uttarahalli | 3 BHK | 3 | 1440.0 | 2.0 | 62.00 |
| 3 | Lingadheeranahalli | 3 BHK | 3 | 1521.0 | 3.0 | 95.00 |
| 4 | Kothanur | 2 BHK | 2 | 1200.0 | 2.0 | 51.00 |
- we have handled the total_sqft column: ranges are now averaged and entries with other units become missing values
Feature Engineering & dimensionality reduction
df2 = df1.copy() # deep copy
df2["price_per_sqft"] = df2["price"]*100000/df2["total_sqft"] # price is in lakhs (1 lakh = 100,000 rupees)
df2 = df2[['location', 'size', 'BHK', 'total_sqft', 'bath', 'price_per_sqft', 'price']] # rearrange columns
df2.head()
| | location | size | BHK | total_sqft | bath | price_per_sqft | price |
|---|---|---|---|---|---|---|---|
| 0 | Electronic City Phase II | 2 BHK | 2 | 1056.0 | 2.0 | 3699.810606 | 39.07 |
| 1 | Chikka Tirupathi | 4 Bedroom | 4 | 2600.0 | 5.0 | 4615.384615 | 120.00 |
| 2 | Uttarahalli | 3 BHK | 3 | 1440.0 | 2.0 | 4305.555556 | 62.00 |
| 3 | Lingadheeranahalli | 3 BHK | 3 | 1521.0 | 3.0 | 6245.890861 | 95.00 |
| 4 | Kothanur | 2 BHK | 2 | 1200.0 | 2.0 | 4250.000000 | 51.00 |
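- as a quick check of the formula: for row 0, 39.07 lakh = 39.07 × 100,000 = 3,907,000 rupees, and 3,907,000 / 1,056 sqft ≈ 3,699.81 rupees per sqft, matching the price_per_sqft column above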
- let’s move to the location column and inspect what’s going on with it
print(f"we have { len(df2.location.unique()) } unique locations")
we have 1304 unique locations
- we have about 1,300 unique locations in the data. One way to deal with categorical data is one-hot encoding; however, there are far too many levels for that kind of conversion, and it would bring about the curse of dimensionality!
- we can avoid the curse of dimensionality by binning locations with few observations into their own category
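- for intuition, a naive encoding would look like the sketch below (illustrative only; not part of the pipeline)

# one dummy column per raw location: ~1,300 columns for ~13,000 rows
naive = pd.get_dummies(df2.location)
print(naive.shape[1]) # roughly len(df2.location.unique())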
df2["location"] = df2["location"].apply(lambda x: x.strip()) # remove trailing and leading whitespace
location_stats = df2.groupby("location")["location"].count().sort_values(ascending=False)
location_stats # lets see the location counts from highest to lowest
location
Whitefield 535
Sarjapur Road 392
Electronic City 304
Kanakpura Road 266
Thanisandra 236
Yelahanka 210
Uttarahalli 186
Hebbal 176
Marathahalli 175
Raja Rajeshwari Nagar 171
Bannerghatta Road 152
Hennur Road 150
7th Phase JP Nagar 149
Haralur Road 141
Electronic City Phase II 131
Rajaji Nagar 106
Chandapura 98
Bellandur 96
Hoodi 88
KR Puram 88
Electronics City Phase 1 87
Yeshwanthpur 85
Begur Road 84
Sarjapur 81
Kasavanhalli 79
Harlur 79
Banashankari 74
Hormavu 74
Kengeri 73
Ramamurthy Nagar 73
...
white field,kadugodi 1
Kanakapura Main Road 1
Kanakapura Rod 1
Kanakapur main road 1
Kanakadasa Layout 1
Kamdhenu Nagar 1
Kalkere Channasandra 1
Kalhalli 1
Kengeri Satellite Town Stage II 1
Kodanda Reddy Layout 1
Malimakanapura 1
Konappana Agrahara 1
Mailasandra 1
Maheswari Nagar 1
Madanayakahalli 1
MRCR Layout 1
MM Layout 1
MEI layout, Bagalgunte 1
M.G Road 1
M C Layout 1
Laxminarayana Layout 1
Lalbagh Road 1
Lakshmipura Vidyaanyapura 1
Lakshminarayanapura, Electronic City Phase 2 1
Lakkasandra Extension 1
LIC Colony 1
Kuvempu Layout 1
Kumbhena Agrahara 1
Kudlu Village, 1
1 Annasandrapalya 1
Name: location, Length: 1293, dtype: int64
- some locations have only one data point, while others have as many as 535 rows
- we can adopt the rule that any location with 10 or fewer observations is labelled “other”
- this will help us reduce the dimensionality problem
print(f"As we can observe there are {len(location_stats[location_stats<=10])} locations have less than 10 locations and we just bin these into one category 'other'")
As we can observe there are 1052 locations have less than 10 locations and we just bin these into one category 'other'
less_than_10_locs = location_stats[location_stats<=10]
less_than_10_locs # these we can put in a general category called other
location
BTM 1st Stage 10
Basapura 10
Sector 1 HSR Layout 10
Naganathapura 10
Kalkere 10
Nagadevanahalli 10
Nagappa Reddy Layout 10
Sadashiva Nagar 10
Gunjur Palya 10
Dairy Circle 10
Ganga Nagar 10
Dodsworth Layout 10
1st Block Koramangala 10
Chandra Layout 9
Jakkur Plantation 9
2nd Phase JP Nagar 9
Yemlur 9
Mathikere 9
Medahalli 9
Volagerekallahalli 9
4th Block Koramangala 9
Vishwanatha Nagenahalli 9
B Narayanapura 9
KUDLU MAIN ROAD 9
Ejipura 9
Vignana Nagar 9
Peenya 9
Kaverappa Layout 9
Banagiri Nagar 9
Gollahalli 9
..
white field,kadugodi 1
Kanakapura Main Road 1
Kanakapura Rod 1
Kanakapur main road 1
Kanakadasa Layout 1
Kamdhenu Nagar 1
Kalkere Channasandra 1
Kalhalli 1
Kengeri Satellite Town Stage II 1
Kodanda Reddy Layout 1
Malimakanapura 1
Konappana Agrahara 1
Mailasandra 1
Maheswari Nagar 1
Madanayakahalli 1
MRCR Layout 1
MM Layout 1
MEI layout, Bagalgunte 1
M.G Road 1
M C Layout 1
Laxminarayana Layout 1
Lalbagh Road 1
Lakshmipura Vidyaanyapura 1
Lakshminarayanapura, Electronic City Phase 2 1
Lakkasandra Extension 1
LIC Colony 1
Kuvempu Layout 1
Kumbhena Agrahara 1
Kudlu Village, 1
1 Annasandrapalya 1
Name: location, Length: 1052, dtype: int64
df2["location"] = df2.location.apply(lambda x: "other" if x in less_than_10_locs else x) # we create other location category
print(f"After the above transformation, the number of locations has been reduced to {len(df2.location.unique())}! which a simpler dimention than before")
After the above transformation, the number of locations has been reduced to 242! which a simpler dimention than before
Outlier removal
- As seen earlier, some column values didn’t make much sense. For example, we had properties with 43 bedrooms occupying a small sqft value.
- Such scenarios are indicative of anomalies, which should be taken care of as they would affect our modeling.
- Let’s investigate the sqft per room
df2[(df2.total_sqft/df2["BHK"])<300].head()
| | location | size | BHK | total_sqft | bath | price_per_sqft | price |
|---|---|---|---|---|---|---|---|
| 9 | other | 6 Bedroom | 6 | 1020.0 | 6.0 | 36274.509804 | 370.0 |
| 45 | HSR Layout | 8 Bedroom | 8 | 600.0 | 9.0 | 33333.333333 | 200.0 |
| 58 | Murugeshpalya | 6 Bedroom | 6 | 1407.0 | 4.0 | 10660.980810 | 150.0 |
| 68 | Devarachikkanahalli | 8 Bedroom | 8 | 1350.0 | 7.0 | 6296.296296 | 85.0 |
| 70 | other | 3 Bedroom | 3 | 500.0 | 3.0 | 20000.000000 | 100.0 |
- here we can see cases where a property has 8 rooms with less than 300 sqft per room
- in other words, it is questionable that a house of 8 rooms would fit into a plot of 600 sqft (about 55 sqm)
- this is an error or anomaly we should remove
print("we have {} of these outliers".format(len(df2[(df2.total_sqft/df2["BHK"])<300])))
we have 744 of these outliers
df3 = df2[~(df2.total_sqft/df2["BHK"]<300)]
df3.shape
(12502, 7)
- let’s also investigate the price per sqft
df3.columns=df3.columns.str.lower() # change col names to lower case
df3.head()
| | location | size | bhk | total_sqft | bath | price_per_sqft | price |
|---|---|---|---|---|---|---|---|
| 0 | Electronic City Phase II | 2 BHK | 2 | 1056.0 | 2.0 | 3699.810606 | 39.07 |
| 1 | Chikka Tirupathi | 4 Bedroom | 4 | 2600.0 | 5.0 | 4615.384615 | 120.00 |
| 2 | Uttarahalli | 3 BHK | 3 | 1440.0 | 2.0 | 4305.555556 | 62.00 |
| 3 | Lingadheeranahalli | 3 BHK | 3 | 1521.0 | 3.0 | 6245.890861 | 95.00 |
| 4 | Kothanur | 2 BHK | 2 | 1200.0 | 2.0 | 4250.000000 | 51.00 |
df3.price_per_sqft.describe()
count 12456.000000
mean 6308.502826
std 4168.127339
min 267.829813
25% 4210.526316
50% 5294.117647
75% 6916.666667
max 176470.588235
Name: price_per_sqft, dtype: float64
- the lowest value, 267, might be too low for a property in the Silicon Valley of India
- the maximum value is also extreme, although possible; we want to remove such extremes as they might affect the modeling
- let’s remove values beyond one standard deviation from the mean
- we will remove these outliers using the mean and standard deviation of each location, since some locations have higher prices while others are less expensive
# function to remove price_per_sqft outliers per location
def remove_outliers(df):
    df_out = pd.DataFrame()
    for key, subdf in df.groupby("location"):
        m = np.mean(subdf.price_per_sqft)  # location mean
        sd = np.std(subdf.price_per_sqft)  # location standard deviation
        reduced_df = subdf[(subdf.price_per_sqft > (m - sd)) & (subdf.price_per_sqft <= (m + sd))]  # keep everything within the interval
        df_out = pd.concat([df_out, reduced_df], ignore_index=True)
    return df_out
# apply above fn
df4 = remove_outliers(df3)
df4.shape
(10241, 7)
- let’s visualize the price of 2- and 3-bedroom homes against sqft area to see if we have any interesting observations
def plot_scatter_chart(df, location):
    sns.set()  # for better-looking plots
    bhk2 = df[(df.location == location) & (df.bhk == 2)]
    bhk3 = df[(df.location == location) & (df.bhk == 3)]
    plt.rcParams["figure.figsize"] = (12, 10)
    plt.scatter(bhk2.total_sqft, bhk2.price, color="orange", label="2 bedrooms", s=50)  # 2 bedrooms
    plt.scatter(bhk3.total_sqft, bhk3.price, marker="+", color="maroon", label="3 bedrooms", s=50)  # 3 bedrooms
    plt.xlabel("total sqft area")
    plt.ylabel("price")
    plt.title(location)
    plt.legend();
plot_scatter_chart(df4, "Rajaji Nagar")
- Around 1700 sqft, it seems unusual that a 2-bedroom would be more expensive than a 3-bedroom. This can be another case of outliers that need to be removed.
- let’s look at other locations and see if this trend is common
plot_scatter_chart(df4, "Hebbal")
plot_scatter_chart(df4, "Uttarahalli")
- we can see that this type of outlier appears fairly commonly
- we can write a function to remove these outliers
- in other words, if the price (per sqft) of a 3-bedroom is less than that of a 2-bedroom in the same location, we remove those instances
# drop n-BHK homes whose price per sqft falls below the mean of (n-1)-BHK homes in the same location
def remove_bhk_outlier(df):
    exclude_indices = np.array([])
    for location, location_df in df.groupby("location"):
        bhk_stats = {}  # per-BHK price_per_sqft statistics for this location
        for bhk, bhk_df in location_df.groupby("bhk"):
            bhk_stats[bhk] = {
                "mean": np.mean(bhk_df.price_per_sqft),
                "std": np.std(bhk_df.price_per_sqft),
                "count": bhk_df.shape[0]
            }
        for bhk, bhk_df in location_df.groupby("bhk"):
            prev_stats = bhk_stats.get(bhk - 1)  # stats of homes with one fewer bedroom
            if prev_stats and prev_stats["count"] > 5:
                exclude_indices = np.append(exclude_indices, bhk_df[bhk_df.price_per_sqft < (prev_stats["mean"])].index.values)
    return df.drop(exclude_indices, axis="index")
df5 = remove_bhk_outlier(df4)
df5.shape
(7329, 7)
- now let’s see what the function we wrote did!
plot_scatter_chart(df5, "Uttarahalli") # for the prev plot
- now there is a decent removal of the outliers
# let's also visualize the number of bathrooms
plt.figure(figsize=(6,5))
plt.hist(df5.bath, rwidth=0.8)
plt.title("bathroom counts")
plt.xlabel("number of baths")
plt.ylabel("counts");
- we can see that most residential properties have 2 to 5 bathrooms, with a few outliers
- let’s try to remove the bathroom outliers
- our criterion: if the number of bathrooms exceeds the number of bedrooms plus 2, we treat that as an outlier
df5[df5.bath>df5.bhk+2] # some of the bathroom outliers
| | location | size | bhk | total_sqft | bath | price_per_sqft | price |
|---|---|---|---|---|---|---|---|
| 1626 | Chikkabanavar | 4 Bedroom | 4 | 2460.0 | 7.0 | 3252.032520 | 80.0 |
| 5238 | Nagasandra | 4 Bedroom | 4 | 7000.0 | 8.0 | 6428.571429 | 450.0 |
| 6711 | Thanisandra | 3 BHK | 3 | 1806.0 | 6.0 | 6423.034330 | 116.0 |
| 8411 | other | 6 BHK | 6 | 11338.0 | 9.0 | 8819.897689 | 1000.0 |
- we can see that sometimes a 4-bedroom property has 7 or 8 bathrooms, which is unusual
df6 = df5[df5.bath<df5.bhk+2] # removed outliers df
- let’s also drop unnecessary columns like price_per_sqft and size
df7 = df6[['location', 'bhk', 'total_sqft', 'bath', 'price']]
df7.head()
| | location | bhk | total_sqft | bath | price |
|---|---|---|---|---|---|
| 0 | 1st Block Jayanagar | 4 | 2850.0 | 4.0 | 428.0 |
| 1 | 1st Block Jayanagar | 3 | 1630.0 | 3.0 | 194.0 |
| 2 | 1st Block Jayanagar | 3 | 1875.0 | 2.0 | 235.0 |
| 3 | 1st Block Jayanagar | 3 | 1200.0 | 2.0 | 130.0 |
| 4 | 1st Block Jayanagar | 2 | 1235.0 | 2.0 | 148.0 |
Inferential statistics
Normal distribution
- a normal distribution is a bell-shaped probability density function (pdf), symmetric about the mean, showing that data near the mean occur more frequently than data far from it
- we check the skewness of the target variable by fitting this distribution and seeing which side it leans (left or right)
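- for reference, the pdf of a normal distribution with mean $\mu$ and standard deviation $\sigma$ is $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$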
plt.rcParams['figure.figsize'] = (11, 9)
plt.xticks(rotation=30)
sns.histplot(df7['price'], kde=True) # sns.distplot is deprecated in recent seaborn
plt.title('Distribution of Target Column')
plt.show()
- we can see that the price (target) variable is skewed to the right
- the price is not normally distributed because of outliers
Sample mean and population mean
# let's randomly sample the price of 500 houses and compare the sample mean to the population mean
samples = np.random.choice(a=df7["price"],size=500)
population_mean = np.mean(df7["price"])
print(f"population mean is: {round(population_mean,3)} \nsample mean is: {round(np.mean(samples),3)}")
population mean is: 96.506
sample mean is: 101.852
- The sample mean is usually not exactly the same as the population mean. This difference can be caused by many factors including poor survey design, biased sampling methods and the randomness inherent to drawing a sample from a population.
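- the typical size of this difference is the standard error of the mean, $\mathrm{SE} = \sigma/\sqrt{n}$, which shrinks as the sample size $n$ grows; the confidence intervals in the next section are built from exactly this quantity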
Confidence interval
sample_size = 1000
samples = np.random.choice(a=df7["price"], size=sample_size) # let's take a larger sample
sample_mean = np.mean(samples)
# get the critical z-value
z_critical = stats.norm.ppf(q=0.95) # 95th percentile
pop_std = np.std(df7["price"]) # population standard deviation
# compute the margin of error
margin_of_error = z_critical * (pop_std/math.sqrt(sample_size))
# define our confidence interval
confidence_interval = (sample_mean - margin_of_error, sample_mean + margin_of_error) # 95% confidence interval
print(f"the critical z value is {z_critical} \nthe 95% CI is {confidence_interval} \nthe true population mean is {population_mean}")
the critical z value is 1.6448536269514722
the 95% CI is (88.57412780284928, 97.69427219715071)
the true population mean is 96.50611846641833
- the true mean is contained within the CI
- a 95% confidence level means that if we take many samples and build a confidence interval for each, about 95% of those intervals will contain the true population mean
- note that stats.norm.ppf(q=0.95) above gives the one-sided critical value (z ≈ 1.645); a two-sided 95% interval should use q=0.975 (z ≈ 1.96), as in the simulation below
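- the interval itself follows the standard formula $\bar{x} \pm z^{*}\frac{\sigma}{\sqrt{n}}$, where $\bar{x}$ is the sample mean, $\sigma$ the population standard deviation and $n$ the sample size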
- we can also visualize several CIs and how they capture the mean
sample_size = 500
intervals = []
sample_means = []

for _ in range(25):
    sample = np.random.choice(a=df7['price'], size=sample_size)
    sample_mean = sample.mean()
    sample_means.append(sample_mean)
    # get the z-critical value for a two-sided 95% interval
    z_critical = stats.norm.ppf(q=0.975)
    # get the population standard deviation
    pop_std = df7['price'].std()
    margin_of_error = z_critical * (pop_std / math.sqrt(sample_size))
    confidence_interval = (sample_mean - margin_of_error,
                           sample_mean + margin_of_error)
    intervals.append(confidence_interval)
plt.figure(figsize=(13, 9))
plt.errorbar(x=np.arange(0.1, 25, 1),
y=sample_means,
yerr=[(top-bot)/2 for top,bot in intervals],
fmt='o')
plt.hlines(xmin=0, xmax=25,
y=df7['price'].mean(),
linewidth=2.0,
color="red")
plt.title('Confidence Intervals for 25 Trials', fontsize = 20)
plt.show()
- It is easy to see that the intervals around the blue points (the sample means) overlap the red line (the true mean) most of the time; about 5% of intervals are expected to miss it.
Hypothesis testing
$\alpha$ = 0.05
$H_0$ : $\mu_{BHK}$ = $\mu$ (the mean price of a given BHK group equals the overall mean price)
$H_1$ : $\mu_{BHK}$ $\neq$ $\mu$
z_statistic, p_value = ztest(x1 = df7[df7["bhk"] == 1 ]['price'], value = df7['price'].mean())
print(f"The p-value is {p_value} and the Z-statistic is {z_statistic}")
The p-value is 0.0 and the Z-statistic is -50.83678679662534
z_statistic, p_value = ztest(x1 = df7[df7["bhk"] == 2 ]['price'], value = df7['price'].mean())
print(f"The p-value is {p_value} and the Z-statistic is {z_statistic}")
The p-value is 0.0 and the Z-statistic is -78.68901645622611
z_statistic, p_value = ztest(x1 = df7[df7["bhk"] == 3 ]['price'], value = df7['price'].mean())
print(f"The p-value is {p_value} and the Z-statistic is {z_statistic}")
The p-value is 1.5976796313461434e-67 and the Z-statistic is 17.362101110701
z_statistic, p_value = ztest(x1 = df7[df7["bhk"] == 4 ]['price'], value = df7['price'].mean())
print(f"The p-value is {p_value} and the Z-statistic is {z_statistic}")
The p-value is 2.7801540796510378e-92 and the Z-statistic is 20.37512315015866
z_statistic, p_value = ztest(x1 = df7[df7["bhk"] == 5 ]['price'], value = df7['price'].mean())
print(f"The p-value is {p_value} and the Z-statistic is {z_statistic}")
The p-value is 1.6776932582667467e-17 and the Z-statistic is 8.514182704074315
z_statistic, p_value = ztest(x1 = df7[df7["bhk"] == 6 ]['price'], value = df7['price'].mean())
print(f"The p-value is {p_value} and the Z-statistic is {z_statistic}")
The p-value is 7.790588657266502e-13 and the Z-statistic is 7.16478999563192
z_statistic, p_value = ztest(x1 = df7[df7["bhk"] == 7 ]['price'], value = df7['price'].mean())
print(f"The p-value is {p_value} and the Z-statistic is {z_statistic}")
The p-value is 0.17705889634431016 and the Z-statistic is 1.3498662280065419
z_statistic, p_value = ztest(x1 = df7[df7["bhk"] == 8 ]['price'], value = df7['price'].mean())
print(f"The p-value is {p_value} and the Z-statistic is {z_statistic}")
The p-value is 0.000813814545059989 and the Z-statistic is 3.348052972037006
z_statistic, p_value = ztest(x1 = df7[df7["bhk"] == 9 ]['price'], value = df7['price'].mean())
print(f"The p-value is {p_value} and the Z-statistic is {z_statistic}")
The p-value is 0.005881467511852069 and the Z-statistic is 2.754317533639393
z_statistic, p_value = ztest(x1 = df7[df7["bhk"] == 10 ]['price'], value = df7['price'].mean())
print(f"The p-value is {p_value} and the Z-statistic is {z_statistic}")
The p-value is nan and the Z-statistic is nan
z_statistic, p_value = ztest(x1 = df7[df7["bhk"] == 11]['price'], value = df7['price'].mean())
print(f"The p-value is {p_value} and the Z-statistic is {z_statistic}")
The p-value is 0.13117985558847922 and the Z-statistic is 1.5094655384150635
z_statistic, p_value = ztest(x1 = df7[df7["bhk"] == 13 ]['price'], value = df7['price'].mean())
print(f"The p-value is {p_value} and the Z-statistic is {z_statistic}")
The p-value is nan and the Z-statistic is nan
z_statistic, p_value = ztest(x1 = df7[df7["bhk"] == 16 ]['price'], value = df7['price'].mean())
print(f"The p-value is {p_value} and the Z-statistic is {z_statistic}")
The p-value is nan and the Z-statistic is nan
- a p-value less than $\alpha$ = 0.05 means we have enough evidence to reject the null hypothesis of equal means in favour of the alternative
- interestingly, we find that the mean price for most BHK groups differs from the overall mean
- however, the 7- and 11-bedroom groups have p-values greater than 0.05, so for those we do not reject the null hypothesis; the 10-, 13- and 16-BHK groups return NaN because at most one such home remains after cleaning, so no test can be run
- the conclusion is that the number of rooms has an effect on the price, contrary to our initial hypothesis
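- for tidiness, the repeated cells above can be condensed into one loop; a sketch equivalent to the per-group tests already run (small groups are skipped to avoid the NaN results):

# run the same one-sample z-test for every BHK group
overall_mean = df7["price"].mean()
for bhk, group in df7.groupby("bhk"):
    if len(group) < 2:  # ztest needs at least two observations to estimate the variance
        continue
    z_stat, p_val = ztest(x1=group["price"], value=overall_mean)
    print(f"BHK={bhk}: z={z_stat:.2f}, p={p_val:.3g}")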
Predictive Modeling
df7.head()
| | location | bhk | total_sqft | bath | price |
|---|---|---|---|---|---|
| 0 | 1st Block Jayanagar | 4 | 2850.0 | 4.0 | 428.0 |
| 1 | 1st Block Jayanagar | 3 | 1630.0 | 3.0 | 194.0 |
| 2 | 1st Block Jayanagar | 3 | 1875.0 | 2.0 | 235.0 |
| 3 | 1st Block Jayanagar | 3 | 1200.0 | 2.0 | 130.0 |
| 4 | 1st Block Jayanagar | 2 | 1235.0 | 2.0 | 148.0 |
- machine learning algorithms don’t work with raw text data; we need to convert the location variable into vectors using one-hot encoding
dummies = pd.get_dummies(df7.location, drop_first=True) # sparse indicator matrix; drop one level to avoid collinearity
df8 = pd.concat(objs=[df7, dummies], axis="columns")
df8.drop(columns=["location"], inplace=True) # drop the original text column
df8.head()
| | bhk | total_sqft | bath | price | 1st Phase JP Nagar | 2nd Phase Judicial Layout | 2nd Stage Nagarbhavi | 5th Block Hbr Layout | 5th Phase JP Nagar | 6th Phase JP Nagar | ... | Vishveshwarya Layout | Vishwapriya Layout | Vittasandra | Whitefield | Yelachenahalli | Yelahanka | Yelahanka New Town | Yelenahalli | Yeshwanthpur | other |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 2850.0 | 4.0 | 428.0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 3 | 1630.0 | 3.0 | 194.0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 3 | 1875.0 | 2.0 | 235.0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 3 | 1200.0 | 2.0 | 130.0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 2 | 1235.0 | 2.0 | 148.0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 245 columns
X = df8.drop(columns="price", axis=1) # features
y = df8["price"] # target
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.20, random_state=67)
lin_reg = LinearRegression()
lin_reg.fit(X_train, y_train)
print(f"Train accuracy: {lin_reg.score(X_train, y_train)} \nTest accuracy: {lin_reg.score(X_test,y_test)}")
Train accuracy: 0.8558407211598535
Test accuracy: 0.8334186940717209
- let’s try cross validation to see how stable the model’s performance is
split = ShuffleSplit(n_splits=5, test_size=0.2, random_state=2) # 5 shuffled 80/20 splits
cross_val_score(estimator=LinearRegression(), X=X, y=y, cv = split)
array([0.86737231, 0.85817913, 0.86058531, 0.79396905, 0.87042283])
- the model gives a stable performance across splits
- can we improve on these results?
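- one avenue is to compare other regressors; a minimal sketch using the tree-based learners imported at the top (illustrative only, results not run here):

# compare a few regressors with the same shuffled 5-fold evaluation
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=2)
for name, model in [("linear", LinearRegression()),
                    ("tree", DecisionTreeRegressor(random_state=2)),
                    ("forest", RandomForestRegressor(n_estimators=100, random_state=2))]:
    scores = cross_val_score(estimator=model, X=X, y=y, cv=cv)
    print(f"{name}: mean R^2 = {scores.mean():.3f}")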
# now let's build a function to predict the price; first, find the column index of a location dummy
np.where(X.columns=="1st Phase JP Nagar")[0][0]
3
# an earlier, buggy version of this function (sqft/bath/bhk in the wrong slots) is removed; the order below matches X
def predict_price(location, sqft, bath, bhk):
    loc_index = np.where(X.columns == location)[0][0]  # index of the location's dummy column
    x = np.zeros(len(X.columns))
    # the feature order must match X: bhk, total_sqft, bath, then the location dummies
    x[0] = bhk
    x[1] = sqft
    x[2] = bath
    if loc_index >= 0:
        x[loc_index] = 1
    return lin_reg.predict([x])[0]
predict_price('Indira Nagar',1000, 2, 2) # predicting the price of a home in a given location
188.99663978980522
- let’s export the model for deployment
import pickle
import json

# serialize the fitted model
with open("bangalore_real_estate_estimator.pickle", mode="wb") as f:
    pickle.dump(lin_reg, f)

# save the feature column order for the serving side
columns = {
    "data_columns": [col.lower() for col in X.columns]
}
with open("columns.json", mode="w") as f:
    f.write(json.dumps(columns))
- this model is now ready for production!
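- on the serving side, the artifacts can be loaded back as follows (a minimal sketch; estimate_price is an illustrative helper mirroring predict_price above):

import pickle
import json
import numpy as np

with open("bangalore_real_estate_estimator.pickle", mode="rb") as f:
    model = pickle.load(f)
with open("columns.json") as f:
    data_columns = json.load(f)["data_columns"]

def estimate_price(location, sqft, bath, bhk):
    # feature order matches training: bhk, total_sqft, bath, then the location dummies
    x = np.zeros(len(data_columns))
    x[0], x[1], x[2] = bhk, sqft, bath
    if location.lower() in data_columns:
        x[data_columns.index(location.lower())] = 1
    return model.predict([x])[0]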
References
- [1] Wikipedia: Bangalore. https://en.wikipedia.org/wiki/Bangalore
- [2] MachineHack: Predicting House Prices in Bengaluru. https://www.machinehack.com/course/predicting-house-prices-in-bengaluru/
- [3] Kaggle: Bengaluru House Price Data. https://www.kaggle.com/amitabhajoy/bengaluru-house-price-data
- [4] Sheikh, Wasim; Dash, Mihir; Sharma, Kshitiz. Trends in Residential Market in Bangalore, India. doi:10.13140/RG.2.2.33967.89768