Predictive Modeling: House price analytics Bangalore, India

Housing situation in Bangalore, India.

Housing is a basic need for every human being. This mini paper seeks to provide some insight into this need by exploring and analysing the residential property market in Bangalore. Bangalore, sometimes called the Silicon Valley of India, is the capital of the state of Karnataka. Located in the south of the country, the city is host to over 10 million people, making it the third most populous city in India. It serves as the nation's chief technology hub and exporter, which likely means that Bangalore offers a slew of employment opportunities to the Indian population. Large cities tend to offer better jobs with higher incomes, so it is reasonable to assume that this puts the housing market under stress. According to Sheikh [1], the property market in India differs drastically from the rest of the world, in part because it is growing rapidly. Property value, on the other hand, can be difficult to determine. It is not apparent what drives the cost of residential property in a large and developed city. In addition, although budget is a determining factor in property buying, it is also not clear which preferences influence the residential buyer or which ones they prioritize. To understand the situation in Bangalore, this mini paper studies the following variables: area type, availability, location, size, society, total square feet, number of bathrooms, number of balconies and price.

Problem statement

Buying a home in Bangalore is an especially tricky choice. Buyers can be motivated by many different factors, which makes property prices difficult to ascertain. This leads to the question: what characteristics does a potential buyer consider before purchasing a residence? The answer will give us an understanding of buyer dynamics in the Bangalore metro area.


Since Bangalore is a technology hub with a slew of opportunities for many Indians, it is reasonable to think that people will find it pleasant to live just about anywhere in the city. We therefore hypothesize that the number of rooms has no effect on the sale price.

Data source

The data were curated by a specialized team in India over months of primary and secondary research. They are publicly available on the Kaggle platform under a Creative Commons license [3]. The variables under scrutiny are either categorical or continuous. Area type describes the type of built-up area. Availability indicates whether a house is ready for possession or when it will be ready. Location tells us where in the metro the residential property is situated, for example along an airport highway. Price gives the commercial value of the asset in lakhs of Indian rupees. Size refers to the number of bedrooms in a residential property. Bath tells us how many bathrooms a residence has. Total square feet gives the area the property occupies. The remaining variables, balcony and society, detail how many balconies a property has and which society the property belongs to.


After acquiring the data, the next steps are to clean it (remove outliers and convert categorical variables to one-hot encoded vectors), understand the variables (their distributions), perform some inferential statistics and lastly perform predictive modeling using linear regression. The software environment for this project is Python, run in an IDE (integrated development environment) or with the plain Python interpreter.

Now let’s jump right into the code!

# import convenience functions

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# statistic package
import scipy.stats as stats 
from statsmodels.stats.weightstats import ztest
import math

import warnings
warnings.filterwarnings(action="ignore") # turn off warnings

# import learners and other dependencies
# (sklearn.cross_validation was removed in scikit-learn 0.20; its helpers now live in sklearn.model_selection)
from sklearn.model_selection import train_test_split, cross_val_score, ShuffleSplit
from sklearn.linear_model import LinearRegression, ridge_regression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

%matplotlib inline
# import data and read first 5 rows
df = pd.read_csv("Bengaluru_House_Data.csv")
df.head()
             area_type   availability                  location       size  society  total_sqft  bath  balcony   price
0  Super built-up Area         19-Dec  Electronic City Phase II      2 BHK   Coomee        1056   2.0      1.0   39.07
1            Plot Area  Ready To Move          Chikka Tirupathi  4 Bedroom  Theanmp        2600   5.0      3.0  120.00
2        Built-up Area  Ready To Move               Uttarahalli      3 BHK      NaN        1440   2.0      3.0   62.00
3  Super built-up Area  Ready To Move        Lingadheeranahalli      3 BHK  Soiewre        1521   3.0      1.0   95.00
4  Super built-up Area  Ready To Move                  Kothanur      2 BHK      NaN        1200   2.0      1.0   51.00
print(f"The dataset has {df.shape[0]} observations")
The dataset has 13320 observations
  • There are some ethical issues to consider, so we make the simplifying assumption that the availability date of a house, the society it belongs to, the number of balconies and its area type will not be used to determine the final price.
cols_to_drop = ["area_type", "availability", "society", "balcony"]
df_final = df.drop(columns=cols_to_drop)
df_final.head()
                   location       size  total_sqft  bath   price
0  Electronic City Phase II      2 BHK        1056   2.0   39.07
1          Chikka Tirupathi  4 Bedroom        2600   5.0  120.00
2               Uttarahalli      3 BHK        1440   2.0   62.00
3        Lingadheeranahalli      3 BHK        1521   3.0   95.00
4                  Kothanur      2 BHK        1200   2.0   51.00

Data cleaning

df_final.isna().sum() # check the number of missing values
location       1
size          16
total_sqft     0
bath          73
price          0
dtype: int64
  • we have very few missing values compared to roughly 13000 observations, so we can simply drop them to keep the analysis simple
df_final.dropna(inplace=True) # drop missing values
df_final.head()
                   location       size  total_sqft  bath   price
0  Electronic City Phase II      2 BHK        1056   2.0   39.07
1          Chikka Tirupathi  4 Bedroom        2600   5.0  120.00
2               Uttarahalli      3 BHK        1440   2.0   62.00
3        Lingadheeranahalli      3 BHK        1521   3.0   95.00
4                  Kothanur      2 BHK        1200   2.0   51.00
  • upon dropping missing values, we observe that size has a weird naming scheme
  • we can inspect this and find a way to correct it
df_final["size"].unique()
array(['2 BHK', '4 Bedroom', '3 BHK', '4 BHK', '6 Bedroom', '3 Bedroom',
       '1 BHK', '1 RK', '1 Bedroom', '8 Bedroom', '2 Bedroom',
       '7 Bedroom', '5 BHK', '7 BHK', '6 BHK', '5 Bedroom', '11 BHK',
       '9 BHK', '9 Bedroom', '27 BHK', '10 Bedroom', '11 Bedroom',
       '10 BHK', '19 BHK', '16 BHK', '43 Bedroom', '14 BHK', '8 BHK',
       '12 Bedroom', '13 BHK', '18 Bedroom'], dtype=object)
# we can use a simple function to extract the number of rooms from the column "size"
df_final["BHK"] = df_final["size"].apply(lambda x : int(x.split(" ")[0])) # convert no. of rooms to int
df_final = df_final[['location', 'size','BHK', 'total_sqft', 'bath',  'price']] # re-arrange cols
df_final.head()
                   location       size  BHK  total_sqft  bath   price
0  Electronic City Phase II      2 BHK    2        1056   2.0   39.07
1          Chikka Tirupathi  4 Bedroom    4        2600   5.0  120.00
2               Uttarahalli      3 BHK    3        1440   2.0   62.00
3        Lingadheeranahalli      3 BHK    3        1521   3.0   95.00
4                  Kothanur      2 BHK    2        1200   2.0   51.00
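As a side note, the same extraction can be sketched with a vectorised string method instead of `apply`. The sample labels below are copied from the unique values of `size` shown above; the names `sizes` and `bhk` are only illustrative and not part of the original notebook.

```python
import pandas as pd

# a few labels copied from the unique values of "size" shown above
sizes = pd.Series(["2 BHK", "4 Bedroom", "1 RK", "43 Bedroom"])

# capture the leading digits of each label and convert them to integers
bhk = sizes.str.extract(r"^(\d+)", expand=False).astype(int)
print(bhk.tolist())
```

Both approaches rely on every label starting with a number, which holds for the values listed above.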
# let's inspect the number of rooms
df_final["BHK"].unique()
array([ 2,  4,  3,  6,  1,  8,  7,  5, 11,  9, 27, 10, 19, 16, 43, 14, 12,
       13, 18])
  • some houses have as many as 43 bedrooms. It would be interesting to look at the total sqft occupied by houses with more than 10 rooms and see if there is any relationship.
df_final[df_final.BHK > 10]
                            location        size  BHK  total_sqft  bath  price
459                     1 Giri Nagar      11 BHK   11        5000   9.0  360.0
1718       2Electronic City Phase II      27 BHK   27        8000  27.0  230.0
1768              1 Ramamurthy Nagar  11 Bedroom   11        1200  11.0  170.0
3379                  1Hanuman Nagar      19 BHK   19        2000  16.0  490.0
3609   Koramangala Industrial Layout      16 BHK   16       10000  16.0  550.0
3853               1 Annasandrapalya  11 Bedroom   11        1200   6.0  150.0
4684                     Munnekollal  43 Bedroom   43        2400  40.0  660.0
4916                   1Channasandra      14 BHK   14        1250  15.0  125.0
6533                     Mysore Road  12 Bedroom   12        2232   6.0  300.0
7979                   1 Immadihalli      11 BHK   11        6000  12.0  150.0
9935                   1Hoysalanagar      13 BHK   13        5425  13.0  275.0
11559                  1Kasavanhalli  18 Bedroom   18        1200  18.0  200.0
  • it is questionable whether a house with 43 bedrooms would have 40 bathrooms, and whether a total area of 2400 sqft makes sense for such a house
df_final.total_sqft.unique()
array(['1056', '2600', '1440', ..., '1133 - 1384', '774', '4689'],
      dtype=object)
  • some of the values in total_sqft are given as a range, but we need a single value. Let’s take the average of the two ends in such instances
df = df_final.copy()
def is_float(x):
    try:
        float(x)
    except (ValueError, TypeError):
        return False
    return True
df[~df["total_sqft"].apply(is_float)].head(10) # rows whose total_sqft is not a plain number
               location       size  BHK      total_sqft  bath    price
30            Yelahanka      4 BHK    4     2100 - 2850   4.0  186.000
122              Hebbal      4 BHK    4     3067 - 8156   4.0  477.000
137  8th Phase JP Nagar      2 BHK    2     1042 - 1105   2.0   54.005
165            Sarjapur      2 BHK    2     1145 - 1340   2.0   43.490
188            KR Puram      2 BHK    2     1015 - 1540   2.0   56.800
410             Kengeri      1 BHK    1  34.46Sq. Meter   1.0   18.500
549         Hennur Road      2 BHK    2     1195 - 1440   2.0   63.770
648             Arekere  9 Bedroom    9       4125Perch   9.0  265.000
661           Yelahanka      2 BHK    2     1120 - 1145   2.0   48.130
672        Bettahalsoor  4 Bedroom    4     3090 - 5002   4.0  445.000
def convert_sqft_to_num(x):
    tokens = x.split("-")
    if len(tokens) == 2:
        return (  float(tokens[0]) + float(tokens[1])  )/2 # midpoint of a range
    try:
        return float(x)
    except ValueError:
        return None # values with units such as "34.46Sq. Meter" cannot be converted
df1 = df.copy()
df1["total_sqft"] = df1["total_sqft"].apply(convert_sqft_to_num)
df1.head()
                   location       size  BHK  total_sqft  bath   price
0  Electronic City Phase II      2 BHK    2      1056.0   2.0   39.07
1          Chikka Tirupathi  4 Bedroom    4      2600.0   5.0  120.00
2               Uttarahalli      3 BHK    3      1440.0   2.0   62.00
3        Lingadheeranahalli      3 BHK    3      1521.0   3.0   95.00
4                  Kothanur      2 BHK    2      1200.0   2.0   51.00
  • we have handled the total_sqft column
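To make the conversion rule concrete, here is a small self-contained sketch applied to a few of the raw values shown earlier. The helper name `to_sqft` is hypothetical and only mirrors the idea described above: plain numbers are kept, ranges are replaced by their midpoint, and anything else becomes a missing value.

```python
def to_sqft(value):
    """Return a float for plain numbers, the midpoint for 'a - b' ranges, else None."""
    tokens = value.split("-")
    if len(tokens) == 2:
        try:
            return (float(tokens[0]) + float(tokens[1])) / 2
        except ValueError:
            return None
    try:
        return float(value)
    except ValueError:
        return None

# raw values copied from the outputs above
examples = ["1056", "2100 - 2850", "34.46Sq. Meter", "4125Perch"]
converted = [to_sqft(v) for v in examples]
print(converted)
```

Entries with units end up as missing values, so they would still need to be dropped or converted separately if they should not stay in the data.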

Feature Engineering & dimensionality reduction

df2 = df1.copy() # deep copy
df2["price_per_sqft"] = df2["price"]*100000/df2["total_sqft"]
df2 = df2[['location', 'size', 'BHK', 'total_sqft', 'bath', 'price_per_sqft', 'price']] # rearrange columns
df2.head()
                   location       size  BHK  total_sqft  bath  price_per_sqft   price
0  Electronic City Phase II      2 BHK    2      1056.0   2.0     3699.810606   39.07
1          Chikka Tirupathi  4 Bedroom    4      2600.0   5.0     4615.384615  120.00
2               Uttarahalli      3 BHK    3      1440.0   2.0     4305.555556   62.00
3        Lingadheeranahalli      3 BHK    3      1521.0   3.0     6245.890861   95.00
4                  Kothanur      2 BHK    2      1200.0   2.0     4250.000000   51.00
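The factor of 100000 in the formula comes from the unit of the price column: prices are quoted in lakhs, and one lakh is 100,000 rupees. A quick check on the first row of the table (the constant name `LAKH` is just for illustration):

```python
LAKH = 100_000  # 1 lakh = 100,000 rupees

price_lakh = 39.07   # price of the first row, in lakhs
total_sqft = 1056.0  # area of the first row

# price in rupees divided by area gives rupees per square foot
price_per_sqft = price_lakh * LAKH / total_sqft
print(round(price_per_sqft, 2))
```

The result agrees with the `price_per_sqft` value shown for that row.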
  • let’s go to the location column and inspect what’s going on with it
print(f"we have { len(df2.location.unique()) } unique locations")
we have 1304 unique locations
  • we have about 1300 unique locations in the data. One way to deal with categorical data is to use one-hot encoding. However, there are just too many levels for that kind of conversion, and this would bring about the curse of dimensionality!

  • we can avoid the curse of dimensionality by binning the locations with few observations into a single category.

df2["location"] = df2["location"].apply(lambda x: x.strip()) # remove trailing and leading whitespace
location_stats = df2.groupby("location")["location"].count().sort_values(ascending=False)
location_stats # lets see the location counts from highest to lowest
Whitefield                                      535
Sarjapur  Road                                  392
Electronic City                                 304
Kanakpura Road                                  266
Thanisandra                                     236
Yelahanka                                       210
Uttarahalli                                     186
Hebbal                                          176
Marathahalli                                    175
Raja Rajeshwari Nagar                           171
Bannerghatta Road                               152
Hennur Road                                     150
7th Phase JP Nagar                              149
Haralur Road                                    141
Electronic City Phase II                        131
Rajaji Nagar                                    106
Chandapura                                       98
Bellandur                                        96
Hoodi                                            88
KR Puram                                         88
Electronics City Phase 1                         87
Yeshwanthpur                                     85
Begur Road                                       84
Sarjapur                                         81
Kasavanhalli                                     79
Harlur                                           79
Banashankari                                     74
Hormavu                                          74
Kengeri                                          73
Ramamurthy Nagar                                 73
white field,kadugodi                              1
Kanakapura Main Road                              1
Kanakapura  Rod                                   1
Kanakapur main road                               1
Kanakadasa Layout                                 1
Kamdhenu Nagar                                    1
Kalkere Channasandra                              1
Kalhalli                                          1
Kengeri Satellite Town Stage II                   1
Kodanda Reddy Layout                              1
Malimakanapura                                    1
Konappana Agrahara                                1
Mailasandra                                       1
Maheswari Nagar                                   1
Madanayakahalli                                   1
MRCR Layout                                       1
MM Layout                                         1
MEI layout, Bagalgunte                            1
M.G Road                                          1
M C Layout                                        1
Laxminarayana Layout                              1
Lalbagh Road                                      1
Lakshmipura Vidyaanyapura                         1
Lakshminarayanapura, Electronic City Phase 2      1
Lakkasandra Extension                             1
LIC Colony                                        1
Kuvempu Layout                                    1
Kumbhena Agrahara                                 1
Kudlu Village,                                    1
1 Annasandrapalya                                 1
Name: location, Length: 1293, dtype: int64
  • some locations have only one data point, while the largest (Whitefield) has 535 rows
  • we can reason that any location with 10 or fewer observations can be grouped into an “other” location
  • this will help us reduce the dimensionality problem
print(f"As we can observe, {len(location_stats[location_stats<=10])} locations have 10 or fewer observations, so we bin these into one category 'other'")
As we can observe, 1052 locations have 10 or fewer observations, so we bin these into one category 'other'
less_than_10_locs = location_stats[location_stats<=10]
less_than_10_locs # these we can put in a general category called other
BTM 1st Stage                                   10
Basapura                                        10
Sector 1 HSR Layout                             10
Naganathapura                                   10
Kalkere                                         10
Nagadevanahalli                                 10
Nagappa Reddy Layout                            10
Sadashiva Nagar                                 10
Gunjur Palya                                    10
Dairy Circle                                    10
Ganga Nagar                                     10
Dodsworth Layout                                10
1st Block Koramangala                           10
Chandra Layout                                   9
Jakkur Plantation                                9
2nd Phase JP Nagar                               9
Yemlur                                           9
Mathikere                                        9
Medahalli                                        9
Volagerekallahalli                               9
4th Block Koramangala                            9
Vishwanatha Nagenahalli                          9
B Narayanapura                                   9
KUDLU MAIN ROAD                                  9
Ejipura                                          9
Vignana Nagar                                    9
Peenya                                           9
Kaverappa Layout                                 9
Banagiri Nagar                                   9
Gollahalli                                       9
white field,kadugodi                             1
Kanakapura Main Road                             1
Kanakapura  Rod                                  1
Kanakapur main road                              1
Kanakadasa Layout                                1
Kamdhenu Nagar                                   1
Kalkere Channasandra                             1
Kalhalli                                         1
Kengeri Satellite Town Stage II                  1
Kodanda Reddy Layout                             1
Malimakanapura                                   1
Konappana Agrahara                               1
Mailasandra                                      1
Maheswari Nagar                                  1
Madanayakahalli                                  1
MRCR Layout                                      1
MM Layout                                        1
MEI layout, Bagalgunte                           1
M.G Road                                         1
M C Layout                                       1
Laxminarayana Layout                             1
Lalbagh Road                                     1
Lakshmipura Vidyaanyapura                        1
Lakshminarayanapura, Electronic City Phase 2     1
Lakkasandra Extension                            1
LIC Colony                                       1
Kuvempu Layout                                   1
Kumbhena Agrahara                                1
Kudlu Village,                                   1
1 Annasandrapalya                                1
Name: location, Length: 1052, dtype: int64
df2["location"] = df2.location.apply(lambda x: "other" if x in less_than_10_locs else x) # we create other location category
print(f"After the above transformation, the number of locations has been reduced to {len(df2.location.unique())}, which is a much smaller dimension than before")
After the above transformation, the number of locations has been reduced to 242, which is a much smaller dimension than before
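The binning step can also be sketched on a toy example. The location names below are reused from the output above, but the counts are made up for illustration; the threshold of 10 is the one used in this post.

```python
import pandas as pd

# toy data: names reused from above, counts invented for the example
toy = pd.DataFrame(
    {"location": ["Whitefield"] * 12 + ["Hebbal"] * 11 + ["Kalhalli"] + ["Yemlur"] * 9}
)

counts = toy["location"].value_counts()
rare = counts[counts <= 10].index  # locations with 10 or fewer rows

# keep frequent locations, replace rare ones with "other"
toy["location"] = toy["location"].where(~toy["location"].isin(rare), "other")
print(toy["location"].value_counts().to_dict())
```

Here the two rare locations (1 and 9 rows) are merged into a single "other" level with 10 rows, while the frequent ones are left untouched.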

Outlier removal

  • As seen earlier, some column values didn’t make much sense. For example, we had a property with 43 bedrooms but a small total sqft value.
  • Such cases indicate anomalies, which should be taken care of as they would affect our modeling.
  • Let’s investigate the sqft per room
df2[df2.total_sqft/df2.BHK < 300].head()
               location       size  BHK  total_sqft  bath  price_per_sqft  price
9                 other  6 Bedroom    6      1020.0   6.0    36274.509804  370.0
45           HSR Layout  8 Bedroom    8       600.0   9.0    33333.333333  200.0
58        Murugeshpalya  6 Bedroom    6      1407.0   4.0    10660.980810  150.0
68  Devarachikkanahalli  8 Bedroom    8      1350.0   7.0     6296.296296   85.0
70                other  3 Bedroom    3       500.0   3.0    20000.000000  100.0
  • here we can see cases where a property has 8 rooms with less than 300 sqft per room
  • In other words, it is questionable that a house with 8 rooms would fit into 600 sqft (about 55 sqm).
  • this is an error or an anomaly we should remove
print("we have {} of these outliers".format(len(df2[(df2.total_sqft/df2["BHK"])<300])))
we have 744 of these outliers
df3 = df2[~(df2.total_sqft/df2["BHK"]<300)]
df3.shape
(12502, 7)
  • let’s also investigate the price per sqft
df3.columns=df3.columns.str.lower() # change col names to lower case
df3.head()
                   location       size  bhk  total_sqft  bath  price_per_sqft   price
0  Electronic City Phase II      2 BHK    2      1056.0   2.0     3699.810606   39.07
1          Chikka Tirupathi  4 Bedroom    4      2600.0   5.0     4615.384615  120.00
2               Uttarahalli      3 BHK    3      1440.0   2.0     4305.555556   62.00
3        Lingadheeranahalli      3 BHK    3      1521.0   3.0     6245.890861   95.00
4                  Kothanur      2 BHK    2      1200.0   2.0     4250.000000   51.00
df3.price_per_sqft.describe()
count     12456.000000
mean       6308.502826
std        4168.127339
min         267.829813
25%        4210.526316
50%        5294.117647
75%        6916.666667
max      176470.588235
Name: price_per_sqft, dtype: float64
  • we can see that the lowest value is about 267, which might be too low for a property in the Silicon Valley of India
  • the maximum value is also extreme, although possible. We might want to remove such extremes as they might affect the modeling
  • let’s remove values more than 1 standard deviation from the mean
  • we will do this per location, using the mean and standard deviation of each location, since some locations are more expensive than others
# function to do the above
def remove_outliers(df):
    df_out = pd.DataFrame()
    for key, subdf in df.groupby("location"):
        m = np.mean(subdf.price_per_sqft) # mean
        sd = np.std(subdf.price_per_sqft) # standard deviation
        reduced_df = subdf[(subdf.price_per_sqft>(m-sd)) & (subdf.price_per_sqft<=(m+sd))] # keep everything within one std of the location mean
        df_out = pd.concat([df_out, reduced_df], ignore_index=True)
    return df_out    
# apply above fn
df4 = remove_outliers(df3)
df4.shape
(10241, 7)
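For readers who prefer to avoid the explicit loop, the same per-location rule can be sketched with `groupby(...).transform`. This is an alternative formulation on made-up toy data, not the code that produced the numbers above; `ddof=0` is used so that the standard deviation matches what `np.std` computes by default.

```python
import pandas as pd

# toy data: two locations, each with one obviously extreme price per sqft
toy = pd.DataFrame({
    "location": ["A"] * 5 + ["B"] * 5,
    "price_per_sqft": [4000, 4100, 4200, 4300, 9000, 7000, 7100, 7200, 7300, 1000],
})

grouped = toy.groupby("location")["price_per_sqft"]
mean = grouped.transform("mean")                  # location mean, repeated per row
std = grouped.transform(lambda s: s.std(ddof=0))  # population std, like np.std

# keep rows within one standard deviation of their own location's mean
keep = (toy["price_per_sqft"] > mean - std) & (toy["price_per_sqft"] <= mean + std)
filtered = toy[keep]
print(filtered)
```

In this toy example the value 9000 is dropped from location A and 1000 from location B, while the remaining rows are kept.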
  • let’s visualize the price of 2 and 3 bedroom properties against their sqft area to see if we have any interesting observations
def plot_scatter_chart(df, location):
    sns.set() # for better plots
    bhk2 = df[(df.location == location) & (df.bhk==2)]
    bhk3 = df[(df.location == location) & (df.bhk==3)]
    plt.rcParams["figure.figsize"] =(12,10)
    plt.scatter(bhk2.total_sqft, bhk2.price, color="orange", label="2 bedrooms", s=50) # 2 bedrooms
    plt.scatter(bhk3.total_sqft, bhk3.price, marker="+",color="maroon", label="3 bedrooms", s=50) # 3 bedrooms
    plt.xlabel("total sqft area")
    # the three lines below are an assumed completion of the truncated cell
    plt.ylabel("price (lakhs)")
    plt.title(location)
    plt.legend() # show the labels passed to plt.scatter
plot_scatter_chart(df4, "Rajaji Nagar")


  • Around 1700 sqft, some 2 bedroom properties are more expensive than 3 bedroom properties of the same area, which seems unusual. This can be another case of outliers that need to be removed.
  • let’s look at other locations and see if this trend is common
plot_scatter_chart(df4, "Hebbal") 


plot_scatter_chart(df4, "Uttarahalli")


  • we can see that this type of outlier is fairly common.
  • we can write a function to remove these outliers
  • in other words, within a location, if the price per sqft of a 3 bedroom property is less than the mean price per sqft of 2 bedroom properties, we remove that instance
# this fn performs the above objectives
def remove_bhk_outlier(df):
    exclude_indices = np.array([])
    for location, location_df in df.groupby("location"):
        bhk_stats = {} # per-bedroom-count stats for this location
        for bhk, bhk_df in location_df.groupby("bhk"):
            bhk_stats[bhk] = {
                "mean": np.mean(bhk_df.price_per_sqft),
                "std": np.std(bhk_df.price_per_sqft),
                "count": bhk_df.shape[0]
            }
        for bhk, bhk_df in location_df.groupby("bhk"):
            prev_stats = bhk_stats.get(bhk-1) # stats of homes with one bedroom fewer
            if prev_stats and prev_stats["count"]>5:
                exclude_indices = np.append(exclude_indices, bhk_df[bhk_df.price_per_sqft < (prev_stats["mean"])].index.values)
    return df.drop(exclude_indices, axis="index")

df5 = remove_bhk_outlier(df4)
df5.shape
(7329, 7)
  • now let’s see what the function we wrote did!
plot_scatter_chart(df5, "Uttarahalli") # for the prev plot


  • now there is a decent removal of the outliers
# lets also visualize the number of bathrooms
plt.hist(df5.bath, rwidth=0.8)
plt.title("bath room counts")
plt.xlabel("number of baths")


  • we can see that most residential properties have 2 - 5 bathrooms, with few outliers
  • let’s try to remove the bathroom outliers
  • for this we shall use the criterion that a property is an outlier if its number of bathrooms is more than the number of bedrooms plus 2
df5[df5.bath>df5.bhk+2] # some of the bathroom outliers
           location       size  bhk  total_sqft  bath  price_per_sqft   price
1626  Chikkabanavar  4 Bedroom    4      2460.0   7.0     3252.032520    80.0
5238     Nagasandra  4 Bedroom    4      7000.0   8.0     6428.571429   450.0
6711    Thanisandra      3 BHK    3      1806.0   6.0     6423.034330   116.0
8411          other      6 BHK    6     11338.0   9.0     8819.897689  1000.0
  • we can see that sometimes a 4 bedroom property has 7 or 8 bathrooms, which is unusual
df6 = df5[df5.bath<df5.bhk+2] # removed outliers df (note: the strict < also drops rows where bath equals bhk + 2)
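The difference between the stated criterion (more than bedrooms plus 2) and the strict comparison in the filter is easy to overlook, so here is a tiny made-up example. The frame `toy` and the names `strict` and `inclusive` are illustrative only.

```python
import pandas as pd

# made-up bedroom / bathroom combinations
toy = pd.DataFrame({"bhk": [2, 2, 3, 4], "bath": [2, 4, 6, 7]})

strict = toy[toy.bath < toy.bhk + 2]      # comparison used in this post
inclusive = toy[toy.bath <= toy.bhk + 2]  # drops only rows with bath > bhk + 2

print(len(strict), len(inclusive))
```

The row with 2 bedrooms and 4 bathrooms is kept by the inclusive filter but removed by the strict one, so the strict version is slightly more aggressive than the rule as worded.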
  • let’s also drop some unnecessary columns like price_per_sqft and size
df7 = df6[['location', 'bhk', 'total_sqft', 'bath', 'price']]
df7.head()
              location  bhk  total_sqft  bath  price
0  1st Block Jayanagar    4      2850.0   4.0  428.0
1  1st Block Jayanagar    3      1630.0   3.0  194.0
2  1st Block Jayanagar    3      1875.0   2.0  235.0
3  1st Block Jayanagar    3      1200.0   2.0  130.0
4  1st Block Jayanagar    2      1235.0   2.0  148.0

Inferential statistics

Normal distribution

  • A normal distribution is a bell-shaped probability density function (pdf) that is symmetric about the mean, so that values near the mean occur more frequently than values far from it.
  • we check the skewness of the target variable by plotting its distribution and seeing on which side (left or right) the long tail lies
plt.rcParams['figure.figsize'] = (11, 9)
plt.title('Distribution of Target Column')


  • we can see that the price (the target variable) is skewed to the right
  • the price is not normally distributed because a few very expensive properties (outliers) stretch the distribution to the right
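Skewness can also be checked numerically rather than only by eye. The sketch below uses synthetic log-normal data as a stand-in for prices (it is not the housing data) and relies on pandas' `Series.skew`; a positive value indicates a right-skewed distribution, which typically also shows up as a mean larger than the median.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# synthetic stand-in for prices: log-normal data has a long right tail
prices = pd.Series(rng.lognormal(mean=4.0, sigma=0.6, size=5000))

print(f"skewness: {prices.skew():.2f}")  # > 0 means right-skewed
print(f"mean: {prices.mean():.1f}, median: {prices.median():.1f}")
```

Running the same two lines on `df7["price"]` would give a quick numeric confirmation of the plot.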

Sample Mean and population Mean

# let's randomly sample the price of 500 houses and compare this to the population mean
samples = np.random.choice(a=df7["price"],size=500)
population_mean = np.mean(df7["price"])

print(f"population mean is: {round(population_mean,3)} \nsample mean is: {round(np.mean(samples),3)}")

population mean is: 96.506 
sample mean is: 101.852
  • The sample mean is usually not exactly the same as the population mean. In general this difference can be caused by many factors, including poor survey design, biased sampling methods and the randomness inherent in drawing a sample from a population. Here, where we sample uniformly at random from the data itself, only the last factor applies.
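How far a sample mean typically lands from the population mean is described by the standard error, the population standard deviation divided by the square root of the sample size. The following sketch checks this on synthetic data (a log-normal stand-in, not the housing prices); all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# synthetic population, standing in for the price column
population = rng.lognormal(mean=4.0, sigma=0.6, size=7000)

sample_size = 500
# draw many samples (with replacement) and record each sample mean
sample_means = [rng.choice(population, size=sample_size).mean() for _ in range(2000)]

observed_spread = np.std(sample_means)
expected_spread = population.std() / np.sqrt(sample_size)  # standard error of the mean
print(observed_spread, expected_spread)
```

The two printed numbers should be close to each other, which is the quantity the margin of error in the next section is built on.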

Confidence interval

sample_size = 1000
samples = np.random.choice(a=df7["price"],size=sample_size) # let's get a huge sample size

sample_mean = np.mean(samples)

# get critical z-value
# note: q=0.95 leaves 5% in each tail, which gives a two-sided 90% interval;
# a two-sided 95% interval needs q=0.975 (z of about 1.96)
z_critical = stats.norm.ppf(q=0.95) # 95th percentile

pop_std = np.std(df7["price"]) # pop standard dev

# checking the margin of error
margin_of_error = z_critical * (pop_std/math.sqrt(sample_size)) 

# defining our confidence interval
confidence_interval = (sample_mean - margin_of_error, sample_mean + margin_of_error) # 90% confidence interval

print(f"the critical z value is {z_critical} \nthe 90% CI is {confidence_interval} \nthe true population mean is {population_mean}")
the critical z value is 1.6448536269514722 
the 90% CI is (88.57412780284928, 97.69427219715071) 
the true population mean is 96.50611846641833
  • the true mean is contained within this interval
  • a 95% confidence level means that if we take many samples and build a confidence interval from each of them, about 95% of those intervals will contain the true population mean.
  • we can also visualize several CIs and how they capture the mean
sample_size = 500

intervals = []
sample_means = []

for _ in range(25):
    sample = np.random.choice(a= df7['price'], size = sample_size)
    sample_mean = sample.mean()
    sample_means.append(sample_mean)

    # Get the z-critical value for a two-sided 95% interval
    z_critical = stats.norm.ppf(q = 0.975)

    # Get the population standard deviation
    pop_std = df7['price'].std()

    margin_of_error = z_critical * (pop_std/math.sqrt(sample_size))

    confidence_interval = (sample_mean - margin_of_error,
                           sample_mean + margin_of_error)
    intervals.append(confidence_interval)

plt.figure(figsize=(13, 9))
plt.errorbar(x=np.arange(0.1, 25, 1),
             y=sample_means,
             yerr=[(top-bot)/2 for bot,top in intervals], # half-width of each interval
             fmt='o')

plt.hlines(xmin=0, xmax=25,
           y=population_mean, # the true mean
           linewidth=2.0,
           color="red")
plt.title('Confidence Intervals for 25 Trials', fontsize = 20)


  • Most of the blue intervals (centred on the sample means) overlap the red line (the true mean). On average about 5% of such intervals, roughly 1 in 25 trials, are expected to miss it.
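With only 25 trials the coverage is hard to judge by eye, so it can help to simulate many more intervals and count how often they contain the true mean. The sketch below does this on synthetic data (not the housing prices) and takes the critical value from the standard library's `NormalDist`; every name in it is illustrative.

```python
import math
from statistics import NormalDist

import numpy as np

rng = np.random.default_rng(2)

# synthetic population standing in for the price column
population = rng.lognormal(mean=4.0, sigma=0.6, size=7000)
true_mean = population.mean()
pop_std = population.std()

sample_size, n_trials = 500, 2000
z_critical = NormalDist().inv_cdf(0.975)  # about 1.96 for a two-sided 95% interval
margin = z_critical * pop_std / math.sqrt(sample_size)

hits = 0
for _ in range(n_trials):
    sample_mean = rng.choice(population, size=sample_size).mean()
    # count the interval as a hit if it contains the true mean
    if sample_mean - margin <= true_mean <= sample_mean + margin:
        hits += 1

coverage = hits / n_trials
print(f"coverage: {coverage:.3f}")
```

The printed coverage should land near 0.95. Using `inv_cdf(0.95)` instead would give a critical value of about 1.645 and a coverage near 0.90.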

Hypothesis testing

$\alpha = 0.05$

$H_0: \mu_0 = \mu_1$, where $\mu_0$ is the mean price for a given number of rooms and $\mu_1$ is the overall mean price (equal means in price for all rooms)

$H_1: \mu_0 \neq \mu_1$

z_statistic, p_value = ztest(x1 = df7[df7["bhk"] == 1 ]['price'], value = df7['price'].mean())
print(f"The p-value is {p_value} and the Z-statistic is {z_statistic}")
The p-value is 0.0 and the Z-statistic is -50.83678679662534
z_statistic, p_value = ztest(x1 = df7[df7["bhk"] == 2 ]['price'], value = df7['price'].mean())
print(f"The p-value is {p_value} and the Z-statistic is {z_statistic}")
The p-value is 0.0 and the Z-statistic is -78.68901645622611
z_statistic, p_value = ztest(x1 = df7[df7["bhk"] == 3 ]['price'], value = df7['price'].mean())
print(f"The p-value is {p_value} and the Z-statistic is {z_statistic}")
The p-value is 1.5976796313461434e-67 and the Z-statistic is 17.362101110701
z_statistic, p_value = ztest(x1 = df7[df7["bhk"] == 4 ]['price'], value = df7['price'].mean())
print(f"The p-value is {p_value} and the Z-statistic is {z_statistic}")
The p-value is 2.7801540796510378e-92 and the Z-statistic is 20.37512315015866
z_statistic, p_value = ztest(x1 = df7[df7["bhk"] == 5 ]['price'], value = df7['price'].mean())
print(f"The p-value is {p_value} and the Z-statistic is {z_statistic}")
The p-value is 1.6776932582667467e-17 and the Z-statistic is 8.514182704074315
z_statistic, p_value = ztest(x1 = df7[df7["bhk"] == 6 ]['price'], value = df7['price'].mean())
print(f"The p-value is {p_value} and the Z-statistic is {z_statistic}")
The p-value is 7.790588657266502e-13 and the Z-statistic is 7.16478999563192
z_statistic, p_value = ztest(x1 = df7[df7["bhk"] == 7 ]['price'], value = df7['price'].mean())
print(f"The p-value is {p_value} and the Z-statistic is {z_statistic}")
The p-value is 0.17705889634431016 and the Z-statistic is 1.3498662280065419
z_statistic, p_value = ztest(x1 = df7[df7["bhk"] == 8 ]['price'], value = df7['price'].mean())
print(f"The p-value is {p_value} and the Z-statistic is {z_statistic}")
The p-value is 0.000813814545059989 and the Z-statistic is 3.348052972037006
z_statistic, p_value = ztest(x1 = df7[df7["bhk"] == 9 ]['price'], value = df7['price'].mean())
print(f"The p-value is {p_value} and the Z-statistic is {z_statistic}")
The p-value is 0.005881467511852069 and the Z-statistic is 2.754317533639393
z_statistic, p_value = ztest(x1 = df7[df7["bhk"] == 10 ]['price'], value = df7['price'].mean())
print(f"The p-value is {p_value} and the Z-statistic is {z_statistic}")
The p-value is nan and the Z-statistic is nan
z_statistic, p_value = ztest(x1 = df7[df7["bhk"] == 11]['price'], value = df7['price'].mean())
print(f"The p-value is {p_value} and the Z-statistic is {z_statistic}")
The p-value is 0.13117985558847922 and the Z-statistic is 1.5094655384150635
z_statistic, p_value = ztest(x1 = df7[df7["bhk"] == 13 ]['price'], value = df7['price'].mean())
print(f"The p-value is {p_value} and the Z-statistic is {z_statistic}")
The p-value is nan and the Z-statistic is nan
z_statistic, p_value = ztest(x1 = df7[df7["bhk"] == 16 ]['price'], value = df7['price'].mean())
print(f"The p-value is {p_value} and the Z-statistic is {z_statistic}")
The p-value is nan and the Z-statistic is nan
  • a p-value of at most $\alpha = 0.05$ means that we have enough evidence to reject the null hypothesis of equal mean prices in favour of the alternative hypothesis
  • interestingly, we find that the mean price for most room counts differs from the overall mean
  • however, the 7 and 11 bedroom properties had p-values greater than 0.05 (about 0.18 and 0.13), so for these we do not reject the null hypothesis; the tests for 10, 13 and 16 bedrooms returned nan, so no conclusion can be drawn for them
  • the conclusion is that the number of rooms has an effect on the price
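To see what a one-sample z-test computes, here is a hand-rolled sketch using only NumPy and the standard library. It is not the statsmodels implementation, and the prices and reference mean are made up; it also shows one way a nan can arise, namely when a group holds a single observation and no standard deviation can be estimated (which may be what happened for the nan results above).

```python
import math

import numpy as np

def one_sample_ztest(sample, value):
    """Two-sided one-sample z-test; returns (z_statistic, p_value)."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    if n < 2:
        return float("nan"), float("nan")  # no variance estimate from one point
    std_err = sample.std(ddof=1) / math.sqrt(n)
    z = (sample.mean() - value) / std_err
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability of N(0, 1)
    return z, p

# made-up prices (in lakhs) for one room category, tested against a made-up overall mean
z, p = one_sample_ztest([40, 45, 50, 55, 60], value=96.5)
print(z, p)

# a category with a single listing cannot be tested
print(one_sample_ztest([120.0], value=96.5))
```

Looping such a function over `df7["bhk"].unique()` would replace the repeated cells above with a few lines.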

Predictive Modeling

df7.head()
              location  bhk  total_sqft  bath  price
0  1st Block Jayanagar    4      2850.0   4.0  428.0
1  1st Block Jayanagar    3      1630.0   3.0  194.0
2  1st Block Jayanagar    3      1875.0   2.0  235.0
3  1st Block Jayanagar    3      1200.0   2.0  130.0
4  1st Block Jayanagar    2      1235.0   2.0  148.0
  • machine learning algorithms don’t work directly with text data, so we need to convert the location variable into a vector using one-hot encoding
dummies = pd.get_dummies(df7.location, drop_first=True) # one indicator column per location; the first is dropped as the baseline
df8 = pd.concat(objs=[df7, dummies], axis="columns")
df8.drop(columns=["location"], inplace=True) # drop the original text column
df8.head()
bhk total_sqft bath price 1st Phase JP Nagar 2nd Phase Judicial Layout 2nd Stage Nagarbhavi 5th Block Hbr Layout 5th Phase JP Nagar 6th Phase JP Nagar ... Vishveshwarya Layout Vishwapriya Layout Vittasandra Whitefield Yelachenahalli Yelahanka Yelahanka New Town Yelenahalli Yeshwanthpur other
0 4 2850.0 4.0 428.0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 3 1630.0 3.0 194.0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 3 1875.0 2.0 235.0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 3 1200.0 2.0 130.0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 2 1235.0 2.0 148.0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 245 columns
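A toy example may make the effect of `drop_first=True` clearer. The three location names are reused from the data, but the frame itself is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"location": ["Hebbal", "Whitefield", "other", "Hebbal"]})

# one indicator column per category, minus the first (alphabetically) as baseline
dummies = pd.get_dummies(toy["location"], drop_first=True)

print(dummies.columns.tolist())
print(dummies.astype(int).values.tolist())
```

The dropped category ("Hebbal" here) has no column of its own and is represented by a row of zeros, which is why the encoded frame has one column fewer than the number of locations.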

X = df8.drop(columns="price") # features
y = df8["price"] # target
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.20, random_state=67)
lin_reg = LinearRegression().fit(X_train, y_train) # fit the model on the training split
# note: .score() returns the coefficient of determination (R^2), not a classification accuracy
print(f"Train accuracy: {lin_reg.score(X_train, y_train)} \nTest accuracy: {lin_reg.score(X_test,y_test)}")
Train accuracy: 0.8558407211598535 
Test accuracy: 0.8334186940717209
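The numbers reported above are R² scores: the share of the variance in the target that the model explains. A minimal sketch of the formula, written with NumPy only (the helper name `r_squared` and the predictions are made up for illustration):

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1 - ss_res / ss_tot

y_true = [39.07, 120.0, 62.0, 95.0, 51.0]  # prices from the first table
y_pred = [45.0, 110.0, 60.0, 100.0, 50.0]  # made-up predictions
print(round(r_squared(y_true, y_pred), 3))
```

A perfect model scores 1, and a model that always predicts the mean of the target scores 0.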
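  • let’s try cross validation to see how the model performs on several random splits
from sklearn.model_selection import ShuffleSplit, cross_val_score # current location of both helpers
split = ShuffleSplit(n_splits=5, test_size=0.2, random_state=2) # 5 random 80/20 splits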

cross_val_score(estimator=LinearRegression(), X=X, y=y, cv = split)
array([0.86737231, 0.85817913, 0.86058531, 0.79396905, 0.87042283])
  • the model gives a stable performance across the splits
  • can we improve on these results?
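The idea behind shuffle-split cross validation, repeated random train/test splits with a fresh fit each time, can be sketched without scikit-learn. The example below uses synthetic data and an ordinary least squares fit via `np.linalg.lstsq`; it illustrates the procedure and is not the code that produced the scores above.

```python
import numpy as np

rng = np.random.default_rng(2)

# synthetic regression data: price depends linearly on area plus noise
n = 500
area = rng.uniform(500, 3000, size=n)
price = 0.05 * area + rng.normal(0, 10, size=n)
X_toy = np.column_stack([np.ones(n), area])  # intercept column + one feature

scores = []
for _ in range(5):  # 5 random splits
    idx = rng.permutation(n)
    test_idx, train_idx = idx[: n // 5], idx[n // 5:]  # hold out 20%
    coef, *_ = np.linalg.lstsq(X_toy[train_idx], price[train_idx], rcond=None)
    pred = X_toy[test_idx] @ coef
    ss_res = np.sum((price[test_idx] - pred) ** 2)
    ss_tot = np.sum((price[test_idx] - price[test_idx].mean()) ** 2)
    scores.append(1 - ss_res / ss_tot)  # R^2 on the held-out part

print(np.round(scores, 3))
```

If the scores are similar across splits, as in the output above, the fit does not depend strongly on which rows happen to land in the test set.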
# now let's build a function to predict the price
def predict_price(location, sqft, bath, bhk):
    x = np.zeros(len(X.columns))
    x[0] = bhk
    x[1] = sqft
    x[2] = bath

    # look up the dummy column of the location; np.where returns an empty array
    # when the name is not a column (unknown location or the dropped baseline category)
    matches = np.where(X.columns==location)[0]
    if len(matches) > 0:
        x[matches[0]] = 1

    return lin_reg.predict([x])[0]
predict_price('Indira Nagar',1000, 2, 2) # predicting the price of a home in a given location
  • export model for deployment
import pickle
import json
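with open("bangalore_real_estate_estimator.pickle", mode="wb") as f:
    pickle.dump(lin_reg, f) # serialize the fitted model
columns = {
    "data_columns": [col.lower() for col in X.columns]
}

with open("columns.json", mode="w") as f:
    json.dump(columns, f) # store the column order the model expects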
  • this model is now ready for production!
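Before relying on the exported files, it is worth checking that they load back correctly. The sketch below shows a round trip with `pickle` and `json` in a temporary directory; the dictionary `model_stub` is a hypothetical stand-in for the fitted model, and the file names are illustrative.

```python
import json
import os
import pickle
import tempfile

# stand-ins for the fitted model and the column list (both hypothetical)
model_stub = {"coef": [1.0, 2.0]}
columns = {"data_columns": ["bhk", "total_sqft", "bath"]}

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "model.pickle")
    columns_path = os.path.join(tmp, "columns.json")

    # write both artifacts
    with open(model_path, "wb") as f:
        pickle.dump(model_stub, f)
    with open(columns_path, "w") as f:
        json.dump(columns, f)

    # read them back, as a serving process would
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)
    with open(columns_path) as f:
        loaded_columns = json.load(f)

print(loaded_model == model_stub, loaded_columns == columns)
```

A pickled scikit-learn model should generally be loaded with the same library version that created it, and pickle files should only be loaded from trusted sources.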


  • [1]

  • [2] house- prices- in-bengaluru/

  • [3], Wasim, Dash, Mihir, and Sharma, Kshitiz

  • [4] Trends in Residential Market in Bangalore, India. doi:10.13140/RG.2.2.33967.89768.