Machine Learning: Ordinary least squares using the statsmodels API

15 minute read

Regression: Building a predictive model

Linear regression, or the ordinary least squares method, is a supervised machine learning technique used to predict a continuous-valued output. Every time we fit a multiple linear regression model to our data, we compute a set of weights or coefficients, $\beta_0,\beta_1,\beta_2, \beta_3…\beta_n$, where $\beta_0$ is the intercept or constant, plus an error term $\epsilon$ (sometimes called the bias or noise term).
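Written out in full, the model takes the familiar form

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \epsilon$$

and ordinary least squares picks the $\beta$ values that minimise the sum of squared differences between the predicted and observed $y$.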

In this post, I will show you how to build a predictive model using the statsmodels API. Before we get into that, let’s talk about the data we will be using. Two datasets will be used for this predictive model, both from the County Health Rankings: the Years of Potential Life Lost (YPLL) data and the Additional Measures data, cleaned-up versions of which you can find on my GitHub.

Years of potential life lost, or YPLL, is an estimate of the average years a person would have lived had he or she not died prematurely. It is, therefore, a measure of premature mortality. As an alternative to death rates, it is a method that gives more weight to deaths that occur among younger people.

To calculate the years of potential life lost, an upper reference age is determined; it should correspond roughly to the life expectancy of the population under study. In this data, the reference age is 75. So if a person dies at 65, their YPLL is calculated as 75 - 65 = 10, and so on.
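As a minimal illustration (the dataset already contains the computed rates, so this little helper is purely for intuition):

REFERENCE_AGE = 75 # upper reference age used in this dataset

def years_of_potential_life_lost(age_at_death):
    """YPLL for a single death; zero for deaths at or above the reference age."""
    return max(REFERENCE_AGE - age_at_death, 0)

print(years_of_potential_life_lost(65)) # 10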

Now, let’s get into the data. The first step of any data science project is to clean the data. This may include handling missing values, normalising the data, performing one-hot encoding for categorical variables, and so on. Machine learning models do not work if our data have missing values, so it’s very important that we check for them.
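A quick way to run that check once a frame is loaded (a minimal sketch; it simply previews the ypll.csv file that is loaded properly below):

import pandas as pd

ypll = pd.read_csv('ypll.csv') # load the data
print(ypll.isnull().sum()) # count of missing values per column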

Python code block to import the data analysis packages:

# import data analysis libraries
import pandas as pd
import numpy as np

# import plotting library matplotlib
import matplotlib.pyplot as plt


# import OLS and add_constant from statsmodels / predictive modeling
from statsmodels.api import OLS, add_constant

# turn off future warnings
import warnings
warnings.filterwarnings(action='ignore')


# normalising packages
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler, MaxAbsScaler
ypll = pd.read_csv('ypll.csv') # load data
ypll.head() # view first 5 rows
   FIPS    State   County Unreliable  YPLL Rate
0  1000  Alabama      NaN        NaN    10189.0
1  1001  Alabama  Autauga        NaN     9967.0
2  1003  Alabama  Baldwin        NaN     8322.0
3  1005  Alabama  Barbour        NaN     9559.0
4  1007  Alabama     Bibb        NaN    13283.0
add_measures = pd.read_csv('additional_measures_cleaned.csv') # import and load data
add_measures.head() # view first 5 rows
FIPS State County Population < 18 65 and over African American Female Rural %Diabetes HIV rate Physical Inactivity mental health provider rate median household income % high housing costs % Free lunch % child Illiteracy % Drive Alone
0 1000 Alabama NaN 4708708 23.9 13.8 26.1 51.6 44.6 12 NaN 31 20 42586.0 30 51.0 14.8 84
1 1001 Alabama Autauga 50756 27.8 11.6 18.4 51.4 44.8 11 170.0 33 2 51622.0 25 29.0 12.7 86
2 1003 Alabama Baldwin 179878 23.1 17.0 10.0 51.0 54.2 10 176.0 25 17 51957.0 29 29.0 10.6 83
3 1005 Alabama Barbour 29737 22.3 13.8 46.6 46.8 71.5 14 331.0 35 7 30896.0 36 65.0 23.2 82
4 1007 Alabama Bibb 21587 23.3 13.5 22.3 48.0 81.5 11 90.0 37 0 41076.0 18 48.0 17.5 83
ypll.info() # check dataset information
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3192 entries, 0 to 3191
Data columns (total 5 columns):
FIPS          3192 non-null int64
State         3192 non-null object
County        3141 non-null object
Unreliable    196 non-null object
YPLL Rate     3097 non-null float64
dtypes: float64(1), int64(1), object(3)
memory usage: 124.8+ KB
# drop counties where data was deemed unreliable, marked by x
ypll.drop(index=ypll[ypll.Unreliable.isin(['x'])].index, inplace=True) 
add_measures.info() # check dataset information
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3192 entries, 0 to 3191
Data columns (total 18 columns):
FIPS                           3192 non-null int64
State                          3192 non-null object
County                         3141 non-null object
Population                     3192 non-null int64
< 18                           3192 non-null float64
65 and over                    3192 non-null float64
African American               3192 non-null float64
Female                         3192 non-null float64
Rural                          3191 non-null float64
%Diabetes                      3192 non-null int64
HIV rate                       2230 non-null float64
Physical Inactivity            3192 non-null int64
mental health provider rate    3192 non-null int64
median household income        3191 non-null float64
% high housing costs           3192 non-null int64
% Free lunch                   3173 non-null float64
% child Illiteracy             3186 non-null float64
% Drive Alone                  3192 non-null int64
dtypes: float64(9), int64(7), object(2)
memory usage: 449.0+ KB
# drop rows where column % child illiteracy has missing values
add_measures.drop(index=add_measures[add_measures['% child Illiteracy'].isnull()].index, inplace=True) 
# drop rows where column County has missing values
add_measures.drop(index=add_measures[add_measures['County'].isnull()].index, inplace=True)
ypll.columns # view list of columns
Index(['FIPS', 'State', 'County', 'Unreliable', 'YPLL Rate'], dtype='object')
cols = [
    'FIPS',
#  'State',
 'County',
 'Population',
 '< 18',
 '65 and over',
 'African American',
 'Female',
 'Rural',
 '%Diabetes',
 'HIV rate',
 'Physical Inactivity',
 'mental health provider rate',
 'median household income',
 '% high housing costs',
 '% Free lunch',
 '% child Illiteracy',
 '% Drive Alone'
] # list of columns or attributes

left = add_measures[cols] # create left dataframe
left.head() # view the first five rows
FIPS County Population < 18 65 and over African American Female Rural %Diabetes HIV rate Physical Inactivity mental health provider rate median household income % high housing costs % Free lunch % child Illiteracy % Drive Alone
1 1001 Autauga 50756 27.8 11.6 18.4 51.4 44.8 11 170.0 33 2 51622.0 25 29.0 12.7 86
2 1003 Baldwin 179878 23.1 17.0 10.0 51.0 54.2 10 176.0 25 17 51957.0 29 29.0 10.6 83
3 1005 Barbour 29737 22.3 13.8 46.6 46.8 71.5 14 331.0 35 7 30896.0 36 65.0 23.2 82
4 1007 Bibb 21587 23.3 13.5 22.3 48.0 81.5 11 90.0 37 0 41076.0 18 48.0 17.5 83
5 1009 Blount 58345 24.2 14.7 2.1 50.2 91.0 11 66.0 35 2 46086.0 21 37.0 13.9 80
# drop rows where the County is missing values
ypll.drop(index=ypll[ypll.County.isnull()].index, inplace=True)
len(ypll) # check length of dataframe
2945
df = pd.merge(left, ypll, on=['County', 'FIPS']) # merge the two data sets into one dataframe
df.head() # view the first five rows of the data
FIPS County Population < 18 65 and over African American Female Rural %Diabetes HIV rate Physical Inactivity mental health provider rate median household income % high housing costs % Free lunch % child Illiteracy % Drive Alone State Unreliable YPLL Rate
0 1001 Autauga 50756 27.8 11.6 18.4 51.4 44.8 11 170.0 33 2 51622.0 25 29.0 12.7 86 Alabama NaN 9967.0
1 1003 Baldwin 179878 23.1 17.0 10.0 51.0 54.2 10 176.0 25 17 51957.0 29 29.0 10.6 83 Alabama NaN 8322.0
2 1005 Barbour 29737 22.3 13.8 46.6 46.8 71.5 14 331.0 35 7 30896.0 36 65.0 23.2 82 Alabama NaN 9559.0
3 1007 Bibb 21587 23.3 13.5 22.3 48.0 81.5 11 90.0 37 0 41076.0 18 48.0 17.5 83 Alabama NaN 13283.0
4 1009 Blount 58345 24.2 14.7 2.1 50.2 91.0 11 66.0 35 2 46086.0 21 37.0 13.9 80 Alabama NaN 8475.0
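Note that pd.merge performs an inner join by default, so only (County, FIPS) pairs present in both frames survive. A toy example of that behaviour (made-up frames, purely illustrative):

a = pd.DataFrame({'FIPS': [1, 2], 'County': ['A', 'B'], 'x': [10, 20]})
b = pd.DataFrame({'FIPS': [2, 3], 'County': ['B', 'C'], 'y': [30, 40]})
print(pd.merge(a, b, on=['County', 'FIPS'])) # only the row for County 'B' remains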
df['YPLL Rate'].fillna(value=df['YPLL Rate'].mean(), inplace=True) # fill missing values with the mean of the column
df.drop(labels=['Unreliable'], inplace=True, axis=1) # drop column of NaN's (not necessary for our analysis)
df.info() # check information of data set
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2941 entries, 0 to 2940
Data columns (total 19 columns):
FIPS                           2941 non-null int64
County                         2941 non-null object
Population                     2941 non-null int64
< 18                           2941 non-null float64
65 and over                    2941 non-null float64
African American               2941 non-null float64
Female                         2941 non-null float64
Rural                          2941 non-null float64
%Diabetes                      2941 non-null int64
HIV rate                       2220 non-null float64
Physical Inactivity            2941 non-null int64
mental health provider rate    2941 non-null int64
median household income        2941 non-null float64
% high housing costs           2941 non-null int64
% Free lunch                   2926 non-null float64
% child Illiteracy             2941 non-null float64
% Drive Alone                  2941 non-null int64
State                          2941 non-null object
YPLL Rate                      2941 non-null float64
dtypes: float64(10), int64(7), object(2)
memory usage: 459.5+ KB
  • HIV rate and % Free lunch still have missing values; let’s impute them with their column means
df['% Free lunch'].fillna(value=df['% Free lunch'].mean(), inplace=True) # fill in missing values for these columns
df['HIV rate'].fillna(value=df['HIV rate'].mean(), inplace=True)
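Mean imputation is simple but sensitive to outliers; a median-based alternative (just a sketch, not used in the rest of this post) would be:

df['HIV rate'].fillna(value=df['HIV rate'].median(), inplace=True) # the median is more robust to skewed columns like HIV rate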

Visualising correlations

import seaborn as sns # another visualisation package for pretty plots
sns.set()
fig = plt.figure(figsize=(10,15)) # create figure object
plt.tight_layout()
fig.suptitle('Visualising correlations', y=0.9, fontweight=1000) # set super title
subplot = fig.add_subplot(3,1,1) # add subplot with 3 rows and 1 column, choose axis 1

subplot.scatter(df['YPLL Rate'], df['%Diabetes'], color='orange') # make scatter plot
subplot.set_xlabel('ypll rate') # set x label
subplot.set_ylabel('% diabetes') # y label

subplot = fig.add_subplot(3,1,2) # choose axis 2
subplot.scatter(df['YPLL Rate'], df['< 18'], color='red') # scatter plot
subplot.set_xlabel('ypll rate')
subplot.set_ylabel('% < 18')


subplot = fig.add_subplot(3,1,3) # choose axis 3
subplot.scatter(df['YPLL Rate'], df['median household income'], color='green') # scatter plot
subplot.set_xlabel('ypll rate')
subplot.set_ylabel('median household income'); # set y label

[Figure: scatter plots of %Diabetes, % < 18 and median household income against the YPLL rate]

Visualising the other variables vs. YPLL

plot=[
 'FIPS',
 'Population',
#  '< 18',
 '65 and over',
 'African American',
 'Female',
 'Rural',
#  '%Diabetes',
 'HIV rate',
 'Physical Inactivity',
 'mental health provider rate',
#  'median household income',
 '% high housing costs',
 '% Free lunch',
 '% child Illiteracy',
 '% Drive Alone',
 'YPLL Rate'
] # columns to plot
fig = plt.figure(figsize=(15,35)) # create figure object
# plt.tight_layout()
fig.suptitle('Visualising correlations', y=0.9, fontweight=1000) # set super title

for i in range(len(plot)): # plot target against different attributes 
    
    subplot = fig.add_subplot(len(plot),3,i+1)

    subplot.scatter(df['YPLL Rate'], df[plot[i]], color='green')
    subplot.set_xlabel('ypll rate')
    subplot.set_ylabel(plot[i])

[Figure: grid of scatter plots of each attribute vs. the YPLL rate]

  • some variables are not strongly correlated with the target or output, e.g. FIPS, HIV rate and Population
  • other variables, like %Diabetes and % Free lunch, have a good correlation with the target variable. At this point we may start thinking that free lunch should no longer be on offer. But perhaps there is a hidden factor at play?
  • you can check the Pearson correlation of each variable with the target in the code output of the next cell.
# correlation matrix
cor_matrix = df.corr() # make a correlation matrix using Pearson correlation
cor_matrix['YPLL Rate'].sort_values(ascending=False) # view correlations, sorted, on a scale of -1 to 1
YPLL Rate                      1.000000
% Free lunch                   0.694590
%Diabetes                      0.662009
Physical Inactivity            0.645538
% child Illiteracy             0.461458
African American               0.436223
Rural                          0.286216
HIV rate                       0.175228
< 18                           0.127011
Female                         0.116916
65 and over                    0.071458
% Drive Alone                  0.031896
FIPS                          -0.062884
% high housing costs          -0.106475
Population                    -0.162204
mental health provider rate   -0.177113
median household income       -0.634813
Name: YPLL Rate, dtype: float64
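For a bird’s-eye view of all pairwise correlations at once, a heatmap works well (a small sketch using the df, sns and plt objects defined above):

fig, ax = plt.subplots(figsize=(12, 10)) # create figure and axis
sns.heatmap(df.corr(), cmap='coolwarm', center=0, ax=ax) # pairwise Pearson correlations
plt.show()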

YPLL vs. % Diabetes regression

# define independent and dependent variables
X = df[['%Diabetes']]
y = df['YPLL Rate']

# add the bias/constant term, then fit with OLS (the statsmodels analogue of sklearn's LinearRegression) and show the summary
model = OLS(y, add_constant(X)).fit()
model.summary()
OLS Regression Results
Dep. Variable: YPLL Rate R-squared: 0.438
Model: OLS Adj. R-squared: 0.438
Method: Least Squares F-statistic: 2293.
Date: Wed, 21 Aug 2019 Prob (F-statistic): 0.00
Time: 12:50:10 Log-Likelihood: -26260.
No. Observations: 2941 AIC: 5.252e+04
Df Residuals: 2939 BIC: 5.254e+04
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 809.4571 163.462 4.952 0.000 488.946 1129.969
%Diabetes 767.1553 16.021 47.884 0.000 735.742 798.569
Omnibus: 1081.131 Durbin-Watson: 1.545
Prob(Omnibus): 0.000 Jarque-Bera (JB): 7964.973
Skew: 1.552 Prob(JB): 0.00
Kurtosis: 10.441 Cond. No. 50.0



Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Interpreting the math

  • let’s interpret the output and check the statistical significance, to make sure nothing happened by chance and that our regression is meaningful!

  • Prob(F-statistic) is close to 0, which is less than 0.05 (or 0.01), so we have significance and can safely interpret the rest of the output!

  • for this model the line of best fit is represented by y = mx + constant, i.e. YPLL = 767.16 $\times$ $\%$Diabetes + 809.46

  • To check how well this line/model performs, look at the R-squared value of 0.438. This means that 43.8$\%$ of the variation in y (YPLL rate) can be explained by changes in x ($\%$Diabetes)

what does this all mean?

  • Simply put, it means that without knowing the YPLL of a community, we can simply collect data about the percentage of people affected by diabetes and reconstruct 43.8$\%$ of the variation in the YPLL rate
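As a usage example, the fitted model can score a hypothetical county (the 12$\%$ diabetes figure below is invented purely for illustration):

# predict the YPLL rate for a hypothetical county with 12% diabetes
# the leading 1 is the constant column that add_constant created
print(model.predict([[1, 12]])) # roughly 809.46 + 767.16 * 12 ≈ 10015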

visualize the model built above

# print coefficients
print(model.params.index)
model.params.values
Index(['const', '%Diabetes'], dtype='object')

array([809.45713922, 767.15527376])
fig = plt.figure(figsize=(10,8))
fig.suptitle('Plot Showing how model fits data', fontweight=1000, y=0.92)
subplot =fig.add_subplot(111)
subplot.scatter(X,y, color='darkblue', label='scatter plot')
subplot.set_xlabel('%Diabetes')
subplot.set_ylabel('YPLL Rate')

# y = mx + c
Y = model.params.values[1] * X['%Diabetes'] + model.params.values[0]
subplot.plot(X['%Diabetes'], Y, color='darkorange', label='models best fit')
subplot.legend(ncol=3);

[Figure: scatter of %Diabetes vs. YPLL rate with the model's best-fit line]

Running the regressions for < 18 and median household income

YPLL vs. < 18

# define new X and y
X = df['< 18']
y = df['YPLL Rate']
# fit and show summary
model = OLS(y, add_constant(X)).fit()
model.summary() 
OLS Regression Results
Dep. Variable: YPLL Rate R-squared: 0.016
Model: OLS Adj. R-squared: 0.016
Method: Least Squares F-statistic: 48.19
Date: Wed, 21 Aug 2019 Prob (F-statistic): 4.74e-12
Time: 12:50:12 Log-Likelihood: -27084.
No. Observations: 2941 AIC: 5.417e+04
Df Residuals: 2939 BIC: 5.418e+04
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 6302.8803 315.173 19.998 0.000 5684.899 6920.862
< 18 91.8513 13.232 6.942 0.000 65.907 117.796
Omnibus: 407.946 Durbin-Watson: 1.290
Prob(Omnibus): 0.000 Jarque-Bera (JB): 824.396
Skew: 0.848 Prob(JB): 9.65e-180
Kurtosis: 4.962 Cond. No. 169.



Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

  • Again, let’s do some sanity checks to make sure the results are not random
  • Prob(F-statistic) is close to 0, which is less than 0.05 (or 0.01), so we have significance
  • The R-squared value is not so strong: only about 1.6 percent of the variation in y can be explained by the changes in x

Visualising model on this data

fig = plt.figure(figsize=(10,8)) # create figure object
fig.suptitle('Plot Showing how model fits data', fontweight=1000, y=0.92)# set super title and height position
subplot =fig.add_subplot(111) # add a subplot with 1 row and column
subplot.scatter(X,y, color='darkorange', label='scatter plot') # make scatter plot
subplot.set_xlabel('< 18')
subplot.set_ylabel('YPLL Rate')

# y = mx + c
Y = model.params.values[1] * X + model.params.values[0] 
subplot.plot(X, Y, color='black', label='models best fit') # plot model line on scatter plot
subplot.legend(ncol=4);

[Figure: scatter of % < 18 vs. YPLL rate with the model's best-fit line]

YPLL vs. median household income

# define new X and y
X = df['median household income']
y = df['YPLL Rate']
model = OLS(y, add_constant(X)).fit() # fit model to data and compute weights and biases
model.summary() # print model summary and view statistics
OLS Regression Results
Dep. Variable: YPLL Rate R-squared: 0.403
Model: OLS Adj. R-squared: 0.403
Method: Least Squares F-statistic: 1984.
Date: Wed, 21 Aug 2019 Prob (F-statistic): 0.00
Time: 12:50:14 Log-Likelihood: -26349.
No. Observations: 2941 AIC: 5.270e+04
Df Residuals: 2939 BIC: 5.271e+04
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 1.438e+04 137.216 104.809 0.000 1.41e+04 1.47e+04
median household income -0.1331 0.003 -44.540 0.000 -0.139 -0.127
Omnibus: 697.404 Durbin-Watson: 1.377
Prob(Omnibus): 0.000 Jarque-Bera (JB): 3028.219
Skew: 1.085 Prob(JB): 0.00
Kurtosis: 7.472 Cond. No. 1.81e+05



Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.81e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

  • Prob(F-statistic) is close to 0, which is less than 0.05 (or 0.01), so we have significance
  • The R-squared value is 0.403. This means that 40$\%$ of the variation in the YPLL Rate can be explained by the independent variable x, i.e. median household income
  • median household income is also negatively correlated with the YPLL rate (note the negative coefficient)

what is the meaning of this?

  • We can reconstruct 40$\%$ of the information about the YPLL Rate by collecting data about median household income.
  • The percentage of those below 18, i.e. % < 18, is a weak predictor: almost none of the variation in y is explained by it

Visualize this model

fig = plt.figure(figsize=(10,8)) # create figure object
fig.suptitle('Plot Showing how model fits data', fontweight=1000, y=0.92)
subplot =fig.add_subplot(111) # add subplot
subplot.scatter(X,y, color='darkgreen', label='scatter plot') # make scatter plot
subplot.set_xlabel('median household income')
subplot.set_ylabel('YPLL Rate')

# y = mx + c
Y = model.params.values[1] * X + model.params.values[0]
subplot.plot(X, Y, color='black', label='models best fit') # plot model's best fit line
subplot.legend(ncol=4);

[Figure: scatter of median household income vs. YPLL rate with the model's best-fit line]

Multiple linear regression

  • combining information from multiple measures to improve the model
# rewrite the column names with underscores so they match the identifiers used from here on
df.columns = [c.replace(' ', '_') for c in df.columns]

# define X and y
cols_to_use = [
#     'FIPS',
#  'Population',
 '<_18',
#  '65_and_over',
#  'African_American',
#  'Female',
#  'Rural',
 '%Diabetes',
 'HIV_rate',
 'Physical_Inactivity',
 'mental_health_provider_rate',
 'median_household_income',
#  '%_high_housing_costs',
 '%_Free_lunch',
 '%_child_Illiteracy',
 '%_Drive_Alone',
#  'YPLL_Rate'
] # these attributes were selected by trial and error; the irrelevant ones are commented out as shown

X = df[cols_to_use] # new X or input
y = df['YPLL_Rate'] # output or target variable
model = OLS(y, add_constant(X)).fit() # fit model with data
model.summary() # print model summary
OLS Regression Results
Dep. Variable: YPLL_Rate R-squared: 0.645
Model: OLS Adj. R-squared: 0.644
Method: Least Squares F-statistic: 592.0
Date: Wed, 21 Aug 2019 Prob (F-statistic): 0.00
Time: 15:17:05 Log-Likelihood: -25584.
No. Observations: 2941 AIC: 5.119e+04
Df Residuals: 2931 BIC: 5.125e+04
Df Model: 9
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 4156.5198 400.848 10.369 0.000 3370.548 4942.491
<_18 87.2481 9.070 9.620 0.000 69.464 105.032
%Diabetes 328.7789 22.331 14.723 0.000 284.992 372.566
HIV_rate 0.9893 0.149 6.656 0.000 0.698 1.281
Physical_Inactivity 80.4428 8.722 9.223 0.000 63.342 97.544
mental_health_provider_rate -1.1311 0.448 -2.527 0.012 -2.009 -0.253
median_household_income -0.0481 0.004 -13.447 0.000 -0.055 -0.041
%_Free_lunch 47.3245 2.820 16.779 0.000 41.794 52.855
%_child_Illiteracy -43.7611 6.407 -6.830 0.000 -56.324 -31.198
%_Drive_Alone -31.4760 4.133 -7.616 0.000 -39.580 -23.372
Omnibus: 779.012 Durbin-Watson: 1.748
Prob(Omnibus): 0.000 Jarque-Bera (JB): 4556.589
Skew: 1.125 Prob(JB): 0.00
Kurtosis: 8.668 Cond. No. 6.86e+05



Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.86e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

  • Analysis

  • % Free lunch is strongly correlated with the YPLL rate
  • should we eliminate free lunch to improve life and reduce the YPLL Rate?
  • or perhaps there is a third variable connecting everything that policy makers should look into? One way to probe this is sketched after this list.
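A simple way to probe for a confounder (a sketch, not a full causal analysis): compare the free-lunch coefficient on its own with the coefficient after controlling for the suspected third variable, here median household income.

# coefficient of %_Free_lunch alone...
simple = OLS(y, add_constant(df[['%_Free_lunch']])).fit()
# ...versus after controlling for household income
controlled = OLS(y, add_constant(df[['%_Free_lunch', 'median_household_income']])).fit()
print(simple.params['%_Free_lunch'], controlled.params['%_Free_lunch']) # a large shrink hints at confounding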

Looking at non-linearity

  • linear regression only helps us come up with the best-fit line for our data
  • but there are cases of weak, non-linear interaction between variables
  • let’s try a polynomial fit using sklearn
from sklearn.preprocessing import PolynomialFeatures # lets us fit polynomials to our data
poly_transform = PolynomialFeatures(degree=3) # instantiate a transformer for all terms up to degree 3

X_ply = poly_transform.fit_transform(X) # transform the input; this already prepends a constant column
model = OLS(y, X_ply).fit() # fit model to the expanded features
print('R-squared value: {:.2f}'.format(model.rsquared)) # print R squared values
print('Adjusted R-squared value: {:.2f}'.format(model.rsquared_adj))
R-squared value: 0.76
Adjusted R-squared value: 0.74
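To see what PolynomialFeatures actually builds, here is a tiny degree-2 example with two made-up inputs:

demo = PolynomialFeatures(degree=2)
print(demo.fit_transform([[2, 3]])) # [[1. 2. 3. 4. 6. 9.]] i.e. 1, x1, x2, x1^2, x1*x2, x2^2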

conclusion

  • The initial adjusted R-squared value of 0.64 has been improved upon greatly using the polynomial-features method.
  • The new R-squared values are 0.76, and 0.74 for the adjusted!
  • Note: a train/test split procedure was not used; a quick sketch of what that would look like follows below
  • To be done properly in a future notebook, using the same data, with metrics!
  • This was an illustration of how to justify a model from a statistical point of view.
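As a preview of that future notebook, here is a minimal sketch of a train/test split evaluation (using the X_ply and y defined above; the 80/20 split and random_state are arbitrary choices):

from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# hold out 20% of the counties for testing
X_train, X_test, y_train, y_test = train_test_split(X_ply, y, test_size=0.2, random_state=42)

model = OLS(y_train, X_train).fit() # train on the training rows only
print('out-of-sample R-squared: {:.2f}'.format(r2_score(y_test, model.predict(X_test))))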