Machine Learning: Ordinary least squares using the statsmodels API

15 minute read

Regression: Building a predictive model

Linear regression, or the ordinary least squares method, is a supervised machine learning technique used to predict a continuous-valued output. Every time we fit a multiple linear regression model to our data, we compute a set of weights or coefficients, $\beta_0,\beta_1,\beta_2, \beta_3…\beta_n$, where $\beta_0$ is the intercept or constant, plus an error term $\epsilon$ (sometimes called the bias or noise term).
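Written out in full, the model takes the familiar form

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \epsilon$$

and ordinary least squares picks the $\beta$ values that minimise the sum of squared differences between the predicted and observed $y$.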

In this post, I will show you how to build a predictive model using the statsmodels API. Before we get into that, let’s talk about the data we will be using. Two datasets will be used for this predictive model, both from the County Health Rankings: the Years of Potential Life Lost (YPLL) data and the Additional Measures data, cleaned-up versions of which you can find on my GitHub.

Years of potential life lost, or YPLL, is an estimate of the average years a person would have lived had he or she not died prematurely. It is, therefore, a measure of premature mortality. As an alternative to death rates, it is a method that gives more weight to deaths that occur among younger people.

To calculate the years of potential life lost, an upper reference age is determined; it should correspond roughly to the life expectancy of the population under study. In this data, the reference age is 75. So if a person dies at 65, their YPLL is calculated as 75 - 65 = 10, and so on.
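As a minimal illustration (the dataset already contains the computed rates, so this little helper is purely for intuition):

REFERENCE_AGE = 75 # upper reference age used in this dataset

def years_of_potential_life_lost(age_at_death):
    """YPLL for a single death; zero for deaths at or above the reference age."""
    return max(REFERENCE_AGE - age_at_death, 0)

print(years_of_potential_life_lost(65)) # 10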

Now, let’s get into the data. The first step of any data science project is to clean the data. This may include handling missing values, normalising the data, performing one-hot encoding for categorical variables, and so on. Machine learning models do not work if our data have missing values, so it’s very important that we check for them.
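A quick way to run that check once a frame is loaded (a minimal sketch; it simply previews the ypll.csv file that is loaded properly below):

import pandas as pd

ypll = pd.read_csv('ypll.csv') # load the data
print(ypll.isnull().sum()) # count of missing values per column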

Python code block to import the data analysis packages:

# import data analysis libraries
import pandas as pd
import numpy as np

# import plotting library matplotlib
import matplotlib.pyplot as plt


# import OLS and add_constant from statsmodels / predictive modeling
from statsmodels.api import OLS, add_constant

# turn off future warnings
import warnings
warnings.filterwarnings(action='ignore')


# normalising packages
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler, MaxAbsScaler
ypll = pd.read_csv('ypll.csv') # load data
ypll.head() # view first 5 rows
   FIPS    State   County Unreliable  YPLL Rate
0  1000  Alabama      NaN        NaN    10189.0
1  1001  Alabama  Autauga        NaN     9967.0
2  1003  Alabama  Baldwin        NaN     8322.0
3  1005  Alabama  Barbour        NaN     9559.0
4  1007  Alabama     Bibb        NaN    13283.0
add_measures = pd.read_csv('additional_measures_cleaned.csv') # import and load data
add_measures.head() # view first 5 rows
FIPS State County Population < 18 65 and over African American Female Rural %Diabetes HIV rate Physical Inactivity mental health provider rate median household income % high housing costs % Free lunch % child Illiteracy % Drive Alone
0 1000 Alabama NaN 4708708 23.9 13.8 26.1 51.6 44.6 12 NaN 31 20 42586.0 30 51.0 14.8 84
1 1001 Alabama Autauga 50756 27.8 11.6 18.4 51.4 44.8 11 170.0 33 2 51622.0 25 29.0 12.7 86
2 1003 Alabama Baldwin 179878 23.1 17.0 10.0 51.0 54.2 10 176.0 25 17 51957.0 29 29.0 10.6 83
3 1005 Alabama Barbour 29737 22.3 13.8 46.6 46.8 71.5 14 331.0 35 7 30896.0 36 65.0 23.2 82
4 1007 Alabama Bibb 21587 23.3 13.5 22.3 48.0 81.5 11 90.0 37 0 41076.0 18 48.0 17.5 83
ypll.info() # check dataset information
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3192 entries, 0 to 3191
Data columns (total 5 columns):
FIPS          3192 non-null int64
State         3192 non-null object
County        3141 non-null object
Unreliable    196 non-null object
YPLL Rate     3097 non-null float64
dtypes: float64(1), int64(1), object(3)
memory usage: 124.8+ KB
# drop counties where data was deemed unreliable, marked by x
ypll.drop(index=ypll[ypll.Unreliable.isin(['x'])].index, inplace=True) 
add_measures.info() # check dataset information
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3192 entries, 0 to 3191
Data columns (total 18 columns):
FIPS                           3192 non-null int64
State                          3192 non-null object
County                         3141 non-null object
Population                     3192 non-null int64
< 18                           3192 non-null float64
65 and over                    3192 non-null float64
African American               3192 non-null float64
Female                         3192 non-null float64
Rural                          3191 non-null float64
%Diabetes                      3192 non-null int64
HIV rate                       2230 non-null float64
Physical Inactivity            3192 non-null int64
mental health provider rate    3192 non-null int64
median household income        3191 non-null float64
% high housing costs           3192 non-null int64
% Free lunch                   3173 non-null float64
% child Illiteracy             3186 non-null float64
% Drive Alone                  3192 non-null int64
dtypes: float64(9), int64(7), object(2)
memory usage: 449.0+ KB
# drop rows where column % child illiteracy has missing values
add_measures.drop(index=add_measures[add_measures['% child Illiteracy'].isnull()].index, inplace=True) 
# drop rows where column County has missing values
add_measures.drop(index=add_measures[add_measures['County'].isnull()].index, inplace=True)
ypll.columns # view list of columns
Index(['FIPS', 'State', 'County', 'Unreliable', 'YPLL Rate'], dtype='object')
cols = [
    'FIPS',
#  'State',
 'County',
 'Population',
 '< 18',
 '65 and over',
 'African American',
 'Female',
 'Rural',
 '%Diabetes',
 'HIV rate',
 'Physical Inactivity',
 'mental health provider rate',
 'median household income',
 '% high housing costs',
 '% Free lunch',
 '% child Illiteracy',
 '% Drive Alone'
] # list of columns or attributes

left = add_measures[cols] # create left dataframe
left.head() # view the first five rows
FIPS County Population < 18 65 and over African American Female Rural %Diabetes HIV rate Physical Inactivity mental health provider rate median household income % high housing costs % Free lunch % child Illiteracy % Drive Alone
1 1001 Autauga 50756 27.8 11.6 18.4 51.4 44.8 11 170.0 33 2 51622.0 25 29.0 12.7 86
2 1003 Baldwin 179878 23.1 17.0 10.0 51.0 54.2 10 176.0 25 17 51957.0 29 29.0 10.6 83
3 1005 Barbour 29737 22.3 13.8 46.6 46.8 71.5 14 331.0 35 7 30896.0 36 65.0 23.2 82
4 1007 Bibb 21587 23.3 13.5 22.3 48.0 81.5 11 90.0 37 0 41076.0 18 48.0 17.5 83
5 1009 Blount 58345 24.2 14.7 2.1 50.2 91.0 11 66.0 35 2 46086.0 21 37.0 13.9 80
# drop rows where the County is missing values
ypll.drop(index=ypll[ypll.County.isnull()].index, inplace=True)
len(ypll) # check length of dataframe
2945
df = pd.merge(left, ypll, on=['County', 'FIPS']) # merge the two data sets into one dataframe
df.head() # view the first five rows of the data
FIPS County Population < 18 65 and over African American Female Rural %Diabetes HIV rate Physical Inactivity mental health provider rate median household income % high housing costs % Free lunch % child Illiteracy % Drive Alone State Unreliable YPLL Rate
0 1001 Autauga 50756 27.8 11.6 18.4 51.4 44.8 11 170.0 33 2 51622.0 25 29.0 12.7 86 Alabama NaN 9967.0
1 1003 Baldwin 179878 23.1 17.0 10.0 51.0 54.2 10 176.0 25 17 51957.0 29 29.0 10.6 83 Alabama NaN 8322.0
2 1005 Barbour 29737 22.3 13.8 46.6 46.8 71.5 14 331.0 35 7 30896.0 36 65.0 23.2 82 Alabama NaN 9559.0
3 1007 Bibb 21587 23.3 13.5 22.3 48.0 81.5 11 90.0 37 0 41076.0 18 48.0 17.5 83 Alabama NaN 13283.0
4 1009 Blount 58345 24.2 14.7 2.1 50.2 91.0 11 66.0 35 2 46086.0 21 37.0 13.9 80 Alabama NaN 8475.0
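Note that pd.merge performs an inner join by default, so only (County, FIPS) pairs present in both frames survive. A toy example of that behaviour (made-up frames, purely illustrative):

a = pd.DataFrame({'FIPS': [1, 2], 'County': ['A', 'B'], 'x': [10, 20]})
b = pd.DataFrame({'FIPS': [2, 3], 'County': ['B', 'C'], 'y': [30, 40]})
print(pd.merge(a, b, on=['County', 'FIPS'])) # only the row for County 'B' remains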
df['YPLL Rate'].fillna(value=df['YPLL Rate'].mean(), inplace=True) # fill missing values with the mean of the column
df.drop(labels=['Unreliable'], inplace=True, axis=1) # drop column of NaN's (not necessary for our analysis)
df.info() # check information of data set
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2941 entries, 0 to 2940
Data columns (total 19 columns):
FIPS                           2941 non-null int64
County                         2941 non-null object
Population                     2941 non-null int64
< 18                           2941 non-null float64
65 and over                    2941 non-null float64
African American               2941 non-null float64
Female                         2941 non-null float64
Rural                          2941 non-null float64
%Diabetes                      2941 non-null int64
HIV rate                       2220 non-null float64
Physical Inactivity            2941 non-null int64
mental health provider rate    2941 non-null int64
median household income        2941 non-null float64
% high housing costs           2941 non-null int64
% Free lunch                   2926 non-null float64
% child Illiteracy             2941 non-null float64
% Drive Alone                  2941 non-null int64
State                          2941 non-null object
YPLL Rate                      2941 non-null float64
dtypes: float64(10), int64(7), object(2)
memory usage: 459.5+ KB
  • HIV rate and % Free lunch still have missing values; let’s impute them with their column means
df['% Free lunch'].fillna(value=df['% Free lunch'].mean(), inplace=True) # fill in missing values for these columns
df['HIV rate'].fillna(value=df['HIV rate'].mean(), inplace=True)
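Mean imputation is simple but sensitive to outliers; a median-based alternative (just a sketch, not used in the rest of this post) would be:

df['HIV rate'].fillna(value=df['HIV rate'].median(), inplace=True) # the median is more robust to skewed columns like HIV rate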

Visualising correlations

import seaborn as sns # another visualisation package for pretty plots
sns.set()
fig = plt.figure(figsize=(10,15)) # create figure object
plt.tight_layout()
fig.suptitle('Visualising correlations', y=0.9, fontweight=1000) # set super title
subplot = fig.add_subplot(3,1,1) # add subplot with 3 rows and 1 column, choose axis 1

subplot.scatter(df['YPLL Rate'], df['%Diabetes'], color='orange') # make scatter plot
subplot.set_xlabel('ypll rate') # set x label
subplot.set_ylabel('% diabetes') # y label

subplot = fig.add_subplot(3,1,2) # choose axis 2
subplot.scatter(df['YPLL Rate'], df['< 18'], color='red') # scatter plot
subplot.set_xlabel('ypll rate')
subplot.set_ylabel('% < 18')


subplot = fig.add_subplot(3,1,3) # choose axis 3
subplot.scatter(df['YPLL Rate'], df['median household income'], color='green') # scatter plot
subplot.set_xlabel('ypll rate')
subplot.set_ylabel('median household income'); # set y label

[Figure: scatter plots of %Diabetes, % < 18 and median household income against the YPLL rate]

Visualising the other variables vs. YPLL

plot=[
 'FIPS',
 'Population',
#  '< 18',
 '65 and over',
 'African American',
 'Female',
 'Rural',
#  '%Diabetes',
 'HIV rate',
 'Physical Inactivity',
 'mental health provider rate',
#  'median household income',
 '% high housing costs',
 '% Free lunch',
 '% child Illiteracy',
 '% Drive Alone',
 'YPLL Rate'
] # columns to plot
fig = plt.figure(figsize=(15,35)) # create figure object
# plt.tight_layout()
fig.suptitle('Visualising correlations', y=0.9, fontweight=1000) # set super title

for i in range(len(plot)): # plot target against different attributes 
    
    subplot = fig.add_subplot(len(plot),3,i+1)

    subplot.scatter(df['YPLL Rate'], df[plot[i]], color='green')
    subplot.set_xlabel('ypll rate')
    subplot.set_ylabel(plot[i])

[Figure: grid of scatter plots of each attribute vs. the YPLL rate]

  • some variables are not strongly correlated with the target or output, e.g. FIPS, HIV rate and Population
  • other variables, like %Diabetes and % Free lunch, have a good correlation with the target variable. At this point we may start thinking that free lunch should no longer be on offer. But perhaps there is a hidden factor at play?
  • you can check the Pearson correlation of each variable with the target in the code output of the next cell.
# correlation matrix
cor_matrix = df.corr() # make a correlation matrix using Pearson correlation
cor_matrix['YPLL Rate'].sort_values(ascending=False) # view correlations, sorted, on a scale of -1 to 1
YPLL Rate                      1.000000
% Free lunch                   0.694590
%Diabetes                      0.662009
Physical Inactivity            0.645538
% child Illiteracy             0.461458
African American               0.436223
Rural                          0.286216
HIV rate                       0.175228
< 18                           0.127011
Female                         0.116916
65 and over                    0.071458
% Drive Alone                  0.031896
FIPS                          -0.062884
% high housing costs          -0.106475
Population                    -0.162204
mental health provider rate   -0.177113
median household income       -0.634813
Name: YPLL Rate, dtype: float64
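For a bird’s-eye view of all pairwise correlations at once, a heatmap works well (a small sketch using the df, sns and plt objects defined above):

fig, ax = plt.subplots(figsize=(12, 10)) # create figure and axis
sns.heatmap(df.corr(), cmap='coolwarm', center=0, ax=ax) # pairwise Pearson correlations
plt.show()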

YPLL vs. % Diabetes regression

# define independent and dependent variables
X = df[['%Diabetes']]
y = df['YPLL Rate']

# add the bias/constant term, then fit with OLS (the statsmodels analogue of sklearn's LinearRegression) and show the summary
model = OLS(y, add_constant(X)).fit()
model.summary()
OLS Regression Results
Dep. Variable: YPLL Rate R-squared: 0.438
Model: OLS Adj. R-squared: 0.438
Method: Least Squares F-statistic: 2293.
Date: Wed, 21 Aug 2019 Prob (F-statistic): 0.00
Time: 12:50:10 Log-Likelihood: -26260.
No. Observations: 2941 AIC: 5.252e+04
Df Residuals: 2939 BIC: 5.254e+04
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 809.4571 163.462 4.952 0.000 488.946 1129.969
%Diabetes 767.1553 16.021 47.884 0.000 735.742 798.569
Omnibus: 1081.131 Durbin-Watson: 1.545
Prob(Omnibus): 0.000 Jarque-Bera (JB): 7964.973
Skew: 1.552 Prob(JB): 0.00
Kurtosis: 10.441 Cond. No. 50.0



Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Interpreting the math

  • let’s interpret the output and check the statistical significance, to make sure nothing happened by chance and that our regression is meaningful!

  • Prob(F-statistic) is close to 0, which is less than 0.05 (or 0.01), so we have significance and can safely interpret the rest of the output!

  • for this model the line of best fit is represented by y = mx + constant, i.e. YPLL = 767.16 $\times$ $\%$Diabetes + 809.46

  • To check how well this line/model performs, look at the R-squared value of 0.438. This means that 43.8$\%$ of the variation in y (YPLL rate) can be explained by changes in x ($\%$Diabetes)

what does this all mean?

  • Simply put, it means that without knowing the YPLL of a community, we can simply collect data about the percentage of people affected by diabetes and reconstruct 43.8$\%$ of the variation in the YPLL rate
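As a usage example, the fitted model can score a hypothetical county (the 12$\%$ diabetes figure below is invented purely for illustration):

# predict the YPLL rate for a hypothetical county with 12% diabetes
# the leading 1 is the constant column that add_constant created
print(model.predict([[1, 12]])) # roughly 809.46 + 767.16 * 12 ≈ 10015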

visualize the model built above

# print coefficients
print(model.params.index)
model.params.values
Index(['const', '%Diabetes'], dtype='object')

array([809.45713922, 767.15527376])
fig = plt.figure(figsize=(10,8))
fig.suptitle('Plot Showing how model fits data', fontweight=1000, y=0.92)
subplot =fig.add_subplot(111)
subplot.scatter(X,y, color='darkblue', label='scatter plot')
subplot.set_xlabel('%Diabetes')
subplot.set_ylabel('YPLL Rate')

# y = mx + c
Y = model.params.values[1] * X['%Diabetes'] + model.params.values[0]
subplot.plot(X['%Diabetes'], Y, color='darkorange', label='models best fit')
subplot.legend(ncol=3);

[Figure: scatter of %Diabetes vs. YPLL rate with the model's best-fit line]

Running the regressions for < 18 and median household income

YPLL vs. < 18

# define new X and y
X = df['< 18']
y = df['YPLL Rate']
# fit and show summary
model = OLS(y, add_constant(X)).fit()
model.summary() 
OLS Regression Results
Dep. Variable: YPLL Rate R-squared: 0.016
Model: OLS Adj. R-squared: 0.016
Method: Least Squares F-statistic: 48.19
Date: Wed, 21 Aug 2019 Prob (F-statistic): 4.74e-12
Time: 12:50:12 Log-Likelihood: -27084.
No. Observations: 2941 AIC: 5.417e+04
Df Residuals: 2939 BIC: 5.418e+04
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 6302.8803 315.173 19.998 0.000 5684.899 6920.862
< 18 91.8513 13.232 6.942 0.000 65.907 117.796
Omnibus: 407.946 Durbin-Watson: 1.290
Prob(Omnibus): 0.000 Jarque-Bera (JB): 824.396
Skew: 0.848 Prob(JB): 9.65e-180
Kurtosis: 4.962 Cond. No. 169.



Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

  • Again, let’s do some sanity checks to make sure the results are not random
  • Prob(F-statistic) is close to 0, which is less than 0.05 (or 0.01), so we have significance
  • The R-squared value is not so strong: only about 1.6 percent of the variation in y can be explained by the changes in x

Visualising model on this data

fig = plt.figure(figsize=(10,8)) # create figure object
fig.suptitle('Plot Showing how model fits data', fontweight=1000, y=0.92)# set super title and height position
subplot =fig.add_subplot(111) # add a subplot with 1 row and column
subplot.scatter(X,y, color='darkorange', label='scatter plot') # make scatter plot
subplot.set_xlabel('< 18')
subplot.set_ylabel('YPLL Rate')

# y = mx + c
Y = model.params.values[1] * X + model.params.values[0] 
subplot.plot(X, Y, color='black', label='models best fit') # plot model line on scatter plot
subplot.legend(ncol=4);

[Figure: scatter of % < 18 vs. YPLL rate with the model's best-fit line]

YPLL vs. median household income

# define new X and y
X = df['median household income']
y = df['YPLL Rate']
model = OLS(y, add_constant(X)).fit() # fit model to data and compute weights and biases
model.summary() # print model summary and view statistics
OLS Regression Results
Dep. Variable: YPLL Rate R-squared: 0.403
Model: OLS Adj. R-squared: 0.403
Method: Least Squares F-statistic: 1984.
Date: Wed, 21 Aug 2019 Prob (F-statistic): 0.00
Time: 12:50:14 Log-Likelihood: -26349.
No. Observations: 2941 AIC: 5.270e+04
Df Residuals: 2939 BIC: 5.271e+04
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 1.438e+04 137.216 104.809 0.000 1.41e+04 1.47e+04
median household income -0.1331 0.003 -44.540 0.000 -0.139 -0.127
Omnibus: 697.404 Durbin-Watson: 1.377
Prob(Omnibus): 0.000 Jarque-Bera (JB): 3028.219
Skew: 1.085 Prob(JB): 0.00
Kurtosis: 7.472 Cond. No. 1.81e+05



Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.81e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

  • Prob(F-statistic) is close to 0, which is less than 0.05 (or 0.01), so we have significance
  • The R-squared value is 0.403. This means that 40$\%$ of the variation in the YPLL Rate can be explained by the independent variable x, i.e. median household income
  • median household income is also negatively correlated with the YPLL rate (note the negative coefficient)

what is the meaning of this?

  • We can reconstruct 40$\%$ of the information about the YPLL Rate by collecting data about median household income.
  • The percentage of those below 18, i.e. % < 18, is a weak predictor: almost none of the variation in y is explained by it

Visualize this model

fig = plt.figure(figsize=(10,8)) # create figure object
fig.suptitle('Plot Showing how model fits data', fontweight=1000, y=0.92)
subplot =fig.add_subplot(111) # add subplot
subplot.scatter(X,y, color='darkgreen', label='scatter plot') # make scatter plot
subplot.set_xlabel('median household income')
subplot.set_ylabel('YPLL Rate')

# y = mx + c
Y = model.params.values[1] * X + model.params.values[0]
subplot.plot(X, Y, color='black', label='models best fit') # plot model's best fit line
subplot.legend(ncol=4);

[Figure: scatter of median household income vs. YPLL rate with the model's best-fit line]

Multiple linear regression

  • combining information from multiple measures to improve the model
# rewrite the column names with underscores so they match the identifiers used from here on
df.columns = [c.replace(' ', '_') for c in df.columns]

# define X and y
cols_to_use = [
#     'FIPS',
#  'Population',
 '<_18',
#  '65_and_over',
#  'African_American',
#  'Female',
#  'Rural',
 '%Diabetes',
 'HIV_rate',
 'Physical_Inactivity',
 'mental_health_provider_rate',
 'median_household_income',
#  '%_high_housing_costs',
 '%_Free_lunch',
 '%_child_Illiteracy',
 '%_Drive_Alone',
#  'YPLL_Rate'
] # these attributes were selected by trial and error; the irrelevant ones are commented out as shown

X = df[cols_to_use] # new X or input
y = df['YPLL_Rate'] # output or target variable
model = OLS(y, add_constant(X)).fit() # fit model with data
model.summary() # print model summary
OLS Regression Results
Dep. Variable: YPLL_Rate R-squared: 0.645
Model: OLS Adj. R-squared: 0.644
Method: Least Squares F-statistic: 592.0
Date: Wed, 21 Aug 2019 Prob (F-statistic): 0.00
Time: 15:17:05 Log-Likelihood: -25584.
No. Observations: 2941 AIC: 5.119e+04
Df Residuals: 2931 BIC: 5.125e+04
Df Model: 9
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 4156.5198 400.848 10.369 0.000 3370.548 4942.491
<_18 87.2481 9.070 9.620 0.000 69.464 105.032
%Diabetes 328.7789 22.331 14.723 0.000 284.992 372.566
HIV_rate 0.9893 0.149 6.656 0.000 0.698 1.281
Physical_Inactivity 80.4428 8.722 9.223 0.000 63.342 97.544
mental_health_provider_rate -1.1311 0.448 -2.527 0.012 -2.009 -0.253
median_household_income -0.0481 0.004 -13.447 0.000 -0.055 -0.041
%_Free_lunch 47.3245 2.820 16.779 0.000 41.794 52.855
%_child_Illiteracy -43.7611 6.407 -6.830 0.000 -56.324 -31.198
%_Drive_Alone -31.4760 4.133 -7.616 0.000 -39.580 -23.372
Omnibus: 779.012 Durbin-Watson: 1.748
Prob(Omnibus): 0.000 Jarque-Bera (JB): 4556.589
Skew: 1.125 Prob(JB): 0.00
Kurtosis: 8.668 Cond. No. 6.86e+05



Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.86e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

  • Analysis

  • % Free lunch is strongly correlated with the YPLL rate
  • should we eliminate free lunch to improve life and reduce the YPLL Rate?
  • or perhaps there is a third variable connecting everything that policy makers should look into? One way to probe this is sketched after this list.
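A simple way to probe for a confounder (a sketch, not a full causal analysis): compare the free-lunch coefficient on its own with the coefficient after controlling for the suspected third variable, here median household income.

# coefficient of %_Free_lunch alone...
simple = OLS(y, add_constant(df[['%_Free_lunch']])).fit()
# ...versus after controlling for household income
controlled = OLS(y, add_constant(df[['%_Free_lunch', 'median_household_income']])).fit()
print(simple.params['%_Free_lunch'], controlled.params['%_Free_lunch']) # a large shrink hints at confounding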

Looking at non-linearity

  • linear regression only helps us come up with the best-fit line for our data
  • but there are cases of weak, non-linear interaction between variables
  • let’s try a polynomial fit using sklearn
from sklearn.preprocessing import PolynomialFeatures # lets us fit polynomials to our data
poly_transform = PolynomialFeatures(degree=3) # instantiate a transformer for all terms up to degree 3

X_ply = poly_transform.fit_transform(X) # transform the input; this already prepends a constant column
model = OLS(y, X_ply).fit() # fit model to the expanded features
print('R-squared value: {:.2f}'.format(model.rsquared)) # print R squared values
print('Adjusted R-squared value: {:.2f}'.format(model.rsquared_adj))
R-squared value: 0.76
Adjusted R-squared value: 0.74
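To see what PolynomialFeatures actually builds, here is a tiny degree-2 example with two made-up inputs:

demo = PolynomialFeatures(degree=2)
print(demo.fit_transform([[2, 3]])) # [[1. 2. 3. 4. 6. 9.]] i.e. 1, x1, x2, x1^2, x1*x2, x2^2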

conclusion

  • The initial adjusted R-squared value of 0.64 has been improved upon greatly using the polynomial-features method.
  • The new R-squared values are 0.76, and 0.74 for the adjusted!
  • Note: a train/test split procedure was not used; a quick sketch of what that would look like follows below
  • To be done properly in a future notebook, using the same data, with metrics!
  • This was an illustration of how to justify a model from a statistical point of view.
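As a preview of that future notebook, here is a minimal sketch of a train/test split evaluation (using the X_ply and y defined above; the 80/20 split and random_state are arbitrary choices):

from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# hold out 20% of the counties for testing
X_train, X_test, y_train, y_test = train_test_split(X_ply, y, test_size=0.2, random_state=42)

model = OLS(y_train, X_train).fit() # train on the training rows only
print('out-of-sample R-squared: {:.2f}'.format(r2_score(y_test, model.predict(X_test))))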