# make figures better:
import warnings
warnings.filterwarnings('ignore')

import random, math
import numpy as np
import scipy, scipy.stats
import matplotlib
import matplotlib.pyplot as plt

font = {'weight':'normal', 'size':16}
matplotlib.rc('font', **font)
matplotlib.rc('figure', figsize=(9.0, 6.0))
matplotlib.rc('xtick.major', pad=8)  # xticks too close to border!
matplotlib.rc('ytick.major', pad=10) # yticks too close to border!
matplotlib.rc('xtick', labelsize='small')
matplotlib.rc('ytick', labelsize='small')

nicered = "#E6072A"
niceblu = "#424FA4"
nicegrn = "#6DC048"
James Bagrow, james.bagrow@uvm.edu, http://bagrow.com
We discussed linear regression way back in Lecture 9.
First, let's recap:
The goal with linear regression was to find the line that best fit some given XY-data (and test if that fit was significant). For the \(i\)th data point:
\[y_i = \beta_0 + \beta_1 x_i + \epsilon_i\]
We want to find the pair of values (\(\beta_0\),\(\beta_1\)) that minimize the sum of squared errors \(S\):
\[S(\beta) = \sum_i \epsilon_i^2 = \sum_i \left[y_i - \left(\beta_0 + \beta_1 x_i\right)\right]^2\]
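To make the objective concrete, here's a minimal sketch (with made-up toy data) that evaluates \(S\) for a couple of candidate coefficient pairs:
import numpy as np

# toy data (made up for illustration):
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 1.9, 3.2, 3.9])

def sum_sq_errors(b0, b1, x, y):
    """S(beta) = sum_i [ y_i - (b0 + b1*x_i) ]^2"""
    resid = y - (b0 + b1*x)
    return np.sum(resid**2)

print sum_sq_errors(1.0, 1.0, x, y) # a good candidate line gives a small S
print sum_sq_errors(0.0, 2.0, x, y) # a worse line gives a larger S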
Of course, the model equation holds for all the data points together:
\[y_1 = \beta_0 + \beta_1 x_1 + \epsilon_1\\ y_2 = \beta_0 + \beta_1 x_2 + \epsilon_2\\ y_3 = \beta_0 + \beta_1 x_3 + \epsilon_3\\ \vdots \\ y_n = \beta_0 + \beta_1 x_n + \epsilon_n \]
We can write this in matrix notation:
\[\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \vdots \\ \epsilon_n \end{bmatrix}\]
or, compactly,
\[\mathbf{y} = \mathbf{X}\mathbf{\beta} + \mathbf{\epsilon}\]
Now a single matrix equation captures all of the individual linear regression equations. Furthermore, if we want to estimate the \(\mathbf{\beta}\)'s with Ordinary Least Squares (OLS), we have a (relatively) simple way to write down the best \(\mathbf{\beta}\)'s.
In matrix form the residuals are:
\[\mathbf{\epsilon} = \mathbf{y} - \mathbf{X}\mathbf{\beta}\]
The sum of the squared residuals is then (using the inner product):
\[\mathbf{\epsilon}^\mathrm{T}\mathbf{\epsilon} = \left(\mathbf{y} - \mathbf{X}\mathbf{\beta}\right)^\mathrm{T}\left(\mathbf{y} - \mathbf{X}\mathbf{\beta}\right)\]
where \(\mathbf{\epsilon}^\mathrm{T}\) is the transpose of \(\mathbf{\epsilon}\).
With a little calculus and elbow grease we can find the minimum of this expression, which gives us the best estimates for \(\mathbf{\beta}\), called \(\mathbf{\hat{\beta}}\):
\[\mathbf{\hat{\beta}} = \left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1}\mathbf{X}^\mathrm{T} \mathbf{y}\]
assuming \(\mathbf{X}^\mathrm{T}\mathbf{X}\) is nonsingular, meaning we can compute the matrix inverse.
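As a quick sanity check of this formula, here is a minimal NumPy sketch (with made-up data and arbitrary coefficient values) that evaluates \(\left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1}\mathbf{X}^\mathrm{T}\mathbf{y}\) directly:
import numpy as np

np.random.seed(0)
n = 200
X = np.column_stack(( np.ones(n), np.random.random(n) ))  # column of 1's, then the x data
beta_true = np.array([2.0, 3.0])                          # arbitrary "true" coefficients
y = np.dot(X, beta_true) + np.random.randn(n)*0.1         # data generated from the model

# beta_hat = (X^T X)^{-1} X^T y, computed by solving the linear system
# (X^T X) beta_hat = X^T y rather than explicitly inverting the matrix:
beta_hat = np.linalg.solve(np.dot(X.T, X), np.dot(X.T, y))
print beta_hat   # should be close to [2.0, 3.0]
In practice you rarely form the inverse explicitly; np.linalg.lstsq, or a statistics library like the statsmodels routines below, handles this more robustly.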
We already computed the values of \(\beta_0\) and \(\beta_1\), so why are we now getting into all this matrix nonsense?
Because the above derivation is not limited to two coefficients and a linear regression of the form \(y = \beta_0 + \beta_1 x\).
Until now, we've been doing simple linear regression.
There is no reason why we need to stop our equation at two coefficients. What about this:
\[\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix} \]
In matrix form that is exactly the same equation:
\[\mathbf{y} = \mathbf{X}\mathbf{\beta} + \mathbf{\epsilon}\]
except that \(\mathbf{X}\) has \(p+1\) columns and \(\mathbf{\beta}\) has \(p+1\) rows.
For example, we may have reason to study a linear model relating a person's height and weight:
\[\mathrm{weight} = \beta_0 + \beta_1 \mathrm{height}\]
Now we can also incorporate (probably unrealistically) age data as well:
\[\mathrm{weight} = \beta_0 + \beta_1 \mathrm{height} + \beta_2 \mathrm{age}\]
We have a new coefficient, and we have a new column (the age column) in our data matrix. But estimating \(\mathbf{\hat{\beta}}\) remains the same.
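As a small sketch of what this means for the data matrix (with hypothetical height and age numbers invented for illustration), adding the age predictor simply adds a column to \(\mathbf{X}\) and a row to \(\mathbf{\beta}\):
import numpy as np

# hypothetical measurements for five people (made up for illustration):
height = np.array([1.60, 1.65, 1.70, 1.75, 1.80])   # meters
age    = np.array([25.0, 32.0, 41.0, 29.0, 37.0])   # years

# simple regression, weight ~ height:  columns [1, height]  -> two betas
X_simple   = np.column_stack(( np.ones(len(height)), height ))

# multiple regression, weight ~ height + age:  columns [1, height, age]  -> three betas
X_multiple = np.column_stack(( np.ones(len(height)), height, age ))

print X_simple.shape     # (5, 2)
print X_multiple.shape   # (5, 3)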
There are a lot of names for the quantities in these equations, and it's hard to keep them straight.
The \(y_i\)'s are called the dependent variable, the response variable, or the endogenous variable.
The \(x_i\)'s (including the constant \(1\)) are called the independent variables, the explanatory variables, the regressors, or the exogenous variables.
The matrix of data \(\mathbf{X}\) is called the design matrix.
The \(\beta\)'s are called the regression coefficients, the parameters, or the effects.
The vector \(\mathbf{\beta}\) is called the parameter or coefficient vector.
The \(\epsilon\)'s are called the errors, the error term, or noise.
Let's see this in action:
We're going to use a new library called statsmodels.
First let's generate some data:
nobs = 100
b = [1.0,0.1,0.5] # betas
Xlist = []
Ylist = []
for _ in range(nobs):
    # build the individual observation:
    x1 = np.random.random() # a single number
    x2 = np.random.random()
    e = np.random.randn()*0.05 # random noise
    y = 1.0*b[0] + x1*b[1] + x2*b[2] + e

    # add it to the matrices:
    Xlist.append([1.0,x1,x2]) # design matrix/exog variables
    Ylist.append(y) # endogenous variables
Let's even print the first few lines:
for i in range(5):
    print Ylist[i], "-->", Xlist[i]
print
print "Size Y =", len(Ylist)
print "Size X =", len(Xlist)
Looks good.
If you're comfortable with matrix operations, you can also build the data using matrix multiplication:
X = np.hstack(( np.ones((nobs,1)), np.random.random((nobs,2)) ))
beta = [1, 0.1, 0.5]
e = np.random.randn(nobs)*0.05 # Gaussian noise, as in the loop version
y = np.dot(X, beta) + e
print "(rows, columns)"
print X.shape
print y.shape
But matrix operations are not very comfortable in Python (at least compared to MATLAB), so you might just want to avoid them. It's up to you.
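If you do take the matrix route, NumPy can already solve the least-squares problem for you; here is a quick sketch using the X and y we just built (np.linalg.lstsq also returns some extra diagnostics we ignore here):
beta_hat, resid, rank, sing_vals = np.linalg.lstsq(X, y)
print beta_hat   # should be close to the betas we used: [1, 0.1, 0.5]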
Anyway, now that we've got the data generated, let's use statsmodels' OLS function to find the \(\mathbf{\hat{\beta}}\):
import statsmodels.api as sm
# Fit regression model
model = sm.OLS(y,X) # our matrix data
#model = sm.OLS(Ylist, Xlist) # our lists of data
result = model.fit()
Wow, that was pretty easy, but now we need to understand what result is...
model.fit() returns an object that contains a bunch of data about the fit: residuals, coefficient values, and other useful statistics. It also has a few methods attached to it. One very handy method is summary():
print result.summary()
This type of display is very common across statistical packages. There are three main panels.
The top panel contains information about the size of the data and the overall quality of the fit, as well as some bookkeeping such as the time of the calculation.
The middle panel contains information about each coefficient of the fit.
The bottom panel contains information about the residuals \(\epsilon_i\) of the fit: are they normally distributed (Omnibus test), do they show serial correlations (Durbin-Watson), etc.
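Those residual diagnostics can also be recomputed by hand from the residuals themselves; a minimal sketch using statsmodels' stattools functions, applied to the result object from above:
from statsmodels.stats.stattools import durbin_watson, jarque_bera

resid = result.resid        # the fitted residuals, one per observation
print durbin_watson(resid)  # near 2 means little serial correlation
print jarque_bera(resid)    # (statistic, p-value, skew, kurtosis)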
We can also access the information shown in the summary directly:
print result.params # the coefficients
print result.rsquared
Here are all the things result has:
for name in dir(result):
    if "__" in name: # skip "private" stuff
        continue
    print name
There's lots of data we can interact with regarding the fit...
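A few of the attributes that tend to be the most useful in practice (a short sketch using the same result object; these are standard statsmodels regression-results attributes):
print result.bse              # standard error of each coefficient
print result.pvalues          # p-value for each coefficient
print result.conf_int()       # confidence interval for each coefficient
print result.rsquared_adj     # adjusted R-squared
print result.fittedvalues[:5] # the model's predicted y values (first five)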
Here's another example. In this case we are going to use sm.add_constant
to more quickly add the column of ones to the design matrix.
Height = [1.47,1.50,1.52,1.55,1.57,1.60,1.63,1.65,
1.68,1.70,1.73,1.75,1.78,1.80,1.83]
Nonsense = list(np.random.random(len(Height))-1)
Weight = [52.21,53.12,54.48,55.84,57.20,58.57,59.93,
61.29,63.11,64.47,66.28,68.10,69.92,72.19,74.46]
# stack exog variables into design matrix
X = np.column_stack( (Height,Nonsense) )
X = sm.add_constant(X, prepend=True) # add a column of 1's out front
res = sm.OLS(Weight,X).fit() # create a model and fit it
print res.summary()
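Since the Nonsense column is just random numbers, its coefficient should come out statistically indistinguishable from zero; a quick check (the exact values will differ from run to run because Nonsense is random):
# res.params is ordered [constant, Height, Nonsense], matching the column order of X:
print res.params
print res.pvalues   # the p-value for the Nonsense coefficient should typically be large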
Let's make a quickie plot of the solution and the data:
from mpl_toolkits.mplot3d import Axes3D
from JSAnimation import IPython_display
from matplotlib import animation
fig = plt.figure(figsize=(8,5))
ax = Axes3D(fig)
x = np.linspace(min(Height),max(Height),50)
y = np.linspace(min(Nonsense),max(Nonsense),50)
x, y = np.meshgrid(x,y)
z = -38.9522 + 61.1270*x - 0.2821*y   # coefficients copied from res.params for this run (yours will differ)
ax.scatter(Height,Nonsense,Weight, c='r', s=50)
ax.plot_surface(x,y,z, alpha=0.5, lw=0)
ax.set_xlabel("Height")
ax.set_ylabel("Nonsense")
ax.set_zlabel("Weight")
# point-of-view: (altitude degrees, azimuth degrees)
ax.view_init(30, 0)
def animate(t):
    # update POV:
    ax.view_init(30 + np.sin(2*np.pi*t/120.0)*30,
                 3*t)
    return []

animation.FuncAnimation(fig, animate,
                        frames=120,  # number of frames to draw
                        interval=40, # time (ms) on each frame
                        blit=True)
OK, so we see that, since the "Nonsense" axis isn't meaningful, the fitted plane is essentially flat along that direction. Let's do an example with real data, where both variables matter:
data = np.genfromtxt("anaesthesia_1959.csv", dtype=float,
delimiter=',', names=True)
# load each column into separate array:
LogDose = data["LogDose"]
BP = data["BP"]
RecovTime = data["RecovTime"]
This data measures the time to recovery from anaesthetic as a function of blood pressure and the logarithm of the dosage.
Fit it up!!!
# stack exog variables into design matrix
X = np.column_stack( (LogDose,BP/10.0) )
X = sm.add_constant(X, prepend=True) # add a column of 1's out front
res = sm.OLS(RecovTime,X).fit() # create a model and fit it
print res.summary()
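Rather than retyping the fitted numbers into the plotting code below, we can pull them straight from the result object (a small sketch; the hard-coded values below came from a fit like this one):
b0, b1, b2 = res.params   # ordered [constant, LogDose, BP/10], matching the columns of X
print b0, b1, b2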
Now let's plot:
fig = plt.figure(figsize=(8,5))
ax = Axes3D(fig)
x = np.linspace(min(LogDose),max(LogDose),50)
y = np.linspace(min(BP/10.0),max(BP/10.0),50)
x, y = np.meshgrid(x,y)
z = 22.2712 + 10.6399*x - 7.4009*y   # coefficients copied from res.params above
ax.scatter(LogDose,BP/10.0,RecovTime, c='r', s=50)
ax.plot_surface(x,y,z, alpha=0.5, lw=0)
ax.set_xlabel("LogDose")
ax.set_ylabel("Blood Pressure/10")
ax.set_zlabel("Recovery Time")
# point-of-view: (altitude degrees, azimuth degrees)
ax.view_init(30, 0)
def animate(t):
    # update POV:
    ax.view_init(30 + np.sin(2*np.pi*t/120.0)*40,
                 3*t)
    return []

animation.FuncAnimation(fig, animate,
                        frames=120,  # number of frames to draw
                        interval=40, # time (ms) on each frame
                        blit=True)
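Finally, with the fitted model we can predict the recovery time for a new patient; a minimal sketch with a hypothetical observation (LogDose = 1.0, BP = 60, values invented for illustration):
# the new observation must match the design matrix layout: [1, LogDose, BP/10]
x_new = np.array([1.0, 1.0, 60/10.0])
print res.predict(x_new)   # predicted recovery time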