# make figures better:
import warnings
warnings.filterwarnings('ignore')

import random, math
import numpy as np
import scipy, scipy.stats
import matplotlib
import matplotlib.pyplot as plt

font = {'weight':'normal', 'size':16}
matplotlib.rc('font', **font)
matplotlib.rc('figure', figsize=(9.0, 6.0))
matplotlib.rc('xtick.major', pad=8)  # xticks too close to border!
matplotlib.rc('ytick.major', pad=10) # yticks too close to border!
matplotlib.rc('xtick', labelsize='small')
matplotlib.rc('ytick', labelsize='small')

nicered = "#E6072A"
niceblu = "#424FA4"
nicegrn = "#6DC048"
James Bagrow, james.bagrow@uvm.edu, http://bagrow.com
We discussed linear regression way back in Lecture 9.
First, let's recap:
The goal with linear regression was to find the line that best fit some given XY-data (and test if that fit was significant). For the \(i\)th data point:
\[y_i = \beta_0 + \beta_1 x_i + \epsilon_i\]
We want to find the pair of values (\(\beta_0\),\(\beta_1\)) that minimize the sum of squared errors \(S\):
\[S(\beta) = \sum_i \epsilon_i^2 = \sum_i \left[y_i - \left(\beta_0 + \beta_1 x_i\right)\right]^2\]
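To make the objective concrete, here's a minimal sketch (with made-up toy data) that evaluates \(S\) for a couple of candidate coefficient pairs:
import numpy as np

# toy data (made up for illustration):
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 1.9, 3.2, 3.9])

def sum_sq_errors(b0, b1, x, y):
    """S(beta) = sum_i [ y_i - (b0 + b1*x_i) ]^2"""
    resid = y - (b0 + b1*x)
    return np.sum(resid**2)

print sum_sq_errors(1.0, 1.0, x, y) # a good candidate line gives a small S
print sum_sq_errors(0.0, 2.0, x, y) # a worse line gives a larger S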
Of course, the model equation holds for all the data points together:
\[y_1 = \beta_0 + \beta_1 x_1 + \epsilon_1\\ y_2 = \beta_0 + \beta_1 x_2 + \epsilon_2\\ y_3 = \beta_0 + \beta_1 x_3 + \epsilon_3\\ \vdots \\ y_n = \beta_0 + \beta_1 x_n + \epsilon_n \]
We can write this in matrix notation:
\[\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \epsilon_3 \\ \vdots \\ \epsilon_n \end{bmatrix}\]
or, compactly,
\[\mathbf{y} = \mathbf{X}\mathbf{\beta} + \mathbf{\epsilon}\]
Now a single matrix equation captures all of the individual linear regression equations. Furthermore, if we want to estimate the \(\mathbf{\beta}\)'s with Ordinary Least Squares (OLS), we have a (relatively) simple way to write down the best \(\mathbf{\beta}\)'s.
In matrix form the residuals are:
\[\mathbf{\epsilon} = \mathbf{y} - \mathbf{X}\mathbf{\beta}\]
The sum of the squared residuals is then (using the inner product):
\[\mathbf{\epsilon}^\mathrm{T}\mathbf{\epsilon} = \left(\mathbf{y} - \mathbf{X}\mathbf{\beta}\right)^\mathrm{T}\left(\mathbf{y} - \mathbf{X}\mathbf{\beta}\right)\]
where \(\mathbf{\epsilon}^\mathrm{T}\) is the transpose of \(\mathbf{\epsilon}\).
With a little calculus and elbow grease we can find the minimum of this expression, which gives us the best estimates for \(\mathbf{\beta}\), called \(\mathbf{\hat{\beta}}\):
\[\mathbf{\hat{\beta}} = \left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1}\mathbf{X}^\mathrm{T} \mathbf{y}\]
assuming \(\mathbf{X}^\mathrm{T}\mathbf{X}\) is nonsingular, meaning we can compute the matrix inverse.
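As a quick sanity check of this formula, here is a minimal NumPy sketch (with made-up data and arbitrary coefficient values) that evaluates \(\left(\mathbf{X}^\mathrm{T}\mathbf{X}\right)^{-1}\mathbf{X}^\mathrm{T}\mathbf{y}\) directly:
import numpy as np

np.random.seed(0)
n = 200
X = np.column_stack(( np.ones(n), np.random.random(n) ))  # column of 1's, then the x data
beta_true = np.array([2.0, 3.0])                          # arbitrary "true" coefficients
y = np.dot(X, beta_true) + np.random.randn(n)*0.1         # data generated from the model

# beta_hat = (X^T X)^{-1} X^T y, computed by solving the linear system
# (X^T X) beta_hat = X^T y rather than explicitly inverting the matrix:
beta_hat = np.linalg.solve(np.dot(X.T, X), np.dot(X.T, y))
print beta_hat   # should be close to [2.0, 3.0]
In practice you rarely form the inverse explicitly; np.linalg.lstsq, or a statistics library like the statsmodels routines below, handles this more robustly.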
We already computed the values of \(\beta_0\) and \(\beta_1\), so why are we now getting into all this matrix nonsense?
Because the above derivation is not limited to two coefficients and a linear regression of the form \(y = \beta_0 + \beta_1 x\).
Until now, we've been doing simple linear regression.
There is no reason why we need to stop our equation at two coefficients. What about this:
\[\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix} + \begin{bmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{bmatrix} \]
In matrix form that is exactly the same equation:
\[\mathbf{y} = \mathbf{X}\mathbf{\beta} + \mathbf{\epsilon}\]
except that \(\mathbf{X}\) has \(p+1\) columns and \(\mathbf{\beta}\) has \(p+1\) rows.
For example, we may have reason to study a linear model relating a person's height and weight:
\[\mathrm{weight} = \beta_0 + \beta_1 \mathrm{height}\]
Now we can also incorporate (probably unrealistically) age data as well:
\[\mathrm{weight} = \beta_0 + \beta_1 \mathrm{height} + \beta_2 \mathrm{age}\]
We have a new coefficient, and we have a new column (the age column) in our data matrix. But estimating \(\mathbf{\hat{\beta}}\) remains the same.
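As a small sketch of what this means for the data matrix (with hypothetical height and age numbers invented for illustration), adding the age predictor simply adds a column to \(\mathbf{X}\) and a row to \(\mathbf{\beta}\):
import numpy as np

# hypothetical measurements for five people (made up for illustration):
height = np.array([1.60, 1.65, 1.70, 1.75, 1.80])   # meters
age    = np.array([25.0, 32.0, 41.0, 29.0, 37.0])   # years

# simple regression, weight ~ height:  columns [1, height]  -> two betas
X_simple   = np.column_stack(( np.ones(len(height)), height ))

# multiple regression, weight ~ height + age:  columns [1, height, age]  -> three betas
X_multiple = np.column_stack(( np.ones(len(height)), height, age ))

print X_simple.shape     # (5, 2)
print X_multiple.shape   # (5, 3)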
There are a lot of names for the quantities in these equations, and it's hard to keep them straight.
The \(y_i\)'s are called the dependent variable, the response variable, or the endogenous variable.
The \(x_i\)'s (including the constant \(1\)) are called the independent variables, the explanatory variables, the regressors, or the exogenous variables.
The matrix of data \(\mathbf{X}\) is called the design matrix.
The \(\beta\)'s are called the regression coefficients, the parameters, or the effects.
The vector \(\mathbf{\beta}\) is called the parameter or coefficient vector.
The \(\epsilon\)'s are called the errors, the error term, or noise.
Let's see this in action:
We're going to use a new library called statsmodels.
First let's generate some data:
nobs = 100
b = [1.0,0.1,0.5] # betas
Xlist = []
Ylist = []
for _ in range(nobs):
    # build the individual observation:
    x1 = np.random.random() # a single number
    x2 = np.random.random()
    e = np.random.randn()*0.05 # random noise
    y = 1.0*b[0] + x1*b[1] + x2*b[2] + e

    # add it to the matrices:
    Xlist.append([1.0,x1,x2]) # design matrix/exog variables
    Ylist.append(y) # endogenous variables
Let's even print the first few lines:
for i in range(5):
    print Ylist[i], "-->", Xlist[i]
print
print "Size Y =", len(Ylist)
print "Size X =", len(Xlist)
Looks good.
If you're comfortable with matrix operations, you can also build the data using matrix multiplication:
X = np.hstack(( np.ones((nobs,1)), np.random.random((nobs,2)) ))
beta = [1, 0.1, 0.5]
e = np.random.randn(nobs)*0.05 # Gaussian noise, as in the loop version
y = np.dot(X, beta) + e
print "(rows, columns)"
print X.shape
print y.shape
But matrix operations are not very comfortable in Python (at least compared to MATLAB), so you might just want to avoid them. It's up to you.
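If you do take the matrix route, NumPy can already solve the least-squares problem for you; here is a quick sketch using the X and y we just built (np.linalg.lstsq also returns some extra diagnostics we ignore here):
beta_hat, resid, rank, sing_vals = np.linalg.lstsq(X, y)
print beta_hat   # should be close to the betas we used: [1, 0.1, 0.5]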
Anyway, now that we've got the data generated, let's use statsmodels' OLS function to find the \(\mathbf{\hat{\beta}}\):
import statsmodels.api as sm
# Fit regression model
model = sm.OLS(y,X) # our matrix data
#model = sm.OLS(Ylist, Xlist) # our lists of data
result = model.fit()
Wow, that was pretty easy, but now we need to understand what result is...
model.fit() returns an object that contains a bunch of data about the fit: residuals, coefficient values, and other useful statistics. It also has a few methods attached to it. One very handy method is summary():
print result.summary()
This type of display is very common across statistical packages. There are three main panels.
The top panel contains information about the size of the data and the overall quality of the fit, as well as some bookkeeping such as the time of the calculation.
The middle panel contains information about each coefficient of the fit.
The bottom panel contains information about the residuals \(\epsilon_i\) of the fit: are they normally distributed (Omnibus test), do they show serial correlations (Durbin-Watson), etc.
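Those residual diagnostics can also be recomputed by hand from the residuals themselves; a minimal sketch using statsmodels' stattools functions, applied to the result object from above:
from statsmodels.stats.stattools import durbin_watson, jarque_bera

resid = result.resid        # the fitted residuals, one per observation
print durbin_watson(resid)  # near 2 means little serial correlation
print jarque_bera(resid)    # (statistic, p-value, skew, kurtosis)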
We can also access the information shown in the summary directly:
print result.params # the coefficients
print result.rsquared
Here are all the things result has:
for name in dir(result):
    if "__" in name: # skip "private" stuff
        continue
    print name
There's lots of data we can interact with regarding the fit...
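A few of the attributes that tend to be the most useful in practice (a short sketch using the same result object; these are standard statsmodels regression-results attributes):
print result.bse              # standard error of each coefficient
print result.pvalues          # p-value for each coefficient
print result.conf_int()       # confidence interval for each coefficient
print result.rsquared_adj     # adjusted R-squared
print result.fittedvalues[:5] # the model's predicted y values (first five)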
Here's another example. In this case we are going to use sm.add_constant
to more quickly add the column of ones to the design matrix.
Height = [1.47,1.50,1.52,1.55,1.57,1.60,1.63,1.65,
1.68,1.70,1.73,1.75,1.78,1.80,1.83]
Nonsense = list(np.random.random(len(Height))-1)
Weight = [52.21,53.12,54.48,55.84,57.20,58.57,59.93,
61.29,63.11,64.47,66.28,68.10,69.92,72.19,74.46]
# stack exog variables into design matrix
X = np.column_stack( (Height,Nonsense) )
X = sm.add_constant(X, prepend=True) # add a column of 1's out front
res = sm.OLS(Weight,X).fit() # create a model and fit it
print res.summary()
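Since the Nonsense column is just random numbers, its coefficient should come out statistically indistinguishable from zero; a quick check (the exact values will differ from run to run because Nonsense is random):
# res.params is ordered [constant, Height, Nonsense], matching the column order of X:
print res.params
print res.pvalues   # the p-value for the Nonsense coefficient should typically be large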
Let's make a quickie plot of the solution and the data:
from mpl_toolkits.mplot3d import Axes3D
from JSAnimation import IPython_display
from matplotlib import animation
fig = plt.figure(figsize=(8,5))
ax = Axes3D(fig)
x = np.linspace(min(Height),max(Height),50)
y = np.linspace(min(Nonsense),max(Nonsense),50)
x, y = np.meshgrid(x,y)
z = -38.9522 + 61.1270*x - 0.2821*y   # coefficients copied from res.params for this run (yours will differ)
ax.scatter(Height,Nonsense,Weight, c='r', s=50)
ax.plot_surface(x,y,z, alpha=0.5, lw=0)
ax.set_xlabel("Height")
ax.set_ylabel("Nonsense")
ax.set_zlabel("Weight")
# point-of-view: (altitude degrees, azimuth degrees)
ax.view_init(30, 0)
def animate(t):
    # update POV:
    ax.view_init(30 + np.sin(2*np.pi*t/120.0)*30,
                 3*t)
    return []

animation.FuncAnimation(fig, animate,
                        frames=120,  # number of frames to draw
                        interval=40, # time (ms) on each frame
                        blit=True)
OK, so we see that, since the "Nonsense" axis isn't meaningful, the fitted plane is essentially flat along that direction. Let's do an example with real data, where both variables matter:
data = np.genfromtxt("anaesthesia_1959.csv", dtype=float,
delimiter=',', names=True)
# load each column into separate array:
LogDose = data["LogDose"]
BP = data["BP"]
RecovTime = data["RecovTime"]
This data measures the time to recovery from anaesthetic as a function of blood pressure and the logarithm of the dosage.
Fit it up!!!
# stack exog variables into design matrix
X = np.column_stack( (LogDose,BP/10.0) )
X = sm.add_constant(X, prepend=True) # add a column of 1's out front
res = sm.OLS(RecovTime,X).fit() # create a model and fit it
print res.summary()
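Rather than retyping the fitted numbers into the plotting code below, we can pull them straight from the result object (a small sketch; the hard-coded values below came from a fit like this one):
b0, b1, b2 = res.params   # ordered [constant, LogDose, BP/10], matching the columns of X
print b0, b1, b2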
Now let's plot:
fig = plt.figure(figsize=(8,5))
ax = Axes3D(fig)
x = np.linspace(min(LogDose),max(LogDose),50)
y = np.linspace(min(BP/10.0),max(BP/10.0),50)
x, y = np.meshgrid(x,y)
z = 22.2712 + 10.6399*x - 7.4009*y   # coefficients copied from res.params above
ax.scatter(LogDose,BP/10.0,RecovTime, c='r', s=50)
ax.plot_surface(x,y,z, alpha=0.5, lw=0)
ax.set_xlabel("LogDose")
ax.set_ylabel("Blood Pressure/10")
ax.set_zlabel("Recovery Time")
# point-of-view: (altitude degrees, azimuth degrees)
ax.view_init(30, 0)
def animate(t):
    # update POV:
    ax.view_init(30 + np.sin(2*np.pi*t/120.0)*40,
                 3*t)
    return []

animation.FuncAnimation(fig, animate,
                        frames=120,  # number of frames to draw
                        interval=40, # time (ms) on each frame
                        blit=True)
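Finally, with the fitted model we can predict the recovery time for a new patient; a minimal sketch with a hypothetical observation (LogDose = 1.0, BP = 60, values invented for illustration):
# the new observation must match the design matrix layout: [1, LogDose, BP/10]
x_new = np.array([1.0, 1.0, 60/10.0])
print res.predict(x_new)   # predicted recovery time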