This note is to remind myself about some of the basic concepts of linear regression.

## Bibliography

[1] G. James, D. Witten, T. Hastie, R. Tibshirani, *An Introduction to Statistical Learning*, Springer.

## Setup

In [1]:
import numpy as np
import scipy.stats
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
%matplotlib inline

In [3]:
(blue, orange, red, green, purple, brown, pink, yellow, lightred, lightblue,
 lightorange, lightgreen, lightpurple) = \
    ('#377eb8', '#ff7f00', '#e41a1c', '#4daf4a', '#984ea3', '#a65628', '#f781bf',
     '#d2d215', '#fb9a99', '#a6cee3', '#fdbf6f', '#b2df8a', '#cab2d6')


## Creating some (noisy) toy data

In [4]:
def create_data(b0, b1, noise_sd, n=100):
    """Return a dataframe with n rows and columns 'X' and 'Y'.

    The dataframe will contain random data points that follow
    Y = b1 * X + b0 + eps, where the error eps is drawn from
    a normal distribution with standard deviation noise_sd.
    """
    x = np.sort(10 * np.random.rand(n))
    y = b1 * x + b0
    noise = scipy.stats.norm(scale=noise_sd)
    y += noise.rvs(len(x))
    return pd.DataFrame({'X': x, 'Y': y})
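
Since the data is random, every run of the notebook produces a different sample. To make a run reproducible, one could seed NumPy's global random number generator before calling create_data (the seed value below is an arbitrary choice, not part of the original analysis):

np.random.seed(42)  # arbitrary seed, for reproducibility only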

In [5]:
b0_true = 0.3
b1_true = 1.2
noise_sd = 2.0  # σ of the normal noise distribution

data = create_data(b0_true, b1_true, noise_sd)

def plot_data(data, *lines, show=True, **kwargs):
    fig, ax = plt.subplots(**kwargs)
    data.plot.scatter('X', 'Y', ax=ax)
    for (b0, b1, label, color) in lines:
        ax.plot(
            np.linspace(0, 10, 5),
            b0 + np.linspace(0, 10, 5) * b1,
            color=color, label=label)
    ax.legend()
    if show:
        plt.show()
    else:
        return fig, ax

plot_data(data, (b0_true, b1_true, 'true line', red))


## Linear Regression with statsmodels

In [6]:
import statsmodels.api as sm

In [7]:
mod = sm.OLS(data['Y'], sm.add_constant(data['X']))

In [8]:
res = mod.fit()
res.summary()

Out[8]:
                            OLS Regression Results
==============================================================================
Dep. Variable:                      Y   R-squared:                       0.699
Model:                            OLS   Adj. R-squared:                  0.696
Method:                 Least Squares   F-statistic:                     228.0
Date:                Sun, 22 Jan 2023   Prob (F-statistic):           2.55e-27
Time:                        12:08:04   Log-Likelihood:                -207.72
No. Observations:                 100   AIC:                             419.4
Df Residuals:                      98   BIC:                             424.7
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.3407      0.346      3.876      0.000       0.654       2.027
X              1.0144      0.067     15.098      0.000       0.881       1.148
==============================================================================
Omnibus:                        2.898   Durbin-Watson:                   2.206
Prob(Omnibus):                  0.235   Jarque-Bera (JB):                2.676
Skew:                          -0.400   Prob(JB):                        0.262
Kurtosis:                       2.960   Cond. No.                         9.36
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [9]:
res.params

Out[9]:
const    1.340737
X        1.014445
dtype: float64

## Manual Linear Regression
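
The function below implements the standard least-squares estimates from [1] Eq. (3.4):

$$\hat{b}_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2}\,, \qquad \hat{b}_0 = \bar{y} - \hat{b}_1\,\bar{x}\,.$$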

In [10]:
def linear_regression(data):
    """Estimated b0_hat, b1_hat for the given data.

    See [1] Eq. (3.4)
    """
    xbar = data['X'].mean()
    ybar = data['Y'].mean()
    delta_x = data['X'] - xbar
    delta_y = data['Y'] - ybar
    b1_hat = np.dot(delta_x, delta_y) / np.dot(delta_x, delta_x)
    b0_hat = ybar - b1_hat * xbar
    return b0_hat, b1_hat

In [11]:
b0_hat, b1_hat = linear_regression(data)

In [12]:
b0_hat, b1_hat

Out[12]:
(1.3407370728448775, 1.0144446957055053)
In [13]:
b0_true, b1_true

Out[13]:
(0.3, 1.2)
In [14]:
plot_data(
    data,
    (b0_true, b1_true, 'true line ("population regression line")', red),
    (b0_hat, b1_hat, 'fit ("least squares line")', blue)
)


## Goodness of Fit

### Standard Error

The standard error describes how close an estimated quantity is expected to be to the actual quantity.

It is the standard deviation of the "sampling distribution" of a statistical quantity: if you take a large number of independent samples, each of size $n$, and calculate the statistical quantity for each sample individually, then the standard error is the standard deviation of the distribution of those results. For the sample mean, it is $\sigma/\sqrt{n}$, where $\sigma$ is the (possibly unknown) standard deviation of the original distribution. Standard errors go to zero as sample sizes go to infinity; standard deviations do not.
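
A quick simulation illustrates this for the sample mean (the sample size and number of repetitions below are arbitrary choices):

n_rep, n = 10000, 100
sigma = 2.0
sample_means = scipy.stats.norm(scale=sigma).rvs((n_rep, n)).mean(axis=1)
print(sample_means.std())   # ≈ standard error of the mean
print(sigma / np.sqrt(n))   # σ/√n = 0.2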

In [15]:
def residuals(data, b0, b1):
    """Array of residuals from the perfect linear relation Y = b1*X + b0"""
    Y_hat = b1 * data['X'] + b0
    return data['Y'] - Y_hat

In [16]:
def RSS(data, b0, b1):
    """Residual Sum of Squares

    How much the Y values vary around the predicted Y values
    """
    return (residuals(data, b0, b1)**2).sum()

In [17]:
def TSS(data):
    """Total Sum of Squares

    How much the Y values vary around the mean Y value
    """
    return ((data['Y'] - data['Y'].mean())**2).sum()

In [18]:
def RSE(data, b0, b1):
    """Residual standard error

    This is an estimate for the standard deviation σ of the true
    noise distribution (noise_sd). The y-values in the sample deviate
    from the true regression line by about RSE units on average.
    """
    return np.sqrt(RSS(data, b0, b1) / (len(data) - 2))

In [19]:
RSS(data, b0_hat, b1_hat)

Out[19]:
373.07675962480016
In [20]:
RSE(data, b0_hat, b1_hat)

Out[20]:
1.9511293423187364
In [21]:
noise_sd

Out[21]:
2.0
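
The function below implements [1] Eq. (3.8):

$$\mathrm{SE}(\hat{b}_0)^2 = \sigma^2 \left[\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\right], \qquad \mathrm{SE}(\hat{b}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\,.$$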
In [22]:
def linear_regression_SE(data, sig):
    """Calculate the standard error for the estimate of b0, b1 in a
    linear regression for the given data, assuming a standard deviation of
    sig for the distribution of noise on Y. As sig is generally not known,
    it should be estimated by the RSE.

    Note that the standard error is a property of the sampling only:
    the Y-values do not enter!

    See [1] Eq. (3.8)
    """
    n = len(data)
    xbar = data['X'].mean()
    delta_x = data['X'] - xbar
    denom = np.dot(delta_x, delta_x)
    se_b0_sq = sig**2 * (1.0 / n + xbar**2 / denom)
    se_b1_sq = sig**2 / denom
    return np.sqrt(se_b0_sq), np.sqrt(se_b1_sq)

In [23]:
linear_regression_SE(data, RSE(data, b0_hat, b1_hat))

Out[23]:
(0.34590880729195916, 0.06718913728836097)

The 95% confidence interval is approximately the estimate $\pm$ two standard errors: with 95% probability, an interval constructed this way contains the true value.
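
For example, for the slope (using the exact $t$-quantile instead of the factor two; conf_int is the corresponding statsmodels built-in):

se_b0, se_b1 = linear_regression_SE(data, RSE(data, b0_hat, b1_hat))
t975 = scipy.stats.t(df=len(data) - 2).ppf(0.975)  # ≈ 1.98, close to 2
print((b1_hat - t975 * se_b1, b1_hat + t975 * se_b1))
print(res.conf_int())  # statsmodels' confidence intervals for const and X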

### Confidence Interval plot

In [24]:
def regline_SE(data, sig, xgrid):
    """The standard error of the full regression line, assuming
    a standard deviation of sig for the distribution of noise on Y. As
    sig is generally not known, it should be estimated by the RSE.
    """
    n = len(data)
    return sig * np.sqrt(
        (1.0 / n) + (
            (xgrid - data['X'].mean())**2 /
            ((data['X'] - data['X'].mean())**2).sum()))
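
This is the standard pointwise formula for the standard error of the estimated regression line at a point $x$:

$$\mathrm{SE}(\hat{y}(x))^2 = \sigma^2 \left[\frac{1}{n} + \frac{(x - \bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\right]\,.$$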


The 95% confidence band can be calculated as:

In [25]:
def regline_confidence_bands(data, b0_hat, b1_hat, xgrid, confidence=0.95):
    """Return two lines (arrays of y-values), the lower and upper boundary
    of the confidence band"""
    Y_hat = b0_hat + b1_hat * xgrid  # the fitted line, on the given xgrid
    n = len(data)
    sig = RSE(data, b0_hat, b1_hat)
    se = regline_SE(data, sig, xgrid)
    t_minus, t_plus = scipy.stats.t(df=n-2).interval(confidence)
    return Y_hat + t_minus * se, Y_hat + t_plus * se

In [26]:
fig, ax = plot_data(
    data,
    (b0_true, b1_true, 'true line ("population regression line")', red),
    (b0_hat, b1_hat, 'fit ("least squares line")', blue),
    show=False, figsize=(10, 6)
)
xgrid = np.linspace(0, 10, 100)
lm, lp = regline_confidence_bands(data, b0_hat, b1_hat, xgrid)
ax.fill_between(
    xgrid, lm, lp,
    color='black', alpha=0.2, label='95% confidence band')
ax.legend()
plt.show()


The confidence band should be interpreted as follows: at each point $x$, a band constructed in this way will, with 95% probability, contain the value of the true (population) regression line.
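
This can be checked by simulation: repeatedly draw a new dataset, refit, and count how often the band covers the true line at a fixed point (x = 5.0 is an arbitrary choice):

x0 = 5.0
y0_true = b0_true + b1_true * x0
n_rep = 1000
n_covered = 0
for _ in range(n_rep):
    sample = create_data(b0_true, b1_true, noise_sd)
    b0, b1 = linear_regression(sample)
    lower, upper = regline_confidence_bands(sample, b0, b1, np.array([x0]))
    if lower[0] <= y0_true <= upper[0]:
        n_covered += 1
print(n_covered / n_rep)  # should be close to 0.95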

### Testing the null hypothesis

The null hypothesis is that Y and X have no relationship, i.e. $b_1 = 0$. The $t$-statistic is $\hat{b}_1$ in units of its standard error:

In [27]:
t = b1_hat / linear_regression_SE(data, RSE(data, b0_hat, b1_hat))[1]
t

Out[27]:
15.09834381935479

Under the null hypothesis, the $t$-statistic follows a Student $t$-distribution with $n-2$ degrees of freedom, so we can calculate a p-value (the probability of observing a value of $|t|$ at least this large if $b_1$ were actually 0):

In [28]:
2 * scipy.stats.t(df=len(data)-2).sf(t)  # = 1 - (cdf(t) - cdf(-t))

Out[28]:
2.5471330315828787e-27
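
As a cross-check, statsmodels computes the same statistics on the fit result:

res.tvalues['X'], res.pvalues['X']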

### The $R^2$ statistic

In [29]:
def R_sq(data, b0, b1):
    """R^2 in [0, 1] measures how well the variation in Y around the mean
    is captured by the variation in Y around the predicted values.

    That is, how much of the variability in Y is explained by the
    regression on X."""
    return 1 - RSS(data, b0, b1) / TSS(data)

In [30]:
R_sq(data, b0_hat, b1_hat)

Out[30]:
0.6993496006172107

A value close to 1 means that the linear fit explains most of the variability in Y.
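
In simple linear regression, $R^2$ equals the squared sample correlation between X and Y (see [1]), which gives a quick cross-check:

np.corrcoef(data['X'], data['Y'])[0, 1]**2  # should match R_sq(data, b0_hat, b1_hat)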

### Normality of the residuals

A graphical way to check that the residuals are normally distributed is to plot the quantiles of the residuals against the quantiles of a normal distribution, see https://stats.stackexchange.com/questions/321061/probability-that-residuals-are-normal/321071#321071

In [31]:
fig, ax = plt.subplots()
ax.plot(
scipy.stats.norm().ppf(np.linspace(0, 1, len(data))),
residuals(data, b0_hat, b1_hat).sort_values() / RSE(data, b0_hat, b1_hat))
ax.set_title('QQ plot')
ax.plot(
np.linspace(-3, 3, 10), np.linspace(-3, 3, 10), color='black',
ls='dashed')
ax.set_xlabel('theoretical quantiles')
ax.set_ylabel('sample quantiles')
plt.show(fig)
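
scipy provides this kind of plot as a built-in; probplot draws the points and additionally fits a straight line through them:

fig, ax = plt.subplots()
scipy.stats.probplot(residuals(data, b0_hat, b1_hat), plot=ax)
plt.show()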


## Fit with known slope
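
If the slope $b_1$ is known, only the intercept needs to be estimated. Setting the derivative of $\mathrm{RSS}(b_0) = \sum_i (y_i - b_1 x_i - b_0)^2$ with respect to $b_0$ to zero yields

$$\hat{b}_0 = \frac{1}{n}\sum_{i=1}^n (y_i - b_1 x_i) = \bar{y} - b_1 \bar{x}\,.$$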

In [32]:
b0_known_slope = np.mean(data['Y'] - b1_true*data['X'])

In [33]:
b0_known_slope

Out[33]:
0.5519198289545456
In [34]:
b0_true

Out[34]:
0.3
In [35]:
plot_data(
    data,
    (b0_true, b1_true, 'true line', red),
    (b0_hat, b1_hat, 'fit', blue),
    (b0_known_slope, b1_true, 'fit with known slope', orange),
    show=True,
)