In this lesson you will learn how to perform hypothesis tests for the intercept and slope in the simple linear regression model, how to calculate confidence intervals for these parameters as well as for the expected value of the response variable, and how to calculate a prediction interval for the response variable.
Suppose we have \(n\) observations \((x_i, y_i)\), \(i=1, \ldots, n\). Then the model assumes the relationship \[y_i =\beta_0+\beta_1 x_i + \epsilon_i\] where \(\beta_0\) and \(\beta_1\) are unknown parameters and the errors \(\epsilon_i\) are independent, normally distributed random variables with mean 0 and common variance \(\sigma^2\).
Note that the model contains three unknown parameters (\(\beta_0\), \(\beta_1\), and \(\sigma^2\)). One is typically interested in the first two, while the last is considered more of a nuisance and is therefore sometimes called a nuisance parameter. The variable \(y_i\) is a random variable, while the variable \(x_i\) could be a constant or a random variable. However, the interpretation of the results depends on whether \(x_i\) is assumed to be fixed or random.
It might be confusing that the notation above does not distinguish between a random variable and its observed value.
As an example, consider the relationship between the price of a house measured in dollars and the amount of living space of the house measured in square feet. In this case the linear regression model assumes that the average price of a house for a given amount of living space is linearly related to the living space and that the deviations from the average prices are independent, normally distributed, and have the same standard deviation.
The \(n\) equations, \[ \begin{eqnarray*} y_1 &=& \beta_0+\beta_1 x_1 + \epsilon_1\\ y_2 &=& \beta_0+\beta_1 x_2 + \epsilon_2\\ \vdots &=& \quad \quad \vdots \\ y_n &=& \beta_0+\beta_1 x_n + \epsilon_n \end{eqnarray*} \] can be written as a matrix equation \[ Y = X\beta + \epsilon \] where \[ Y = \left[\begin{array}{c} y_1\\ y_2\\ \vdots \\ y_n \end{array}\right], \quad X = \left[\begin{array}{cc} 1 & x_1\\ 1 & x_2\\ \vdots & \vdots \\ 1 & x_n \end{array}\right], \quad \beta = \left[\begin{array}{c} \beta_0\\ \beta_1 \end{array}\right], \quad \epsilon = \left[\begin{array}{c} \epsilon_1\\ \epsilon_2\\ \vdots \\ \epsilon_n \end{array}\right]. \]
The assumption that the errors are independent and normally distributed with mean 0 and variance \(\sigma^2\) can be written as \(\epsilon \sim N(0, \sigma^2 I)\), that is, \(\epsilon\) has the multivariate normal distribution with mean 0 (0 vector) and variance-covariance matrix \(\sigma^2 I\) (\(I\) denotes the n by n identity matrix). In summary, we can write the model as \[ Y = X\beta + \epsilon, \quad \epsilon \sim N(0, \sigma^2 I). \]
One can show that the least squares estimator of \(\beta\) is given by \(\hat{\beta} = (X^T X)^{-1}X^T Y\), that \(E[\hat{\beta}] = \beta\), and that the variance-covariance matrix of \(\hat{\beta}\) is \(\sigma^2(X^T X)^{-1}\). If the columns of the matrix \(X\) are not linearly independent then the inverse of the matrix \(X^TX\) does not exist and needs to be replaced by the pseudo-inverse.
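For readers comfortable with matrix operations, here is a minimal (entirely optional) sketch that computes \(\hat{\beta} = (X^T X)^{-1}X^T Y\) directly and compares it with the output of lm(). The small data set below is made up purely for illustration; it is not the housing data introduced later.

| # Made-up data for illustration only |
| x <- c(1, 2, 3, 4, 5) |
| y <- c(2.1, 3.9, 6.2, 8.1, 9.8) |
| X <- cbind(1, x)                            # design matrix with a column of 1s |
| beta.hat <- solve(t(X) %*% X, t(X) %*% y)   # solves (X^T X) b = X^T y |
| beta.hat |
| coef(lm(y ~ x))                             # same estimates from lm() |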
The point of all this is that linear algebra provides the most convenient way to derive the estimator of \(\beta\) and its properties as well as methods for interval estimation and hypothesis testing for linear regression models. Since I cannot assume that you are familiar with linear algebra, I will not provide a derivation of the methods we are going to discuss next. Instead, I will just show you how to do the calculations using R and how to interpret the results.
In order to illustrate the methods with a small data set, I collected the price (in thousands of dollars) and the living space (measured in square feet) for a dozen homes for sale in Auburn (on June 18, 2016). The data is stored in homes.csv.
Confidence intervals for the intercept and slope can be calculated with the function confint(). By default, the function returns 95% two-sided confidence intervals.
| homes <- read.csv("homes.csv") |
| fm <- lm(price ~ space, data=homes) |
| confint(fm) |
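As a sanity check, the intervals returned by confint() can be reproduced from the estimates and standard errors using the t distribution with \(n-2\) degrees of freedom; a sketch (the column names are those used in the summary.lm() coefficient table):

| est <- coef(summary(fm))[, "Estimate"] |
| se  <- coef(summary(fm))[, "Std. Error"] |
| tcrit <- qt(0.975, df.residual(fm))          # n - 2 = 10 degrees of freedom |
| cbind(lower = est - tcrit*se, upper = est + tcrit*se) |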
In order to calculate a 90% two-sided confidence interval we can use the option level=0.9. Lower-bound and upper-bound confidence intervals can be found from the appropriate two-sided confidence interval. More specifically, a \(100(1-\alpha)\)% lower-bound (or upper-bound) confidence interval is found from the two-sided \(100(1-2\alpha)\)% confidence interval. To calculate 95% lower-bound confidence intervals for example we calculate a two-sided 90% confidence interval.
| confint(fm, level=0.9) |
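If only the lower endpoint for the slope is needed, it can be picked out of the matrix that confint() returns, for example:

| confint(fm, level=0.9)["space", 1] |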
A 95% lower-bound confidence interval for the slope is then given by \((0.09591061, \infty)\). This result can be interpreted as follows. One can be 95% confident that an increase of 1 square foot of living space changes the average price of a house by at least $95.91.
Let \(\gamma_0\) be some specified number. Then we can perform the following hypothesis tests for the intercept \(\beta_0\): \[H_0: \beta_0 = \gamma_0 \mbox{ versus } H_A: \beta_0 \neq \gamma_0,\] \[H_0: \beta_0 \leq \gamma_0 \mbox{ versus } H_A: \beta_0 \gt \gamma_0,\] \[H_0: \beta_0 \geq \gamma_0 \mbox{ versus } H_A: \beta_0 \lt \gamma_0.\]
Each of these tests can be performed by calculating the corresponding confidence interval. Recall that if you want to test at significance level \(\alpha\) then you need to calculate the appropriate \(100(1-\alpha)\)% confidence interval. You reject the null hypothesis in favor of the alternative hypothesis if the appropriate confidence interval does not contain any value that is consistent with the null hypothesis. For example, you reject \(H_0: \beta_0 \leq \gamma_0\) in favor of \(H_A: \beta_0 \gt \gamma_0\) at significance level \(\alpha\) if the \(100(1-\alpha)\)% lower-bound confidence interval does not contain \(\gamma_0\).
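As an illustration of this check in R, using a hypothetical value \(\gamma_0 = 20\) for the intercept (the value 20 is chosen only for the sake of the example):

| gamma0 <- 20                                          # hypothetical value, in $1000s |
| lower  <- confint(fm, level=0.9)["(Intercept)", 1]    # lower end of 95% lower-bound CI |
| gamma0 < lower                                        # TRUE means reject H0: beta_0 <= gamma0 at alpha = 0.05 |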
The same approach also works for performing hypothesis tests for the slope. In particular, one might ask if there is evidence that the explanatory variable is a useful component of the regression model. To answer this question we perform the test \[H_0: \beta_1 = 0 \mbox{ versus } H_A: \beta_1 \neq 0.\] Note that if there is no evidence that the slope \(\beta_1\) is different from 0 then one might want to simplify the model from \(y = \beta_0 + \beta_1 x + \epsilon\) to \(y = \beta_0 + \epsilon\) by the parsimony principle (or equivalently Occam's razor).
Sometimes the p-value of the test is desired. If \(\gamma_0 = 0\) then the p-value for the test \[H_0: \beta_1 = 0 \mbox{ versus } H_A: \beta_1 \neq 0\] can be calculated with the function summary().
| summary(fm) |
The p-value for a one-sided test can be deduced from the two-sided test. Alternatively, it can be calculated from the value of the test statistic, which is provided in the output of summary(fm), using the function pt().
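For example, a one-sided p-value for testing \(H_0: \beta_1 \leq 0\) versus \(H_A: \beta_1 \gt 0\) could be computed as follows (a sketch; the row name space and column name "t value" come from the summary.lm() coefficient table):

| tval <- coef(summary(fm))["space", "t value"] |
| 1 - pt(tval, df = df.residual(fm))    # upper-tail p-value for H_A: beta_1 > 0 |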
In the case where the hypothesized value is not 0, the p-value can be calculated from the results of summary(fm). To be specific, consider the test \[H_0: \beta_1 = \gamma_0 \mbox{ versus } H_A: \beta_1 \neq \gamma_0.\] Then the test statistic is given by \[\frac{\hat{\beta}_1-\gamma_0}{\mbox{se}(\hat{\beta}_1)},\] where se(\(\hat{\beta}_1\)) denotes the standard error of the estimate of \(\beta_1\). Both values, \(\hat{\beta}_1\) and se(\(\hat{\beta}_1\)), are provided in the output of summary(fm). The p-value can then be calculated using the function pt().
To be specific, consider the test \[H_0: \beta_1 \leq 0.1 \mbox{ versus } H_A: \beta_1 \gt 0.1\] for our data set.
| test.statistic <- (0.12792 - 0.1)/0.01766   # (beta1.hat - gamma0) / se(beta1.hat) |
| 1 - pt(test.statistic, df=10)               # upper-tail p-value, df = n - 2 = 10 |
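For comparison, the corresponding two-sided p-value doubles the tail probability; this should agree with the modified-model result obtained below.

| 2*(1 - pt(abs(test.statistic), df=10)) |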
Another way to calculate the p-value is to fit a modified model. For testing \[H_0: \beta_0 = \gamma_0 \mbox{ versus } H_A: \beta_0 \neq \gamma_0\] we fit the model \(y-\gamma_0 = \beta_0 + \beta_1 x +\epsilon\). The intercept of this modified model equals the original intercept minus \(\gamma_0\), so the test above is equivalent to testing whether the intercept of the modified model is 0, and that p-value is what summary() reports.
Consider the test \[H_0: \beta_0 = 30 \mbox{ versus } H_A: \beta_0 \neq 30\] for our data set.
| fm2 <- lm(price-30 ~ space, data=homes) |
| summary(fm2) |
The p-value is 0.994 and so there is no evidence that the intercept is different from 30.
For testing \[H_0: \beta_1 = \gamma_0 \mbox{ versus } H_A: \beta_1 \neq \gamma_0\] fit the model \(y-\gamma_0 x = \beta_0 + \beta_1 x +\epsilon\). Here the slope of the modified model equals the original slope minus \(\gamma_0\), so the test again reduces to testing whether a coefficient of the modified model is 0.
Consider again the test \[H_0: \beta_1 \leq 0.1 \mbox{ versus } H_A: \beta_1 \gt 0.1\] for our data set.
| fm3 <- lm(price-0.1*space ~ space, data=homes) |
| summary(fm3) |
The p-value for the two-sided test is 0.145 (rounded to three digits). Since the estimated slope \(\hat{\beta}_1 = 0.128\) lies on the side of the alternative hypothesis, the p-value for the one-sided test \(H_0: \beta_1 \leq 0.1\) versus \(H_A: \beta_1 \gt 0.1\) is half of that, \(0.145/2 = 0.0725\). This matches our previous result.
The predicted value of \(y\) for a given value of \(x\) can be calculated from the equation \[\hat{y} = \hat{\beta}_0+ \hat{\beta}_1 x.\] This predicted value \(\hat{y}\) is an estimate of the average value of \(y\) for the given value of \(x\).
| new.house <- data.frame(space=6000) |
| predict(fm, new.house) |
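The same number can be obtained by plugging the estimated coefficients into the fitted equation directly, for example:

| sum(coef(fm) * c(1, 6000))   # beta0.hat + beta1.hat * 6000 |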
Instead of a point estimate of the value \(E[y] = \beta_0+\beta_1 x\), we may want an interval estimate. Confidence intervals for \(E[y]\) are calculated with the predict() function by specifying the option interval="confidence".
| predict(fm, new.house, interval="confidence") |
We can be 95% confident that the average price of a 6000-square-foot home in this neighborhood is between 694K and 902K.
Interval estimates of the response \(Y\) itself (rather than its expected value) are called prediction intervals. The term confidence interval is reserved for an interval estimate of a fixed quantity. Prediction intervals can be calculated with the function predict() using the option interval="prediction".
| predict(fm, new.house, interval="prediction") |
If a new home comes on the market with a living space of 6000 square feet, then we can be 95% confident that its price will be between 457K and 1,138K.
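Both types of intervals default to the 95% level; other levels can be requested with the level argument of predict(), for example:

| predict(fm, new.house, interval="prediction", level=0.9) |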
The model assumptions are checked by considering the residuals. To check the independence assumption, plot the residuals against the order in which the observations were made.
| plot(residuals(fm), type='l') |
A pattern that is not random suggests a lack of independence. The residuals could be plotted against the predicted values to check the "constant variance" assumption.
| plot(predict(fm), residuals(fm)) |
A qq-plot of the residuals would show violations of the normal distribution assumption.
| qqnorm(residuals(fm)) |
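A reference line, added to the existing plot with the base R function qqline(), can make departures from normality easier to judge:

| qqline(residuals(fm)) |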
Sometimes it is desirable to fit the linear model \(y = \beta_1 x + \epsilon\). To remove the intercept we can use either \(y\sim x-1\) or \(y\sim 0+x\).
| lm(price ~ 0+space, data=homes) |
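Equivalently, using the other formula notation mentioned above:

| lm(price ~ space - 1, data=homes) |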