Regression analysis is a statistical tool that utilizes the relation between two or more quantitative variables such that one variable can be predicted from the other, or others. For example, if one knows the relation between advertising expenditures and sales, one can predict sales by regression analysis once the level of advertising expenditures has been set. A functional relation between two variables is expressed by a mathematical formula. If \(X\) is the independent variable and \(Y\) the dependent variable, a functional relation is of the form: \(Y=f(X)\). Given a particular value of \(X\), the function \(f\) indicates the corresponding value of \(Y\). Here we consider \(f\) as a linear function, i.e. \(f(X)=\beta_{0}+\beta_{1}X\).
There is often interest in the relation between two variables, for example, a student’s scholastic aptitude test score in mathematics and this same student’s grade in calculus. Frequently, one of these variables, say \(x\), is known in advance of the other, and hence there is interest in predicting a future random variable \(Y\). Since \(Y\) is a random variable, we cannot predict its future observed value \(Y=y\) with certainty. Let us first concentrate on the problem of estimating the mean of \(Y\), i.e., \(\mathbb{E}(Y)\). Now \(\mathbb{E}(Y)\) is usually a function of \(x\). In our illustration with the calculus grade, say \(Y\), we would expect \(\mathbb{E}(Y)\) to increase with increasing mathematics aptitude score \(x\). Sometimes \(\mathbb{E}(Y)=\mu (x)\) is assumed to be of a given form such as linear, quadratic, or exponential; that is, \(\mu (x)\) could be assumed to be equal to \(\alpha +\beta x\) or \(\alpha +\beta x+\gamma x^{2}\) or \(\alpha e^{\beta x}\). To estimate \(\mathbb{E}(Y)=\mu (x)\), or equivalently the parameters \(\alpha\), \(\beta\), and \(\gamma\), we observe the random variable \(Y\) for each of \(n\) different values of \(x\), say \(x_{1},\cdots ,x_{n}\). Once the \(n\) independent experiments have been performed, we have \(n\) pairs of known numbers \((x_{1},y_{1}),\cdots ,(x_{n},y_{n})\). These pairs are used to estimate the mean \(\mathbb{E}(Y)\). Problems like this are often classified under regression because \(\mathbb{E}(Y)=\mu (x)\) is frequently called a regression curve. A model for the mean like \(\alpha +\beta x+\gamma x^{2}\) is called a linear model, since it is linear in the parameters \(\alpha\), \(\beta\), and \(\gamma\). By contrast, \(\alpha e^{\beta x}\) is not a linear model, since it is not linear in \(\alpha\) and \(\beta\).
\begin{equation}{\label{a}}\tag{A}\mbox{}\end{equation}
Simple Linear Regression Model.
Let us begin with the case in which \(\mathbb{E}(Y)=\mu (x)\) is a linear function. The data points are \((x_{1},y_{1}),\cdots ,(x_{n},y_{n})\). In addition to assuming that the mean of \(Y\) is a linear function, we assume that for a particular value of \(x\), the value of \(Y\) will differ from its mean by a random amount \(\varepsilon\). We further assume that the distribution of \(\varepsilon\) is \(N(0,\sigma^{2})\). Therefore, we have the following linear model
\[Y_{i}=\alpha_{1}+\beta x_{i}+\varepsilon_{i},\]
where \(\varepsilon_{i}\) for \(i=1,\cdots ,n\) are independent and \(N(0,\sigma^{2})\). We shall now find point estimates for \(\alpha_{1}\),
$\beta$, and \(\sigma^{2}\). For convenience, let \(\alpha_{1}=\alpha -\beta\bar{x}\) such that
\[Y_{i}=\alpha +\beta (x_{i}-\bar{x})+\varepsilon_{i},\]
where
\[\bar{x}=\frac{1}{n}\sum_{i=1}^{n} x_{i}.\]
Then \(Y_{i}\) is equal to a constant, \(\alpha +\beta (x_{i}-\bar{x})\), plus a normal random variable \(\varepsilon_{i}\). Hence \(Y_{1},\cdots ,Y_{n}\) are mutually independent normal random variables with respective means \(\alpha +\beta (x_{i}-\bar{x})\) for \(i=1,\cdots ,n\), and unknown variance \(\sigma^{2}\). Their joint p.d.f. is therefore the product of the individual probability density function. The likelihood function equals
\begin{align*}
L(\alpha ,\beta ,\sigma^{2}) & =\prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left [-\frac{(y_{i}-\alpha -\beta (x_{i}-\bar{x}))^{2}}{2\sigma^{2}}\right ]\\
& =\left (\frac{1}{2\pi\sigma^{2}}\right )^{n/2}\exp\left [-\frac{\sum_{i=1}^{n} (y_{i}-\alpha -\beta (x_{i}-\bar{x}))^{2}}{2\sigma^{2}}\right ].
\end{align*}
To maximize \(L(\alpha ,\beta ,\sigma^{2})\), or, equivalently, to minimize
\[-\ln L(\alpha ,\beta ,\sigma^{2})=\frac{n}{2}\ln (2\pi\sigma^{2})+\frac{\sum_{i=1}^{n} (y_{i}-\alpha -\beta (x_{i}-\bar{x}))^{2}}{2\sigma^{2}},\]
we must select \(\alpha\) and \(\beta\) to minimize
\[H(\alpha ,\beta )=\sum_{i=1}^{n} (y_{i}-\alpha -\beta (x_{i}-\bar{x}))^{2}.\]
Since
\[|y_{i}-\alpha -\beta (x_{i}-\bar{x})|=|y_{i}-\mu (x_{i})|\]
is the vertical distance from the point \((x_{i},y_{i})\) to the line \(y=\mu (x)\), we note that \(H(\alpha ,\beta )\) represents the sum of the squares of those distances. Therefore, selecting \(\alpha\) and \(\beta\) such that the sum of the squares is minimized means that we are fitting the straight line to the data by the method of least squares. To minimize \(H(\alpha ,\beta )\), we find the two first partial derivatives
\[\frac{\partial H(\alpha ,\beta )}{\partial\alpha}=2\sum_{i=1}^{n}(y_{i}-\alpha -\beta(x_{i}-\bar{x}))(-1)\]
and
\[\frac{\partial H(\alpha ,\beta )}{\partial\beta}=2\sum_{i=1}^{n}(y_{i}-\alpha -\beta(x_{i}-\bar{x}))(-(x_{i}-\bar{x})).\]
Setting \(\partial H(\alpha ,\beta )/\partial\alpha =0\), we obtain
\[\sum_{i=1}^{n} y_{i}-n\alpha -\beta\sum_{i=1}^{n} (x_{i}-\bar{x})=0.\]
Since
\[\sum_{i=1}^{n} (x_{i}-\bar{x})=0,\]
we have
\[\sum_{i=1}^{n} y_{i}-n\alpha =0,\]
which says \(\widehat{\alpha}=\bar{Y}\). The equation \(\partial H(\alpha ,\beta )/\partial\beta =0\) yields, with \(\alpha\) replaced by \(\bar{y}\),
\[\sum_{i=1}^{n} (y_{i}-\bar{y})(x_{i}-\bar{x})-\beta\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}=0\]
or, equivalently,
\[\widehat{\beta}=\frac{\sum_{i=1}^{n} (Y_{i}-\bar{Y})(x_{i}-\bar{x})}{\sum_{i=1}^{n} (x_{i}-\bar{x})^{2}}=\frac{\sum_{i=1}^{n} Y_{i}(x_{i}-\bar{x})}{\sum_{i=1}^{n} (x_{i}-\bar{x})^{2}}.\]
To find the mean line of best fit for
\[\mu (x)=\alpha +\beta (x-\bar{x}),\]
we use \(\widehat{\alpha}=\bar{y}\) and
\begin{align*} \widehat{\beta} & =\frac{\sum_{i=1}^{n}y_{i}(x_{i}-\bar{x})}{\sum_{i=1}^{n} (x_{i}-\bar{x})^{2}}\\ & =\frac{\sum_{i=1}^{n} x_{i}y_{i}-\left (\frac{1}{n}\right )\left (\sum_{i=1}^{n} x_{i}\right )\left (\sum_{i=1}^{n} y_{i}\right )}
{\sum_{i=1}^{n} x_{i}^{2}-\left (\frac{1}{n}\right )\left (\sum_{i=1}^{n}x_{i}\right )^{2}}.\end{align*}
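The closed-form estimates above are straightforward to compute directly. A minimal Python sketch (the function name and sample data are illustrative, not from the text):

```python
def fit_line(x, y):
    """Least-squares fit of the model y = alpha + beta * (x - xbar).

    Returns (alpha_hat, beta_hat, xbar), where alpha_hat is the sample mean
    of y and beta_hat = sum (y_i - ybar)(x_i - xbar) / sum (x_i - xbar)^2.
    """
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return ybar, sxy / sxx, xbar

# The points lie exactly on y = 2x, so the fit is exact:
alpha_hat, beta_hat, xbar = fit_line([1, 2, 3], [2, 4, 6])
```

Here the fitted line is \(\widehat{y}=4+2(x-2)\), which agrees with \(y=2x\) at every data point.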
To find the maximum likelihood estimator of \(\sigma^{2}\), we consider the partial derivative
\[\frac{\partial (-\ln L(\alpha ,\beta ,\sigma^{2}))}{\partial (\sigma^{2})}=\frac{n}{2\sigma^{2}}-\frac{\sum_{i=1}^{n} (y_{i}-\alpha -\beta (x_{i}-\bar{x}))^{2}}{2(\sigma^{2})^{2}}.\]
Setting this equal to zero and replacing \(\alpha\) and \(\beta\) by their solutions \(\widehat{\alpha}\) and \(\widehat{\beta}\), we obtain
\begin{equation}{\label{ch5eq2}}\tag{1}
\widehat{\sigma^{2}}=\frac{1}{n}\sum_{i=1}^{n} (Y_{i}-\widehat{\alpha}-\widehat{\beta}(x_{i}-\bar{x}))^{2}.
\end{equation}
A formula for \(n\widehat{\sigma^{2}}\) useful in its calculation is
\[n\widehat{\sigma^{2}}=\sum_{i=1}^{n} y_{i}^{2}-\left (\frac{1}{n}\right )
\left (\sum_{i=1}^{n} y_{i}\right )^{2}-\widehat{\beta}\left (
\sum_{i=1}^{n} x_{i}y_{i}\right )+\widehat{\beta}\left (\frac{1}{n}\right )
\left (\sum_{i=1}^{n} x_{i}\right )\left (\sum_{i=1}^{n} y_{i}\right ).\]
Note that the summand in Equation (\ref{ch5eq2}) for \(\widehat{\sigma^{2}}\) is the square of the difference between the value of \(Y_{i}\) and the predicted mean of \(Y_{i}\). Let
\[\widehat{Y}_{i}=\widehat{\alpha}+\widehat{\beta}(x_{i}-\bar{x}),\]
the predicted mean value of \(Y_{i}\). The difference
\[Y_{i}-\widehat{Y}_{i}=Y_{i}-\widehat{\alpha}-\widehat{\beta}(x_{i}-\bar{x})\]
is called the \(i\)th residual for \(i=1,\cdots ,n\). The maximum likelihood estimate of \(\sigma^{2}\) is then the sum of the squares of the residuals divided by \(n\). It should always be true that the sum of the residuals is equal to zero. However, in practice, due to round off, the sum of the observed residuals, \(y_{i}-\widehat{y}_{i}\), sometimes differs slightly from zero.
\begin{equation}{\label{ch5ex5}}\tag{2}\mbox{}\end{equation}
Example \ref{ch5ex5}. The test scores of \(10\) students in a psychology class are shown in the following table in which \(x\) denotes the score on a preliminary test and \(y\) denotes the score on the final examination.
\[\begin{array}{cccccccc}
\hline x & y & x^{2} & xy & y^{2} & \widehat{y} & y-\widehat{y} & (y-\widehat{y})^{2}\\
\hline 70 & 77 & 4900 & 5390 & 5929 & 82.561566 & -5.561566 & 30.931016\\
74 & 94 & 5476 & 6956 & 8836 & 85.529956 & 8.470044 & 71.741645\\
72 & 88 & 5184 & 6336 & 7744 & 84.045761 & 3.954239 & 15.636006\\
68 & 80 & 4624 & 5440 & 6400 & 81.077371 & -1.077371 & 1.160728\\
58 & 71 & 3364 & 4118 & 5041 & 73.656395 & -2.656395 & 7.056434\\
54 & 76 & 2916 & 4104 & 5776 & 70.688004 & 5.311996 & 28.217302\\
82 & 88 & 6724 & 7216 & 7744 & 91.466737 & -3.466737 & 12.018625\\
64 & 80 & 4096 & 5120 & 6400 & 78.108980 & 1.891020 & 3.575957\\
80 & 90 & 6400 & 7200 & 8100 & 89.982542 & 0.017458 & 0.000305\\
61 & 69 & 3721 & 4209 & 4761 & 75.882687 & -6.882687 & 47.371380\\
683 & 813 & 47405 & 56089 & 66731 & & 0.000001 & 217.709038\\
\hline\end{array}\]
The estimates of \(\alpha\) and \(\beta\) have to be found before the residuals can be calculated. Therefore, we have
\[\widehat{\alpha}=\frac{813}{10}=81.3\mbox{ and }\widehat{\beta}=\frac{56089-683\cdot 813/10}{47405-683\cdot 683/10}=\frac{561.1}{756.1}=0.742.\]
Since \(\bar{x}=683/10=68.3\), the least squares regression line is
\[\widehat{y}=81.3+0.742(x-68.3).\]
The maximum likelihood estimate of \(\sigma^{2}\) is
\[\widehat{\sigma^{2}}=\frac{217.709038}{10}=21.7709.\]
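The estimates in this example can be reproduced numerically. A Python sketch using the data of the table:

```python
x = [70, 74, 72, 68, 58, 54, 82, 64, 80, 61]
y = [77, 94, 88, 80, 71, 76, 88, 80, 90, 69]
n = len(x)

xbar = sum(x) / n                                   # 68.3
alpha_hat = sum(y) / n                              # 81.3
sxy = sum((xi - xbar) * (yi - alpha_hat) for xi, yi in zip(x, y))  # 561.1
sxx = sum((xi - xbar) ** 2 for xi in x)             # 756.1
beta_hat = sxy / sxx                                # about 0.742

# Residuals and the maximum likelihood estimate of sigma^2:
resid = [yi - alpha_hat - beta_hat * (xi - xbar) for xi, yi in zip(x, y)]
sigma2_hat = sum(e * e for e in resid) / n          # about 21.7709
```

The residuals also sum to zero (up to floating-point error), as noted earlier.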
We shall now consider the problem of finding the distributions of \(\widehat{\alpha}\), \(\widehat{\beta}\), and \(\widehat{\sigma^{2}}\) (or distributions of functions of these estimators). During the preceding discussion we have treated \(x_{1},\cdots ,x_{n}\) as constants. Many times they can be set by the experimenter; for example, a chemist in experimentation might produce a compound at many different temperatures. It is also possible that these numbers are observations of earlier random variables, such as an SAT score or a preliminary test grade; in either case we consider the problem conditional on these given \(x\) values. Thus, in finding the distributions of \(\widehat{\alpha}\), \(\widehat{\beta}\), and \(\widehat{\sigma^{2}}\), the only random variables are \(Y_{1},\cdots ,Y_{n}\). Since \(\widehat{\alpha}\) is a linear function of independent and normally distributed random variables, \(\widehat{\alpha}\) has a normal distribution with mean
\begin{align*} \mathbb{E}(\widehat{\alpha}) & =\mathbb{E}\left (\frac{1}{n}\sum_{i=1}^{n} Y_{i}\right )\\ & =\frac{1}{n}\sum_{i=1}^{n} \mathbb{E}(Y_{i})\\ & =\frac{1}{n}\sum_{i=1}^{n} (\alpha +\beta (x_{i}-\bar{x}))=\alpha ,\end{align*}
and variance
\[\mbox{Var}(\widehat{\alpha})=\sum_{i=1}^{n}\left (\frac{1}{n}\right )^{2}\mbox{Var}(Y_{i})=\frac{\sigma^{2}}{n}.\]
The estimator \(\widehat{\beta}\) is also a linear function of \(Y_{1},\cdots ,Y_{n}\) and hence has a normal distribution with mean
\begin{align*}
\mathbb{E}(\widehat{\beta}) & =\frac{\sum_{i=1}^{n} (x_{i}-\bar{x})\mathbb{E}(Y_{i})}{\sum_{i=1}^{n} (x_{i}-\bar{x})^{2}}\\ & =\frac{\sum_{i=1}^{n}(x_{i}-\bar{x})(\alpha +\beta (x_{i}-\bar{x}))}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}\\
& =\frac{\alpha\sum_{i=1}^{n} (x_{i}-\bar{x})+\beta\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}{\sum_{i=1}^{n} (x_{i}-\bar{x})^{2}}=\beta
\end{align*}
and variance
\begin{align*} \mbox{Var}(\widehat{\beta}) & =\sum_{i=1}^{n}\left [\frac{x_{i}-\bar{x}}{\sum_{i=1}^{n} (x_{i}-\bar{x})^{2}}\right ]^{2}\cdot \mbox{Var}(Y_{i})\\ & =\frac{\sum_{i=1}^{n} (x_{i}-\bar{x})^{2}}{\left [\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}\right ]^{2}}\cdot\sigma^{2}\\ & =\frac{\sigma^{2}}{\sum_{i=1}^{n} (x_{i}-\bar{x})^{2}}.\end{align*}
It can be shown that
\begin{align}
\sum_{i=1}^{n} (Y_{i}-\alpha -\beta (x_{i}-\bar{x}))^{2} & =\sum_{i=1}^{n} [(\widehat{\alpha}-\alpha )+(\widehat{\beta}-\beta )(x_{i}-\bar{x})+(Y_{i}-\widehat{\alpha}-\widehat{\beta}(x_{i}-\bar{x}))]^{2}\nonumber\\
& =n(\widehat{\alpha}-\alpha )^{2}+(\widehat{\beta}-\beta )^{2}
\sum_{i=1}^{n} (x_{i}-\bar{x})^{2}+\sum_{i=1}^{n} (Y_{i}-\widehat{\alpha}-\widehat{\beta}(x_{i}-\bar{x}))^{2}.\label{ch5eq3}\tag{3}
\end{align}
From the fact that \(Y_{i}\), \(\widehat{\alpha}\) and \(\widehat{\beta}\) have normal distributions, we know that each of
\[\frac{(Y_{i}-\alpha -\beta (x_{i}-\bar{x}))^{2}}{\sigma^{2}},\quad\frac{(\widehat{\alpha}-\alpha )^{2}}{\sigma^{2}/n}\mbox{ and }
\frac{(\widehat{\beta}-\beta )^{2}}{\sigma^{2}/\sum_{i=1}^{n} (x_{i}-\bar{x})^{2}}\]
has a chi-square distribution with one degree of freedom. Since \(Y_{1},\cdots ,Y_{n}\) are mutually independent, then
\[\frac{1}{\sigma^{2}}\sum_{i=1}^{n} (Y_{i}-\alpha -\beta (x_{i}-\bar{x}))^{2}\]
is \(\chi^{2}(n)\). That is, the left-hand member of (\ref{ch5eq3}) divided by \(\sigma^{2}\) is \(\chi^{2}(n)\) and is equal to the sum of two \(\chi^{2}(1)\) random variables and
\[\frac{1}{\sigma^{2}}\sum_{i=1}^{n} (Y_{i}-\widehat{\alpha}-\widehat{\beta}
(x_{i}-\bar{x}))^{2}=\frac{n\widehat{\sigma^{2}}}{\sigma^{2}}\geq 0.\]
Therefore, we might guess that \(n\widehat{\sigma^{2}}/\sigma^{2}\) is \(\chi^{2}(n-2)\). This is true. Moreover, \(\widehat{\alpha}\), \(\widehat{\beta}\) and \(\widehat{\sigma^{2}}\) are mutually independent.
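This distributional claim can be checked by simulation: under the model, \(n\widehat{\sigma^{2}}/\sigma^{2}\) should average about \(n-2\). A quick Monte Carlo sketch (the parameter values, design points, number of replications, and seed are arbitrary choices, not from the text):

```python
import random

random.seed(0)
n, alpha, beta, sigma = 10, 5.0, 2.0, 3.0
x = [float(i) for i in range(n)]
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)

vals = []
for _ in range(2000):
    # Generate Y_i = alpha + beta (x_i - xbar) + eps_i with normal errors.
    y = [alpha + beta * (xi - xbar) + random.gauss(0, sigma) for xi in x]
    a_hat = sum(y) / n
    b_hat = sum((xi - xbar) * yi for xi, yi in zip(x, y)) / sxx
    sse = sum((yi - a_hat - b_hat * (xi - xbar)) ** 2 for xi, yi in zip(x, y))
    vals.append(sse / sigma ** 2)   # equals n * sigma2_hat / sigma^2

mean_val = sum(vals) / len(vals)    # near n - 2 = 8 if the claim holds
```

The simulated mean is close to \(8\), consistent with a \(\chi^{2}(n-2)\) distribution, whose mean is \(n-2\).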
Suppose now that we are interested in forming a confidence interval for \(\beta\), the slope of the line. We can use the fact that
\[T_{1}=\frac{\sqrt{\sum_{i=1}^{n} (x_{i}-\bar{x})^{2}}\left (\frac{\widehat{\beta}-\beta}{\sigma}\right )}{\sqrt{\frac{n\widehat{\sigma^{2}}}{(n-2)\sigma^{2}}}}=\frac{\widehat{\beta}-\beta}{\sqrt{
\frac{n\widehat{\sigma^{2}}}{(n-2)\sum_{i=1}^{n} (x_{i}-\bar{x})^{2}}}}\]
has a \(t\)-distribution with \(n-2\) degrees of freedom. Therefore, we have
\[\mathbb{P}\left\{-t_{\alpha /2}(n-2)\leq\frac{\widehat{\beta}-\beta}{\sqrt{\frac{n\widehat{\sigma^{2}}}{(n-2)\sum_{i=1}^{n} (x_{i}-\bar{x})^{2}}}}\leq t_{\alpha /2}(n-2)\right\}=1-\alpha .\]
It follows that
\[\left [\widehat{\beta}-t_{\alpha /2}(n-2)\sqrt{\frac{n\widehat{\sigma^{2}}}{(n-2)\sum_{i=1}^{n} (x_{i}-\bar{x})^{2}}},
\widehat{\beta}+t_{\alpha /2}(n-2)\sqrt{\frac{n\widehat{\sigma^{2}}}{(n-2)\sum_{i=1}^{n} (x_{i}-\bar{x})^{2}}}\right ]\]
is a \(100(1-\alpha )\%\) confidence interval for \(\beta\). Similarly,
\[T_{2}=\frac{\sqrt{n}(\widehat{\alpha}-\alpha )/\sigma}{\sqrt{\frac{n\widehat{\sigma^{2}}}{\sigma^{2}(n-2)}}}=\frac{\widehat{\alpha}-\alpha}{\sqrt{\widehat{\sigma^{2}}/(n-2)}}\]
has a \(t\)-distribution with \(n-2\) degrees of freedom. Thus \(T_{2}\) can be used to make inferences about \(\alpha\). The fact that \(n\widehat{\sigma^{2}}/\sigma^{2}\) has a chi-square distribution with \(n-2\) degrees of freedom can be used to make inferences about the variance \(\sigma^{2}\).
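As a numerical illustration, the confidence interval for \(\beta\) can be computed from the data of Example \ref{ch5ex5}; a sketch (the tabled value \(t_{0.025}(8)=2.306\) is taken as given):

```python
x = [70, 74, 72, 68, 58, 54, 82, 64, 80, 61]
y = [77, 94, 88, 80, 71, 76, 88, 80, 90, 69]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
beta_hat = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
sse = sum((yi - ybar - beta_hat * (xi - xbar)) ** 2 for xi, yi in zip(x, y))

t975 = 2.306                                     # t_{0.025}(8), from tables
margin = t975 * (sse / ((n - 2) * sxx)) ** 0.5   # note sse = n * sigma2_hat
ci = (beta_hat - margin, beta_hat + margin)      # roughly (0.305, 1.180)
```

The interval excludes zero, which is consistent with the test of \(H_{0}:\beta =0\) carried out later in the text.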
In the previous discussion, we considered the estimation of the parameters of a very simple regression curve, namely a straight line. In addition to the point and interval estimates, we can perform tests of hypotheses about these parameters. For illustration, with the same model, we could test the hypothesis \(H_{0}:\beta =\beta_{0}\) by using the random variable
\[T_{1}=\frac{\widehat{\beta}-\beta_{0}}{\sqrt{\frac{n\widehat{\sigma^{2}}}{(n-2)\sum_{i=1}^{n} (x_{i}-\bar{x})^{2}}}}.\]
The null hypothesis along with three possible alternative hypotheses are given in the following table
\[\begin{array}{c}
\mbox{Tests about the slope of the regression line}\\
\begin{array}{ccc}
\hline H_{0} & H_{1} & \mbox{Critical Region}\\
\beta =\beta_{0} & \beta >\beta_{0} & t_{1}\geq t_{\alpha}(n-2)\\
\beta =\beta_{0} & \beta <\beta_{0} & t_{1}\leq -t_{\alpha}(n-2)\\
\beta =\beta_{0} & \beta\neq\beta_{0} & |t_{1}|\geq t_{\alpha /2}(n-2)\\
\hline\end{array}\end{array}\]
We often test the hypothesis \(H_{0}:\beta =0\). That is, we test the null hypothesis that the slope is equal to zero.
Example. Let \(x\) equal a student’s preliminary test score in a psychology course and \(y\) the same student’s score on the final examination. With \(n=10\) students, we shall test \(H_{0}:\beta =0\) against \(H_{1}:\beta\neq 0\). At the \(\alpha =0.01\) significance level, the critical region is \(|t_{1}|\geq t_{0.005}(8)=3.355\). Using the data in Example \ref{ch5ex5}, the observed value of \(T_{1}\) is
\[t_{1}=\frac{0.742-0}{\sqrt{(10\cdot 21.7709)/(8\cdot756.1)}}=\frac{0.742}{0.1897}=3.911.\]
Thus, we reject \(H_{0}\). \(\sharp\)
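The observed test statistic in this example can be reproduced in a few lines; a sketch using the same data and the quoted critical value \(t_{0.005}(8)=3.355\):

```python
x = [70, 74, 72, 68, 58, 54, 82, 64, 80, 61]
y = [77, 94, 88, 80, 71, 76, 88, 80, 90, 69]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
beta_hat = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
sse = sum((yi - ybar - beta_hat * (xi - xbar)) ** 2 for xi, yi in zip(x, y))

beta0 = 0.0
t1 = (beta_hat - beta0) / (sse / ((n - 2) * sxx)) ** 0.5   # about 3.911
reject = abs(t1) >= 3.355            # critical value t_{0.005}(8)
```

Since `t1` exceeds the critical value, \(H_{0}\) is rejected, in agreement with the text.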
There is another way of looking at tests of \(H_{0}:\beta =0\) that gives additional insight into determining how much of the variation in the data is explained by the linear model. This approach is similar to that used in the study of analysis of variance. We take the total sum of squares and partition it into the variation due to the linear model and an error sum of squares. Within the total sum of squares, \(SSTO\), we add and subtract the same quantity. Recall that \(\widehat{\alpha}=\bar{Y}\). We have
\begin{align*}
SSTO & =\sum_{i=1}^{n} (Y_{i}-\bar{Y})^{2}\\
& =\sum_{i=1}^{n} (\widehat{\beta}(x_{i}-\bar{x})+(Y_{i}-\widehat{\alpha}-\widehat{\beta}(x_{i}-\bar{x})))^{2}\\
& =\sum_{i=1}^{n} \widehat{\beta}^{2}(x_{i}-\bar{x})^{2}+\sum_{i=1}^{n}(Y_{i}-\widehat{\alpha}-\widehat{\beta}(x_{i}-\bar{x}))^{2}\\
& =SSR+SSE.
\end{align*}
The cross-product term is equal to zero. When \(\beta =0\), we have \(Y_{i}=\alpha +\varepsilon_{i}\). \(SSTO/(n-1)\) is an unbiased estimator of \(\sigma^{2}\), and \(SSTO/\sigma^{2}\) is \(\chi^{2}(n-1)\). Also,
\[\frac{SSR}{\sigma^{2}}=\frac{\widehat{\beta}^{2}\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}{\sigma^{2}}=\left (\frac{\widehat{\beta}-0}{\sigma /\sqrt{\sum_{i=1}^{n} (x_{i}-\bar{x})^{2}}}\right )^{2}.\]
We have noted that \(\widehat{\beta}\) is \(N(\beta ,\sigma^{2}/\sum_{i=1}^{n} (x_{i}-\bar{x})^{2})\). Since \(\beta =0\), we see that \(SSR/\sigma^{2}\) is the square of a standard normal random variable, which says that it is \(\chi^{2}(1)\). Since \(SSTO=SSR+SSE\) with \(SSE\geq 0\), we see, arguing as before, that \(SSE/\sigma^{2}\) is \(\chi^{2}(n-2)\) and that \(SSR/\sigma^{2}\) and \(SSE/\sigma^{2}\) are independent random variables. Also, if \(\beta =0\), we have
\begin{align*} \mathbb{E}(MSR) & =\mathbb{E}\left (\frac{SSR}{1}\right )\\ & =\mathbb{E}\left (\sigma^{2}\cdot\frac{SSR}{\sigma^{2}}\right )=\sigma^{2}\end{align*}
and
\begin{align*} \mathbb{E}(MSE) & =\mathbb{E}\left (\frac{SSE}{n-2}\right )\\ & =\mathbb{E}\left [\left (\frac{\sigma^{2}}{n-2}\right )\left (\frac{SSE}{\sigma^{2}}\right )\right ]=\sigma^{2}.\end{align*}
Therefore, both \(MSR\) and \(MSE\) provide unbiased estimates of \(\sigma^{2}\) when \(\beta =0\). \(MSR\) and \(MSE\) are called the mean square due to regression and the mean square due to error, respectively. Since \(SSR/\sigma^{2}\) and \(SSE/\sigma^{2}\) are independent chi-square random variables, it follows that
\[F=\frac{\frac{SSR/\sigma^{2}}{1}}{\frac{SSE/\sigma^{2}}{n-2}}=\frac{MSR}{MSE}\]
has an \(F(1,n-2)\) distribution under \(H_{0}:\beta =0\). If \(\beta\neq 0\), it is still true that \(MSE=SSE/(n-2)\) is an unbiased estimator of
$\sigma^{2}$. However, \(MSR\) will tend to overestimate \(\sigma^{2}\). Therefore, to test \(H_{0}:\beta =0\) against \(H_{1}:\beta\neq 0\), we can define the critical region by
\[F=\frac{MSR}{MSE}\geq F_{\alpha}(1,n-2).\]
Note that \(T_{1}^{2}=F\) with \(\beta_{0}=0\), which says that \(T_{1}^{2}\) has an \(F(1,n-2)\) distribution under \(H_{0}\). We can summarize this test using an ANOVA table given below
\[\begin{array}{lcccc}
\hline \mbox{Source} & SS & d.f. & MS & F\\
\hline \mbox{Regression} & SSR & 1 & MSR=SSR/1 & MSR/MSE\\
\mbox{Error} & SSE & n-2 & MSE=SSE/(n-2) &\\
\mbox{Total} & SSTO & n-1 &&\\
\hline\end{array}\]
Example. We continue Example \ref{ch5ex5} by constructing an ANOVA table. We have
\begin{align*}
SSR & =\widehat{\beta}^{2}\sum_{i=1}^{n} (x_{i}-\bar{x})^{2}=\left (\frac{561.1}{756.1}\right )^{2}\cdot 756.1=416.391,\\
SSE & =\sum_{i=1}^{n} (y_{i}-\widehat{y}_{i})^{2}=217.709,\\
SSTO & =\sum_{i=1}^{n} y_{i}^{2}-\frac{1}{10}\cdot\left (\sum_{i=1}^{n} y_{i}\right )^{2}=66731-66096.9=634.1.
\end{align*}
These are summarized in an ANOVA table
\[\begin{array}{lcccc}
\hline \mbox{Source} & SS & d.f. & MS & F\\
\hline \mbox{Regression} & 416.391 & 1 & 416.391 & 15.3008\\
\mbox{Error} & 217.709 & 8 & 27.2136 &\\
\mbox{Total} & 634.1 & 9 &&\\
\hline\end{array}\]
Since \(F=15.3008>11.26=F_{0.01}(1,8)\), the null hypothesis \(H_{0}\) is rejected at the \(\alpha =0.01\) significance level. Note that \(t_{1}^{2}=(3.9111)^{2}=15.2967\approx 15.3008=F\); the small difference is due to round-off error. \(\sharp\)
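The ANOVA table above can be rebuilt from the raw data; a sketch:

```python
x = [70, 74, 72, 68, 58, 54, 82, 64, 80, 61]
y = [77, 94, 88, 80, 71, 76, 88, 80, 90, 69]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
beta_hat = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx

ssto = sum((yi - ybar) ** 2 for yi in y)     # 634.1
ssr = beta_hat ** 2 * sxx                    # about 416.391
sse = ssto - ssr                             # about 217.709
msr, mse = ssr / 1, sse / (n - 2)
F = msr / mse                                # about 15.30
```

Comparing `F` with \(F_{0.01}(1,8)=11.26\) reproduces the rejection of \(H_{0}\).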
We have noted that
\[\widehat{Y}=\widehat{\alpha}+\widehat{\beta}(x-\bar{x})\]
is a point estimate for the mean of \(Y\) for some given \(x\). Therefore, we could think of this as a prediction of the value of \(Y\) for this given \(x\). But how close is \(\widehat{Y}\) to the mean of \(Y\) or to \(Y\) itself? We shall now find a confidence interval for \(\alpha +\beta (x-\bar{x})\) and a prediction interval for \(Y\), given a particular value of \(x\). To find a confidence interval for
\[\mathbb{E}(Y)=\mu (x)=\alpha +\beta (x-\bar{x}),\]
let
\[\widehat{Y}=\widehat{\alpha}+\widehat{\beta}(x-\bar{x}).\]
Recall that \(\widehat{Y}\) is a linear combination of normally and independently distributed random variables, which says that \(\widehat{Y}\) has a normal distribution. Furthermore, we have
\begin{align*} \mathbb{E}(\widehat{Y}) & =\mathbb{E}(\widehat{\alpha}+\widehat{\beta}(x-\bar{x}))\\ & =\alpha +\beta (x-\bar{x})\end{align*}
and
\begin{align*} \mbox{Var}(\widehat{Y}) & =\mbox{Var}(\widehat{\alpha}+\widehat{\beta}(x-\bar{x}))\\ & =\frac{\sigma^{2}}{n}+\frac{\sigma^{2}(x-\bar{x})^{2}}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}\\ & =\sigma^{2}\cdot\left (\frac{1}{n}+\frac{(x-\bar{x})^{2}}{\sum_{i=1}^{n} (x_{i}-\bar{x})^{2}}\right ).\end{align*}
Recall that the distribution of \(n\widehat{\sigma^{2}}/\sigma^{2}\) is \(\chi^{2}(n-2)\). Since \(\widehat{\alpha}\) and \(\widehat{\beta}\) are independent of \(\widehat{\sigma^{2}}\), we can form the \(t\)-statistic
\[T=\frac{\frac{\widehat{\alpha}+\widehat{\beta}(x-\bar{x})-(\alpha +\beta (x-\bar{x}))}{\sigma\cdot\sqrt{\frac{1}{n}+\frac{(x-\bar{x})^{2}}{\sum_{i=1}^{n} (x_{i}-\bar{x})^{2}}}}}{\sqrt{\frac{n\widehat{\sigma^{2}}}{(n-2)\sigma^{2}}}}\]
which has a \(t(n-2)\) distribution. Select \(t_{\alpha /2}(n-2)\) satisfying
\[\mathbb{P}\left\{-t_{\alpha /2}(n-2)\leq T\leq t_{\alpha /2}(n-2)\right\}=1-\alpha .\]
This becomes
\[\mathbb{P}\left\{\widehat{\alpha}+\widehat{\beta}(x-\bar{x})-ct_{\alpha /2}(n-2)\leq\alpha +\beta (x-\bar{x})\leq\widehat{\alpha}+\widehat{\beta}(x-\bar{x})+ct_{\alpha /2}(n-2)\right\}=1-\alpha ,\]
where
\[c=\sqrt{\frac{n\widehat{\sigma^{2}}}{n-2}}\cdot\sqrt{\frac{1}{n}+\frac{(x-\bar{x})^{2}}{\sum_{i=1}^{n} (x_{i}-\bar{x})^{2}}}.\]
Therefore, the endpoints for a \(100(1-\alpha )\%\) confidence interval for \(\mu (x)=\alpha +\beta (x-\bar{x})\) are
\[\widehat{\alpha}+\widehat{\beta}(x-\bar{x})\pm ct_{\alpha /2}(n-2).\]
Note that the width of this interval depends on the particular value of \(x\), since $c$ depends on \(x\).
We have used \((x_{1},y_{1}),(x_{2},y_{2}),\cdots ,(x_{n},y_{n})\) to estimate \(\alpha\) and \(\beta\). Suppose that we are given a value of \(x\), say \(x_{n+1}\). A point estimate of the corresponding value of \(Y\) is
\[\widehat{y}_{n+1}=\widehat{\alpha}+\widehat{\beta}(x_{n+1}-\bar{x}).\]
However, \(\widehat{y}_{n+1}\) is just one possible value of the random variable
\[Y_{n+1}=\alpha +\beta (x_{n+1}-\bar{x})+\varepsilon_{n+1}.\]
What can we say about possible values for \(Y_{n+1}\)? We shall now obtain a prediction interval for \(Y_{n+1}\) when \(x=x_{n+1}\) that is similar to the confidence interval for the mean of \(Y\) when \(x=x_{n+1}\). We have
\[Y_{n+1}=\alpha +\beta (x_{n+1}-\bar{x})+\varepsilon_{n+1},\]
where \(\varepsilon_{n+1}\) is \(N(0,\sigma^{2})\) and \(\bar{x}=(1/n)\sum_{i=1}^{n} x_{i}\). Now
\[W=Y_{n+1}-\widehat{\alpha}-\widehat{\beta}(x_{n+1}-\bar{x})\]
is a linear combination of normally and independently distributed random variables, so \(W\) has a normal distribution. The mean of \(W\) is
\begin{align*}
\mathbb{E}(W) & =\mathbb{E}[Y_{n+1}-\widehat{\alpha}-\widehat{\beta}(x_{n+1}-\bar{x})]\\
& =\alpha +\beta (x_{n+1}-\bar{x})-\alpha -\beta (x_{n+1}-\bar{x})=0.
\end{align*}
Since \(Y_{n+1}\), \(\widehat{\alpha}\) and \(\widehat{\beta}\) are independent, the variance of \(W\) is
\begin{align*} \mbox{Var}(W) & =\sigma^{2}+\frac{\sigma^{2}}{n}+\frac{\sigma^{2}(x_{n+1}-\bar{x})^{2}}{\sum_{i=1}^{n} (x_{i}-\bar{x})^{2}}\\ & =\sigma^{2}\cdot\left (1+\frac{1}{n}+\frac{(x_{n+1}-\bar{x})^{2}}{\sum_{i=1}^{n} (x_{i}-\bar{x})^{2}}\right ).\end{align*}
Recall that \(n\widehat{\sigma^{2}}/\sigma^{2}\) is \(\chi^{2}(n-2)\). Since \(Y_{n+1}\), \(\widehat{\alpha}\) and \(\widehat{\beta}\) are independent of \(\widehat{\sigma^{2}}\), we can form the \(t\)-statistic
\[T=\frac{\frac{Y_{n+1}-\widehat{\alpha}-\widehat{\beta}(x_{n+1}-\bar{x})}{\sigma\cdot\sqrt{1+\frac{1}{n}+\frac{(x_{n+1}-\bar{x})^{2}}
{\sum_{i=1}^{n} (x_{i}-\bar{x})^{2}}}}}{\sqrt{\frac{n\widehat{\sigma^{2}}}{(n-2)\sigma^{2}}}}\]
which has a \(t(n-2)\) distribution. Select a constant \(t_{\alpha /2}(n-2)\) satisfying
\[\mathbb{P}\left\{-t_{\alpha /2}(n-2)\leq T\leq t_{\alpha /2}(n-2)\right\}=1-\alpha .\]
Solving this inequality for \(Y_{n+1}\), we have
\[\mathbb{P}\left\{\widehat{\alpha}+\widehat{\beta}(x_{n+1}-\bar{x})-dt_{\alpha /2}(n-2)\leq Y_{n+1}\leq\widehat{\alpha}+\widehat{\beta}(x_{n+1}-\bar{x})+dt_{\alpha /2}(n-2)\right\}=1-\alpha ,\]
where
\[d=\sqrt{\frac{n\widehat{\sigma^{2}}}{n-2}}\cdot\sqrt{1+\frac{1}{n}+\frac{(x_{n+1}-\bar{x})^{2}}{\sum_{i=1}^{n} (x_{i}-\bar{x})^{2}}}.\]
Thus the endpoints for a \(100(1-\alpha )\%\) prediction interval for \(Y_{n+1}\) are
\[\widehat{\alpha}+\widehat{\beta}(x_{n+1}-\bar{x})\pm dt_{\alpha /2}(n-2).\]
Example. To find a \(95\%\) confidence interval for \(\mu (x)\) using the data in Example \ref{ch5ex5}, note that we have already found that \(\bar{x}=68.3\), \(\widehat{\alpha}=81.3\), \(\widehat{\beta}=561.1/756.1=0.7421\), and \(\widehat{\sigma^{2}}=21.7709\). We also need
\[\sum_{i=1}^{n} (x_{i}-\bar{x})^{2}=\sum_{i=1}^{n} x_{i}^{2}-\frac{1}{n}\cdot
\left (\sum_{i=1}^{n} x_{i}\right )^{2}=47405-\frac{683^{2}}{10}=756.1.\]
For \(95\%\) confidence, we have \(t_{0.025}(8)=2.306\). When \(x=60\), the endpoints for a confidence interval for \(\mu (60)\) are
\[81.3+0.742(60-68.3)\pm 2.306\cdot\left (\sqrt{\frac{10\cdot 21.7709}{8}}
\cdot\sqrt{\frac{1}{10}+\frac{(60-68.3)^{2}}{756.1}}\right )\]
or \(75.1406\pm 5.2589\). Similarly, when \(x=70\) the endpoints for a \(95\%\) confidence interval for \(\mu (70)\) are \(82.5616\pm 3.8761\).
Note that the lengths of these intervals depend on the particular value of \(x\). The endpoints for a \(95\%\) prediction interval for \(Y\) when \(x=60\) are
\[81.3+0.742(60-68.3)\pm 2.306\cdot\left (\sqrt{\frac{10\cdot 21.7709}{8}}
\cdot\sqrt{1.1+\frac{(60-68.3)^{2}}{756.1}}\right )\]
or \(75.1406\pm 13.1289\). Note that this interval is much wider than that for \(\mu (60)\).
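Both intervals in this example can be reproduced from the data; a sketch (again taking \(t_{0.025}(8)=2.306\) from tables):

```python
x = [70, 74, 72, 68, 58, 54, 82, 64, 80, 61]
y = [77, 94, 88, 80, 71, 76, 88, 80, 90, 69]
n = len(x)
xbar, alpha_hat = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
beta_hat = sum((xi - xbar) * (yi - alpha_hat) for xi, yi in zip(x, y)) / sxx
sse = sum((yi - alpha_hat - beta_hat * (xi - xbar)) ** 2 for xi, yi in zip(x, y))

t975, x0 = 2.306, 60
yhat = alpha_hat + beta_hat * (x0 - xbar)            # about 75.14
s = (sse / (n - 2)) ** 0.5                           # sqrt(n*sigma2_hat/(n-2))
c = s * (1 / n + (x0 - xbar) ** 2 / sxx) ** 0.5      # for the mean mu(60)
d = s * (1 + 1 / n + (x0 - xbar) ** 2 / sxx) ** 0.5  # for a new Y at x = 60
ci_mean = (yhat - t975 * c, yhat + t975 * c)         # about 75.14 +/- 5.26
pi_new = (yhat - t975 * d, yhat + t975 * d)          # about 75.14 +/- 13.13
```

The prediction interval is wider than the confidence interval for the mean because \(d\) contains the extra term \(1\) under the square root, accounting for the variability of the new observation itself.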
We consider a basic regression model where there is only one independent variable and the regression function is linear. The model can be expressed as follows:
\begin{equation}{\label{let1e1}}\tag{4}
Y_{i}=\beta_{0}+\beta_{1}X_{i}+\varepsilon_{i},
\end{equation}
for \(i=1,\cdots ,n\).
- \(Y_{i}\) is the value of the response variable in the \(i\)th trial.
- \(\beta_{0}\) and \(\beta_{1}\) are parameters.
- \(X_{i}\) is a known constant, i.e., the value of the independent variable in the \(i\)th trial.
- \(\varepsilon_{i}\) is a random error term with mean \(\mathbb{E}(\varepsilon_{i})=0\) and variance \(\mathbb{V}(\varepsilon_{i})=\sigma^{2}\).
- \(\varepsilon_{i}\) and \(\varepsilon_{j}\) are uncorrelated such that the covariance \(\mbox{Cov}(\varepsilon_{i},\varepsilon_{j})=0\) for all \(i\neq j\).
The features of this model are described below.
(i) The observed value of \(Y\) in the \(i\)th trial is the sum of two components, i.e., the constant term \(\beta_{0}+\beta_{1}X_{i}\) and the random term \(\varepsilon_{i}\). Therefore \(Y_{i}\) is a random variable.
(ii) Since \(\mathbb{E}(\varepsilon_{i})=0\), we have
\begin{align*} \mathbb{E}(Y_{i}) & =\mathbb{E}(\beta_{0}+\beta_{1}X_{i}+\varepsilon_{i})\\ & =\beta_{0}+\beta_{1}X_{i}+\mathbb{E}(\varepsilon_{i})\\ & =\beta_{0}+\beta_{1}X_{i}.\end{align*}
Therefore, the regression function for model (\ref{let1e1}) is
\[\mathbb{E}(Y)=\beta_{0}+\beta_{1}X.\]
(iii) The observed value of \(Y\) in the \(i\)th trial exceeds or falls short of the value of the regression function by the error term amount \(\varepsilon_{i}\).
(iv) Since \(\mathbb{V}(\varepsilon_{i})=\sigma^{2}\), we have
\begin{align*} \mathbb{V}(Y_{i}) & =\mathbb{V}(\beta_{0}+\beta_{1}X_{i}+\varepsilon_{i})\\ & =\mathbb{V}(\varepsilon_{i})=\sigma^{2}.\end{align*}
(v) Since the error terms \(\varepsilon_{i}\) and \(\varepsilon_{j}\) are uncorrelated, the responses \(Y_{i}\) and \(Y_{j}\) are also uncorrelated.
Ordinarily, we do not know the values of the regression parameters \(\beta_{0}\) and \(\beta_{1}\) in regression model (\ref{let1e1}). We need to estimate them from relevant data using the method of least squares. For each sample observation \((X_{i},Y_{i})\), the method of least squares considers the deviation of \(Y_{i}\) from its expected value
\[Y_{i}-(\beta_{0}+\beta_{1}X_{i}).\]
We consider the sum of the \(n\) squared deviations
\[Q=\sum_{i=1}^{n} (Y_{i}-\beta_{0}-\beta_{1}X_{i})^{2}.\]
The purpose is to minimize \(Q\) for the given sample observations \((X_{i},Y_{i})\). The estimators of \(\beta_{0}\) and \(\beta_{1}\) are the values of \(b_{0}\) and \(b_{1}\) which minimize \(Q\). We set the following partial derivatives to be zero
\begin{align*}
\frac{\partial Q}{\partial\beta_{0}} & =-2\sum_{i=1}^{n}(Y_{i}-\beta_{0}-\beta_{1}X_{i})=0\\
\frac{\partial Q}{\partial\beta_{1}} & =-2\sum_{i=1}^{n}X_{i} (Y_{i}-\beta_{0}-\beta_{1}X_{i})=0
\end{align*}
We obtain
\begin{align*}
b_{1} & =\frac{\sum X_{i}Y_{i}-\frac{\sum X_{i}\sum Y_{i}}{n}} {\sum X_{i}^{2}-\frac{(\sum X_{i})^{2}}{n}}\\ & =\frac{\sum (X_{i}-\bar{X})(Y_{i}-\bar{Y})}{\sum (X_{i}-\bar{X})^{2}}\\
b_{0} & =\frac{1}{n}\left (\sum Y_{i}-b_{1}\sum X_{i}\right )=\bar{Y}-b_{1}\bar{X},
\end{align*}
where \(\bar{X}=\frac{1}{n}\sum X_{i}\) and \(\bar{Y}=\frac{1}{n}\sum Y_{i}\). A test of the second partial derivatives will show that a minimum is obtained with the least squares estimators \(b_{0}\) and \(b_{1}\). The estimated regression function is
\[\hat{Y}=b_{0}+b_{1}X,\]
where \(\hat{Y}\) is the value of the estimated regression function at the level \(X\) of the independent variable.
\begin{equation}{\label{let1ex1}}\tag{5}\mbox{}\end{equation}
Example \ref{let1ex1}. We consider the following data
\[\begin{array}{ccc}
\hline \mbox{Trial} & \mbox{Advertising Expenditures} & \mbox{Sales}\\
i & X_{i} & Y_{i}\\
\hline 1 & 30 & 73\\
2 & 20 & 50\\
3 & 60 & 128\\
4 & 80 & 170\\
5 & 40 & 87\\
6 & 50 & 108\\
7 & 60 & 135\\
8 & 30 & 69\\
9 & 70 & 148\\
10 & 60 & 132\\
\hline\end{array}\]
We have \(\sum Y_{i}=1100\), \(\sum X_{i}=500\), \(\sum X_{i}Y_{i}=61800\) and \(\sum X_{i}^{2}=28400\). Therefore, we obtain \(b_{1}=2\) and \(b_{0}=10\). The estimated regression function is \(\hat{Y}=10+2X\). If \(X=55\), the point estimate is \(\hat{Y}=10+2\cdot 55=120\). Therefore, the estimate of mean sales at this level of advertising expenditure is \(120\): if many trials were run at this expenditure level, the mean sales over those trials would be about \(120\). The sales in any one trial are likely to fall above or below the mean response \(120\) because of the inherent variability in the system, as represented by the error term in the model. \(\sharp\)
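The least squares computations in this example can be verified directly; a sketch:

```python
X = [30, 20, 60, 80, 40, 50, 60, 30, 70, 60]
Y = [73, 50, 128, 170, 87, 108, 135, 69, 148, 132]
n = len(X)

sx, sy = sum(X), sum(Y)                       # 500, 1100
sxy = sum(xi * yi for xi, yi in zip(X, Y))    # 61800
sxx = sum(xi * xi for xi in X)                # 28400

b1 = (sxy - sx * sy / n) / (sxx - sx ** 2 / n)   # 2.0
b0 = (sy - b1 * sx) / n                          # 10.0
estimate_at_55 = b0 + b1 * 55                    # 120.0
```

These are exactly the values \(b_{1}=2\), \(b_{0}=10\), and \(\hat{Y}=120\) obtained above.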
The \(i\)th residual is the difference between the observed value \(Y_{i}\) and the corresponding fitted value \(\hat{Y}_{i}\). Denoting this residual by \(e_{i}\), we can write
\[e_{i}=Y_{i}-\hat{Y}_{i}=Y_{i}-b_{0}-b_{1}X_{i}.\]
We need to distinguish between the model error term \(\varepsilon_{i}=Y_{i}-\mathbb{E}(Y_{i})\) and the residual \(e_{i}=Y_{i}-\hat{Y}_{i}\). Then we have the following properties.
(i) The sum of the residuals is zero, i.e., \(\sum e_{i}=0\).
(ii) The sum of the observed values \(Y_{i}\) equals the sum of the fitted values \(\hat{Y}_{i}\), i.e.,
\begin{equation}{\label{let1e10}}\tag{6}
\sum_{i=1}^{n} Y_{i}=\sum_{i=1}^{n} \hat{Y}_{i}.
\end{equation}
It follows that the mean of the \(\hat{Y}_{i}\) is the same as the mean of the \(Y_{i}\), namely, \(\bar{Y}\).
(iii) The sum of the weighted residuals is zero when the residual in the \(i\)th trial is weighted by the level of the independent variable in the \(i\)th trial, i.e., \(\sum_{i=1}^{n} X_{i}e_{i}=0.\)
(iv) The sum of the weighted residuals is zero when the residual in the \(i\)th trial is weighted by the fitted value of the response variable for the \(i\)th trial, i.e., \(\sum_{i=1}^{n} \hat{Y}_{i}e_{i}=0.\)
(v) The regression line always goes through the point \((\bar{X},\bar{Y})\).
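Properties (i)-(iv) can be verified numerically for the fitted line \(\hat{Y}=10+2X\) of the advertising example above; a short Python sketch:

```python
# Numerical check of residual properties (i)-(iv) for the fitted
# line Yhat = 10 + 2X of the advertising example.
X = [30, 20, 60, 80, 40, 50, 60, 30, 70, 60]
Y = [73, 50, 128, 170, 87, 108, 135, 69, 148, 132]
b0, b1 = 10.0, 2.0

Yhat = [b0 + b1 * x for x in X]          # fitted values
e = [y - yh for y, yh in zip(Y, Yhat)]   # residuals

print(sum(e))                                 # (i)   0.0
print(sum(Y), sum(Yhat))                      # (ii)  1100 1100.0
print(sum(x * r for x, r in zip(X, e)))       # (iii) 0.0
print(sum(yh * r for yh, r in zip(Yhat, e)))  # (iv)  0.0
```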
The variance \(\sigma^{2}\) of the error terms \(\varepsilon_{i}\) in the regression model needs to be estimated for a variety of purposes. Frequently, we would like to obtain an indication of the variability of the probability distributions of \(Y\). We define the error sum of squares or residual sum of squares \(SSE\) by
\[SSE=\sum_{i=1}^{n} e_{i}^{2}=\sum_{i=1}^{n} (Y_{i}-\hat{Y}_{i})^{2}= \sum_{i=1}^{n} (Y_{i}-b_{0}-b_{1}X_{i})^{2}.\]
The error sum of squares \(SSE\) has \(n-2\) degrees of freedom associated with it. Two degrees of freedom are lost because both \(\beta_{0}\) and \(\beta_{1}\) have to be estimated in obtaining the estimated means \(\hat{Y}_{i}\). Hence, the error mean square or residual mean square, denoted by \(MSE\), is defined by
\begin{align*} MSE & =\frac{SSE}{n-2}\\ & =\frac{\sum_{i=1}^{n} e_{i}^{2}}{n-2}\\ & =\frac{\sum_{i=1}^{n} (Y_{i}-\hat{Y}_{i})^{2}}{n-2}\\ & =\frac {\sum_{i=1}^{n} (Y_{i}-b_{0}-b_{1}X_{i})^{2}}{n-2}.\end{align*}
It can be shown that \(MSE\) is an unbiased estimator of \(\sigma^{2}\) for the regression model with \(\mathbb{E}(MSE)=\sigma^{2}\). An estimator of the standard deviation \(\sigma\) is simply the square root of \(MSE\).
There are three alternative computational formulas for \(SSE\) described below:
\begin{equation}{\label{let1e2}}\tag{7}
SSE=\sum Y_{i}^{2}-b_{0}\sum Y_{i}-b_{1}\sum X_{i}Y_{i}
\end{equation}
\begin{equation}{\label{let1e3}}\tag{8}
SSE=\sum (Y_{i}-\bar{Y})^{2}-\frac{\left [\sum (X_{i}-\bar{X})(Y_{i}-\bar{Y})\right ]^{2}}{\sum (X_{i}-\bar{X})^{2}}
\end{equation}
\begin{equation}{\label{let1e4}}\tag{9}
SSE=\left [\sum Y_{i}^{2}-\frac{\left (\sum Y_{i}\right )^{2}}{n}\right ]-\frac{\left (\sum X_{i}Y_{i}-{\displaystyle \frac{\sum X_{i}\sum Y_{i}}{n}}\right )^{2}}{\sum X_{i}^{2}-{\displaystyle \frac{\left (\sum X_{i}\right )^{2}}{n}}}.
\end{equation}
Formula (\ref{let1e2}) is useful when \(b_{0}\) and \(b_{1}\) have already been calculated. Otherwise, (\ref{let1e3}) and (\ref{let1e4}) are more direct.
Returning to Example \ref{let1ex1}, we obtain \(SSE=60\) and \(MSE=60/8=7.5\). The point estimate of \(\sigma\) is \(\sqrt{7.5}=2.74\).
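The agreement of the computational formulas for \(SSE\), and the resulting \(MSE\), can be checked on the advertising data; a Python sketch:

```python
# Check that formulas (7) and (8) for SSE agree on the advertising
# data, and form MSE = SSE / (n - 2).
X = [30, 20, 60, 80, 40, 50, 60, 30, 70, 60]
Y = [73, 50, 128, 170, 87, 108, 135, 69, 148, 132]
n = len(X)
b0, b1 = 10.0, 2.0
xbar, ybar = sum(X) / n, sum(Y) / n

# formula (7): uses the fitted coefficients
sse7 = sum(y * y for y in Y) - b0 * sum(Y) - b1 * sum(x * y for x, y in zip(X, Y))

# formula (8): uses centered sums only
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(X, Y))
sxx = sum((x - xbar) ** 2 for x in X)
syy = sum((y - ybar) ** 2 for y in Y)
sse8 = syy - sxy**2 / sxx

print(sse7, sse8, sse7 / (n - 2))  # 60.0 60.0 7.5
```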
For the following discussions, we assume that the error term \(\varepsilon_{i}\) is normally distributed, i.e. \(\varepsilon_{i}\) are independent \(N(0,\sigma^{2})\). Then \(Y_{i}\) are independent normal random variables with mean \(\mathbb{E}(Y_{i})=\beta_{0}+\beta_{1}X_{i}\) and variance \(\sigma^{2}\). It can be shown that \(b_{0}\) and \(b_{1}\) are unbiased estimators, i.e. \(\mathbb{E}(b_{0})=\beta_{0}\) and \(\mathbb{E}(b_{1})=\beta_{1}\). We also have
\begin{align*} V(b_{0}) & =\frac{\sigma^{2}\sum X_{i}^{2}}{n\sum(X_{i}-\bar{X})^{2}}\\ &=\sigma^{2}\left [\frac{1}{n}+\frac{\bar{X}^{2}}{\sum (X_{i}-\bar{X})^{2}}\right ]\end{align*}
and
\[V(b_{1})=\frac{\sigma^{2}}{\sum (X_{i}-\bar{X})^{2}}.\]
The estimator \(b_{1}\) can be rewritten as \(b_{1}=\sum k_{i}Y_{i}\), where
\[k_{i}=\frac{X_{i}-\bar{X}}{\sum (X_{i}-\bar{X})^{2}}.\]
We see that the \(k_{i}\) are constant, since the \(X_{i}\) are fixed. Therefore \(b_{1}\) is a linear combination of the \(Y_{i}\), where the coefficients are solely a function of the fixed \(X_{i}\). It can be shown that \(b_{1}\) has minimum variance among all unbiased linear estimators of the form \(\hat{\beta}_{1}=\sum c_{i}Y_{i}\), where the \(c_{i}\) are arbitrary constants. Since the \(Y_{i}\) are independently and normally distributed, it follows that \(b_{1}\), which is a linear combination of independent normal random variables, is normally distributed. Since
\[\mathbb{V}(b_{1})=\frac{\sigma^{2}}{\sum (X_{i}-\bar{X})^{2}},\]
we can estimate the variance of the sampling distribution of \(b_{1}\) by replacing the parameter \(\sigma^{2}\) with the unbiased estimator \(MSE\). That is to say, the point estimator
\[\hat{\sigma}^{2}(b_{1})=\frac{MSE}{\sum (X_{i}-\bar{X})^{2}}\]
is an unbiased estimator of \(\mathbb{V}(b_{1})\).
The point estimator \(b_{0}\) is given by \(b_{0}=\bar{Y}-b_{1}\bar{X}\). The sampling distribution of \(b_{0}\) is normal with mean and variance given by \(\mathbb{E}(b_{0})=\beta_{0}\) and
\begin{align*} \mathbb{V}(b_{0}) & =\frac{\sigma^{2}\sum X_{i}^{2}}{n\sum (X_{i}-\bar{X})^{2}}\\ & =\sigma^{2}\left [\frac{1}{n}+\frac{\bar{X}^{2}}{\sum (X_{i}-\bar{X})^{2}}\right ]\end{align*}
The normality of the sampling distribution of \(b_{0}\) follows, since \(b_{0}\) is also a linear combination of the observations \(Y_{i}\). An estimator of \(V(b_{0})\) is obtained by replacing \(\sigma^{2}\) by its point estimator \(MSE\)
\begin{align*} \hat{\sigma}^{2}(b_{0}) & =\frac{MSE\sum X_{i}^{2}}{n\sum(X_{i}-\bar{X})^{2}}\\ & =MSE\left [\frac{1}{n}+\frac{\bar{X}^{2}}{\sum (X_{i}-\bar{X})^{2}}\right ].\end{align*}
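Plugging \(MSE=7.5\) from the advertising example into these formulas gives the estimated standard errors of the coefficients; a Python sketch:

```python
# Estimated variances and standard errors of b1 and b0 for the
# advertising example, plugging in MSE = 7.5.
import math

X = [30, 20, 60, 80, 40, 50, 60, 30, 70, 60]
n, mse = len(X), 7.5
xbar = sum(X) / n
sxx = sum((x - xbar) ** 2 for x in X)   # sum (Xi - Xbar)^2 = 3400

var_b1 = mse / sxx
var_b0 = mse * (1 / n + xbar**2 / sxx)
print(round(math.sqrt(var_b1), 5))  # 0.04697
print(round(math.sqrt(var_b0), 4))  # 2.5029
```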
Since the probability distribution of the error term is specified, estimators of the parameters \(\beta_{0}\), \(\beta_{1}\) and \(\sigma^{2}\) can be obtained by the method of maximum
likelihood. The likelihood function is given by
\begin{align*} L(\beta_{0},\beta_{1},\sigma^{2}) & =\prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma^{2}}}\exp\left [-\frac{1}{2\sigma^{2}}(Y_{i}-\beta_{0}-\beta_{1}X_{i})^{2}\right ]\\
& =\frac{1}{(2\pi\sigma^{2})^{n/2}}\exp\left [-\frac{1}{2\sigma^{2}}\sum_{i=1}^{n} (Y_{i}-\beta_{0}-\beta_{1}X_{i})^{2}\right ].\end{align*}
The values of \(\beta_{0}\), \(\beta_{1}\) and \(\sigma^{2}\) that maximize the likelihood function \(L(\beta_{0},\beta_{1},\sigma^{2})\) are the maximum likelihood estimators (MLE). These are given by
\[\begin{array}{cc}
\hline \mbox{Parameter} & \mbox{Maximum Likelihood Estimator}\\
\hline \beta_{0} & b_{0}\\
\beta_{1} & b_{1}\\
\sigma^{2} & {\displaystyle \hat{\sigma}^{2}=\frac{\sum (Y_{i}-\hat{Y}_{i})^{2}}{n}}\\
\hline\end{array}\]
Therefore, the MLE of \(\beta_{0}\) and \(\beta_{1}\) are the same estimators as provided by the method of least squares. The MLE \(\hat{\sigma}^{2}\) is biased. Therefore, the unbiased estimator \(MSE\) is used. However we have the following relation between \(MSE\) and \(\hat{\sigma}^{2}\):
\[MSE=\frac{n}{n-2}\hat{\sigma}^{2}.\]
It can be shown that the \(\hat{Y}\) is also an unbiased estimator of \(\mathbb{E}(Y)\) with the minimum variance in the class of unbiased linear estimators.
Since \(b_{1}\) is normally distributed, we know that the standardized statistic \((b_{1}-\beta_{1})/\sqrt{\mathbb{V}(b_{1})}\) is a standard normal random variable. Usually, we need to estimate \(\sqrt{\mathbb{V}(b_{1})}\) by \(\hat{\sigma} (b_{1})\). Therefore, we have to know the distribution of \((b_{1}-\beta_{1})/\hat{\sigma}(b_{1})\). Let us rewrite \((b_{1}-\beta_{1})/\hat{\sigma}(b_{1})\) as follows
\[\frac{(b_{1}-\beta_{1})}{\sqrt{\mathbb{V}(b_{1})}}\Big/\frac{\hat{\sigma}(b_{1})}{\sqrt{\mathbb{V}(b_{1})}}.\]
The numerator is a standard normal random variable \(Z\). The distribution of the denominator can be seen by considering
\begin{align*} \frac{\hat{\sigma}^{2}(b_{1})}{\mathbb{V}(b_{1})} & =\frac{MSE/\sum (X_{i}-\bar{X})^{2}}{\sigma^{2}/\sum (X_{i}-\bar{X})^{2}}\\ & =\frac{MSE}{\sigma^{2}}\\ & =\frac{SSE}{\sigma^{2}(n-2)}.\end{align*}
We know that \(SSE/\sigma^{2}\) is distributed as \(\chi_{n-2}^{2}\), and is independent of \(b_{0}\) and \(b_{1}\). Then \(Z\) and \(\chi_{n-2}^{2}\) are independent, since \(Z\) is a function of \(b_{0}\) and \(b_{1}\) only. We conclude that \((b_{1}-\beta_{1})/\hat{\sigma}(b_{1})\) has a \(t\)-distribution with \(n-2\) degrees of freedom. (Two parameters \(\beta_{0}\) and \(\beta_{1}\) need to be estimated for the regression model, so two degrees of freedom are lost here.)
Since \((b_{1}-\beta_{1})/\hat{\sigma}(b_{1})\) follows a \(t\)-distribution, we can find a confidence interval for \(\beta_{1}\). We have
\[\mathbb{P}\left (t_{\alpha /2;n-2}\leq (b_{1}-\beta_{1})/\hat{\sigma}(b_{1})\leq t_{1-\alpha /2;n-2}\right )=1-\alpha.\]
Because of the symmetry of the \(t\)-distribution, it follows that \(t_{\alpha /2;n-2}=-t_{1-\alpha /2;n-2}\). Then, we obtain
\[\mathbb{P}\left (b_{1}-t_{1-\alpha /2;n-2}\hat{\sigma}(b_{1})\leq\beta_{1}\leq b_{1}+t_{1-\alpha /2;n-2}\hat{\sigma}(b_{1})\right )=1-\alpha.\]
The \(1-\alpha\) confidence interval for \(\beta_{1}\) is given by
\[b_{1}\pm t_{1-\alpha /2;n-2}\hat{\sigma}(b_{1}).\]
Similarly, \((b_{0}-\beta_{0})/\hat{\sigma}(b_{0})\) is also distributed as \(t_{n-2}\). The \(1-\alpha\) confidence interval for \(\beta_{0}\) is given by
\[b_{0}\pm t_{1-\alpha /2;n-2}\hat{\sigma}(b_{0}).\]
We consider testing the following hypotheses
\[\begin{array}{l}
H_{0}:\beta_{1}=0\\ H_{a}:\beta_{1}\neq 0
\end{array}\]
The test statistic is given by
\[t^{*}=\frac{b_{1}}{\hat{\sigma}(b_{1})}.\]
The decision rule with this test statistic at the level of significance \(\alpha\) is given by
\[\begin{array}{l}
\mbox{If }|t^{*}|\leq t_{1-\alpha/2;n-2},\mbox{ we conclude }H_{0}\\ \mbox{If }|t^{*}|>t_{1-\alpha/2;n-2},\mbox{ we conclude }H_{a}.
\end{array}\]
Example. Continued from Example \ref{let1ex1} with \(\alpha =0.05\), where \(b_{1}=2\) and \(\hat{\sigma}(b_{1})=0.04697\), we require \(t_{0.975;8}=2.306\). Then, the decision rule is given by
\[\begin{array}{l}
\mbox{If }|t^{*}|\leq 2.306,\mbox{ we conclude }H_{0}\\ \mbox{If }|t^{*}|>2.306,\mbox{ we conclude }H_{a}.
\end{array}\]
Since \(|t^{*}|=|2/0.04697|=42.58>2.306\), we conclude \(H_{a}\), i.e., \(\beta_{1}\neq 0\), which says that there is a linear relation between the advertising expenditures and sales. \(\sharp\)
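The computation of this test can be sketched in Python; the critical value \(t_{0.975;8}=2.306\) is taken from a \(t\) table:

```python
# Two-sided t test of H0: beta1 = 0 at alpha = 0.05 for the
# advertising example; t_{0.975;8} = 2.306 from a t table.
import math

b1, mse, sxx = 2.0, 7.5, 3400.0
se_b1 = math.sqrt(mse / sxx)     # estimated sigma(b1)
t_star = b1 / se_b1
t_crit = 2.306                   # t_{0.975;8}

print(round(t_star, 2))          # 42.58
print(abs(t_star) > t_crit)      # True -> conclude Ha
```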
Now, we consider testing the following hypotheses
\[\begin{array}{l}
H_{0}:\beta_{1}\leq 0\\ H_{a}:\beta_{1}>0
\end{array}\]
The decision rule based on the above statistic \(t^{*}\) is given by
\[\begin{array}{l}
\mbox{If }t^{*}\leq t_{1-\alpha ;n-2},\mbox{ we conclude }H_{0}\\ \mbox{If }t^{*}>t_{1-\alpha ;n-2},\mbox{ we conclude }H_{a}.
\end{array}\]
For \(\alpha =0.05\), we require \(t_{0.95;8}=1.860\). Since \(t^{*}=42.58>1.860\), we would conclude that \(\beta_{1}\) is positive.
Since \(\mathbb{E}(Y)=\beta_{0}+\beta_{1}X\), the point estimator of \(\mathbb{E}(Y)\) (the mean response) is \(\hat{Y}=b_{0}+b_{1}X\). When \(X=X_{h}\) is given, the mean response is \(\mathbb{E}(Y_{h})=\beta_{0}+\beta_{1}X_{h}\). Therefore, the point estimator of \(\mathbb{E}(Y_{h})\) is \(\hat{Y}_{h}=b_{0}+b_{1}X_{h}\). Since \(\hat{Y}_{h}\) is a linear combination of the observations \(Y_{i}\), the sampling distribution of \(\hat{Y}_{h}\) is normal with mean
\begin{align*} \mathbb{E}(\hat{Y}_{h}) & =\mathbb{E}(b_{0}+b_{1}X_{h})\\ & =\beta_{0}+\beta_{1}X_{h}\\ & =\mathbb{E}(Y_{h})\end{align*}
and variance
\[\mathbb{V}(\hat{Y}_{h})=\sigma^{2}\left [\frac{1}{n}+\frac{(X_{h}-\bar{X})^{2}}{\sum (X_{i}-\bar{X})^{2}}\right ].\]
When \(MSE\) is substituted for \(\sigma^{2}\), we obtain \(\hat{\sigma}^{2} (\hat{Y}_{h})\). The estimated variance of \(\hat{Y}_{h}\) is given by
\[\hat{\sigma}^{2}(\hat{Y}_{h})=MSE\left [\frac{1}{n}+\frac{(X_{h}-\bar{X})^{2}}{\sum (X_{i}-\bar{X})^{2}}\right ].\]
It can be shown that
\[\frac{\hat{Y}_{h}-\mathbb{E}(Y_{h})}{\hat{\sigma}(\hat{Y}_{h})}\]
is distributed as \(t_{n-2}\). Then, the \(1-\alpha\) confidence interval for \(\mathbb{E}(Y_{h})\) is given by
\[\hat{Y}_{h}\pm t_{1-\alpha /2;n-2}\hat{\sigma}(\hat{Y}_{h}).\]
Example. Returning to Example \ref{let1ex1}, let us find a \(90\%\) confidence interval for \(\mathbb{E}(Y_{h})\) when the advertising expenditure is \(X_{h}=55\). The point estimate is \(\hat{Y}_{h}=10+2*55=120\). Since
\[\hat{\sigma}^{2}(\hat{Y}_{h})=7.5\left [\frac{1}{10}+\frac{(55-50)^{2}}{3400}\right ]=0.80515,\]
the estimated standard deviation \(\hat{\sigma}(\hat{Y}_{h})=0.8973\). For a \(90\%\) confidence interval, we require \(t_{0.95;8}=1.86\). The confidence interval is given by
\[120-1.86*0.8973\leq\mathbb{E}(Y_{h})\leq 120+1.86*0.8973,\mbox{ i.e., }118.3\leq\mathbb{E}(Y_{h})\leq 121.7.\]
We conclude, with confidence coefficient \(0.9\), that the mean sales when the advertising expenditure is \(X_{h}=55\) lie somewhere between \(118.3\) and \(121.7\). \(\sharp\)
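The interval computation can be sketched in Python; the critical value \(t_{0.95;8}=1.86\) is taken from a \(t\) table:

```python
# 90% confidence interval for E(Y_h) at X_h = 55, following the
# worked example; t_{0.95;8} = 1.86 from a t table.
import math

n, mse, xbar, sxx = 10, 7.5, 50.0, 3400.0
b0, b1, xh, t = 10.0, 2.0, 55.0, 1.86

yh = b0 + b1 * xh                                    # point estimate 120
se = math.sqrt(mse * (1 / n + (xh - xbar) ** 2 / sxx))
print(round(yh - t * se, 1), round(yh + t * se, 1))  # 118.3 121.7
```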
Next, we are going to use the analysis of variance (ANOVA) approach to regression analysis. The total sum of squares, denoted by \(SSTO\), is defined by
\[SSTO=\sum (Y_{i}-\bar{Y})^{2},\]
and the regression sum of squares, denoted by \(SSR\), is defined by
\[SSR=\sum (\hat{Y}_{i}-\bar{Y})^{2}.\]
The deviation \(\hat{Y}_{i}-\bar{Y}\) is simply the difference between the fitted value and the mean of the fitted values \(\bar{Y}\). (Note that, from (\ref{let1e10}), the mean of the fitted values \(\hat{Y}_{i}\) is \(\bar{Y}\).) Since
\[Y_{i}-\bar{Y}=(\hat{Y}_{i}-\bar{Y})+(Y_{i}-\hat{Y}_{i}),\]
it can be shown
\[\sum (Y_{i}-\bar{Y})^{2}=\sum (\hat{Y}_{i}-\bar{Y})^{2}+ \sum (Y_{i}-\hat{Y}_{i})^{2},\]
which shows
\[SSTO=SSR+SSE.\]
The equivalent computational formulas are given by
\begin{align*} SSTO & =\sum Y_{i}^{2}-\frac{\left (\sum Y_{i}\right )^{2}}{n}\\ & =\sum Y_{i}^{2}-n\bar{Y}^{2}\end{align*}
and
\begin{align*} SSR & =b_{1}\left (\sum X_{i}Y_{i}-\frac{\sum X_{i}\sum Y_{i}}{n}\right )\\
& =b_{1}\left [\sum (X_{i}-\bar{X})(Y_{i}-\bar{Y})\right ]\\
& =b_{1}^{2}\sum (X_{i}-\bar{X})^{2}.\end{align*}
We have \(n-1\) degrees of freedom associated with \(SSTO\). One degree of freedom is lost because the deviations \(Y_{i}-\bar{Y}\) are not independent in that they must sum to zero. Equivalently, one degree of freedom is lost because the sample mean \(\bar{Y}\) is used to estimate the population mean. Recall that \(SSE\) has \(n-2\) degrees of freedom associated with it. Two degrees of freedom are lost because the two parameters \(\beta_{0}\) and \(\beta_{1}\) were estimated in obtaining the fitted values \(\hat{Y}_{i}\). We also see that \(SSR\) has one degree of freedom associated with it. There are two parameters in the regression equation, but the deviations \(\hat{Y}_{i}-\bar{Y}\) are not independent because they must sum to zero. Therefore, one degree of freedom is lost. A sum of squares divided by its associated degrees of freedom is called a mean square (abbreviated \(MS\)). For instance, an ordinary sample variance \(\sum (X_{i}-\bar{X})^{2}/(n-1)\) is a mean square, since a sum of squares \(\sum (X_{i}-\bar{X})^{2}\) is divided by its associated degrees of freedom \(n-1\). We are interested in the regression mean square, denoted by \(MSR\), which is defined by
\[MSR=\frac{SSR}{1}=SSR.\]
and in the error mean square, denoted by \(MSE\), which is defined by
\[MSE=\frac{SSE}{n-2}.\]
Then we have the following ANOVA table for simple linear regression
\[\begin{array}{lcccc}\hline
\hline \mbox{Source of Variation} & SS & df & MS & \mathbb{E}(MS)\\
\hline \mbox{Regression} & SSR=\sum (\hat{Y}_{i}-\bar{Y})^{2} & 1 & {\displaystyle MSR=\frac{SSR}{1}} & \sigma^{2}+\beta_{1}^{2}\sum (X_{i}-\bar{X})^{2}\\
\mbox{Error} & SSE=\sum (Y_{i}-\hat{Y}_{i})^{2} & n-2 & {\displaystyle MSE=\frac{SSE}{n-2}} & \sigma^{2}\\ \hline \mbox{Total} & SSTO=\sum (Y_{i}-\bar{Y})^{2} & n-1 &&\\
\hline\end{array}\]
Example. Continued from Example \ref{let1ex1}, we have the following ANOVA table
\[\begin{array}{lccc}\hline
\hline \mbox{Source of Variation} & SS & df & MS\\
\hline \mbox{Regression} & 13600 & 1 & 13600\\
\mbox{Error} & 60 & 8 & 7.5\\
\hline \mbox{Total} & 13660 & 9 &\\
\hline\end{array}\]
Now, we consider testing the following hypotheses
\[\begin{array}{l}
H_{0}:\beta_{1}=0\\ H_{a}:\beta_{1}\neq 0
\end{array}\]
using the ANOVA approach. The test statistic for the ANOVA approach is denoted by \(F^{*}\), which is defined by
\[F^{*}=\frac{MSR}{MSE}.\]
The following theorem is useful.
Theorem. (Cochran’s Theorem). If all \(n\) observations \(Y_{i}\) come from the same normal distribution with mean \(\mu\) and variance \(\sigma^{2}\), and \(SSTO\) is decomposed into \(k\) sums of squares \(SS_{r}\) with degrees of freedom \(df_{r}\), then \(SS_{r}/\sigma^{2}\) are independent \(\chi_{df_{r}}^{2}\) random variables when \(\sum_{r=1}^{k} df_{r}=n-1\). \(\sharp\)
If \(\beta_{1}=0\), then all \(Y_{i}\) have the same mean \(\mu=\beta_{0}\) and the same variance \(\sigma^{2}\). Since \(SSTO=SSR+SSE\), from the above theorem, \(SSR/\sigma^{2}\) and \(SSE/\sigma^{2}\) are independent \(\chi_{1}^{2}\) and \(\chi_{n-2}^{2}\) random variables. Therefore, when \(H_{0}\) holds, we have
\begin{align*} F^{*} & =\frac{MSR}{MSE}\\ & =\frac{SSR/\sigma^{2}}{1}/\frac{SSE/\sigma^{2}}{n-2}\sim\frac{\chi_{1}^{2}}{1}/\frac{\chi_{n-2}^{2}}{n-2}.\end{align*}
It says that \(F^{*}\) has an \(F_{1,n-2}\)-distribution. The decision rule is as follows when the risk of a Type I error is to be controlled at \(\alpha\).
\[\begin{array}{l}
\mbox{If }F^{*}\leq F_{1-\alpha ;1,n-2},\mbox{ we conclude }H_{0}\\ \mbox{If }F^{*}>F_{1-\alpha ;1,n-2},\mbox{ we conclude }H_{a}.
\end{array}\]
Example. Returning to Example \ref{let1ex1}, for \(\alpha =0.05\), we require \(F_{0.95;1,8}=5.32\). The decision rule is
\[\begin{array}{l}
\mbox{If }F^{*}\leq 5.32,\mbox{ we conclude }H_{0}\\ \mbox{If }F^{*}>5.32,\mbox{ we conclude }H_{a}
\end{array}\]
From the ANOVA table, we have \(MSR=13600\) and \(MSE=7.5\). Therefore, we have \(F^{*}=13600/7.5=1813>5.32\). We conclude \(H_{a}\), i.e., \(\beta_{1}\neq 0\), which says that there is a linear association between advertising expenditures and sales. This is the same result as when the \(t\) test was employed above. \(\sharp\)
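The ANOVA decomposition and the \(F\) test can be reproduced in Python; a sketch, with the critical value \(F_{0.95;1,8}=5.32\) taken from an \(F\) table:

```python
# ANOVA decomposition SSTO = SSR + SSE and the F test for the
# advertising data; F_{0.95;1,8} = 5.32 from an F table.
X = [30, 20, 60, 80, 40, 50, 60, 30, 70, 60]
Y = [73, 50, 128, 170, 87, 108, 135, 69, 148, 132]
n = len(X)
xbar, ybar = sum(X) / n, sum(Y) / n
b1 = 2.0

ssto = sum((y - ybar) ** 2 for y in Y)           # 13660.0
ssr = b1**2 * sum((x - xbar) ** 2 for x in X)    # 13600.0
sse = ssto - ssr                                 # 60.0
f_star = (ssr / 1) / (sse / (n - 2))             # MSR / MSE

print(round(f_star, 1))      # 1813.3
print(f_star > 5.32)         # True -> conclude Ha
```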
\begin{equation}{\label{b}}\tag{B}\mbox{}\end{equation}
General Linear Regression Model.
We now generalize the simple regression model to the multiple regression case. Now, we observe several \(x\)-values \(x_{1},x_{2}, \cdots ,x_{k}\), along with the \(y\)-value. For example, suppose that \(x_{1}\) equals the student’s ACT composite score, \(x_{2}\) equals the student’s high school class rank, and \(y\) equals the student’s first year GPA in college. We want to estimate a regression function \(\mathbb{E}(Y)=\mu (x_{1}, x_{2},\cdots ,x_{k})\) from some observed data. If
\[\mu (x_{1},x_{2},\cdots ,x_{k})=\beta_{1}x_{1}+\beta_{2}x_{2}+\cdots +\beta_{k}x_{k},\]
then we say that we have a linear model, since this expression is linear in the coefficients \(\beta_{1},\beta_{2},\cdots ,\beta_{k}\). Our \(n\) observation points are
\[(x_{1j},x_{2j},\cdots ,x_{kj},y_{j}), j=1,\cdots ,n.\]
To fit the linear model
\[\beta_{1}x_{1}+\beta_{2}x_{2}+\cdots +\beta_{k}x_{k}\]
by the method of least squares, we minimize
\[G=\sum_{j=1}^{n} (y_{j}-\beta_{1}x_{1j}-\beta_{2}x_{2j}-\cdots -\beta_{k}x_{kj})^{2}.\]
If we equate to zero the \(k\) first partial derivatives
\[\frac{\partial G}{\partial\beta_{i}}=\sum_{j=1}^{n}(-2)(y_{j}-\beta_{1}
x_{1j}-\beta_{2}x_{2j}-\cdots -\beta_{k}x_{kj})(x_{ij}), i=1,\cdots ,k,\]
we obtain the \(k\) normal equations
\begin{align*}
\beta_{1}\sum_{j=1}^{n} x_{1j}^{2}+\beta_{2}\sum_{j=1}^{n} x_{1j}x_{2j}+\cdots +\beta_{k}\sum_{j=1}^{n} x_{1j}x_{kj} & =\sum_{j=1}^{n} x_{1j}y_{j},\\
\beta_{1}\sum_{j=1}^{n} x_{2j}x_{1j}+\beta_{2}\sum_{j=1}^{n} x_{2j}^{2}+\cdots +\beta_{k}\sum_{j=1}^{n} x_{2j}x_{kj} & =\sum_{j=1}^{n} x_{2j}y_{j},\\
\vdots & \vdots\\
\beta_{1}\sum_{j=1}^{n} x_{kj}x_{1j}+\beta_{2}\sum_{j=1}^{n} x_{kj}x_{2j}+\cdots +\beta_{k}\sum_{j=1}^{n} x_{kj}^{2} & =\sum_{j=1}^{n} x_{kj}y_{j}.
\end{align*}
The solution of these \(k\) equations provides the least squares estimates of \(\beta_{1},\beta_{2},\cdots ,\beta_{k}\), provided that the random variables \(Y_{1},Y_{2},\cdots ,Y_{n}\) are mutually independent and \(Y_{j}\) is
\[N\left (\beta_{1}x_{1j}+\beta_{2}x_{2j}+\cdots +\beta_{k}x_{kj},\sigma^{2}\right )\]
for \(j=1,\cdots ,n\).
Example. By the method of least squares, fit
\[y=\beta_{1}x_{1}+\beta_{2}x_{2}+\beta_{3}x_{3}\]
to the observed points \((x_{1},x_{2},x_{3},y)\) given by
\[(1,1,0,4),(1,0,1,3),(1,2,3,2),(1,3,0,6),(1,0,0,1).\]
Note that \(x_{1}=1\) in each point. Therefore, we are really fitting
\[y=\beta_{1}+\beta_{2}x_{2}+\beta_{3}x_{3}.\]
Since
\begin{align*}
&\sum_{j=1}^{5} x_{1j}^{2}=5,\quad\sum_{j=1}^{5} x_{1j}x_{2j}=6,\quad\sum_{j=1}^{5} x_{1j}x_{3j}=4, \quad\sum_{j=1}^{5} x_{1j}y_{j}=16,\\
&\sum_{j=1}^{5} x_{2j}x_{1j}=6,\quad\sum_{j=1}^{5} x_{2j}^{2}=14,\quad\sum_{j=1}^{5} x_{2j}x_{3j}=6, \quad\sum_{j=1}^{5} x_{2j}y_{j}=26,\\
&\sum_{j=1}^{5} x_{3j}x_{1j}=4,\quad\sum_{j=1}^{5} x_{3j}x_{2j}=6,\quad\sum_{j=1}^{5} x_{3j}^{2}=10, \quad\sum_{j=1}^{5} x_{3j}y_{j}=9,
\end{align*}
the normal equations are
\begin{align*}
& 5\beta_{1}+6\beta_{2}+4\beta_{3}=16\\
& 6\beta_{1}+14\beta_{2}+6\beta_{3}=26\\
& 4\beta_{1}+6\beta_{2}+10\beta_{3}=9.
\end{align*}
Solving these three linear equations in three unknowns, we obtain
\[\widehat{\beta}_{1}=\frac{274}{112},\quad\widehat{\beta}_{2}=\frac{127}{112}\mbox{ and }\widehat{\beta}_{3}=-\frac{85}{112}.\]
Therefore, the least squares fit is
\[y=\frac{1}{112}(274x_{1}+127x_{2}-85x_{3}).\]
If \(x_{1}\) equals \(1\), then the equation reads
\[y=\frac{1}{112}(274+127x_{2}-85x_{3}).\]
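The normal equations of this example can be solved exactly; a Python sketch using Gaussian elimination over rational numbers:

```python
# Solve the three normal equations of the worked example exactly,
# by Gaussian elimination over exact fractions.
from fractions import Fraction as F

A = [[F(5), F(6), F(4)],
     [F(6), F(14), F(6)],
     [F(4), F(6), F(10)]]
b = [F(16), F(26), F(9)]

# forward elimination
for i in range(3):
    for r in range(i + 1, 3):
        m = A[r][i] / A[i][i]
        for c in range(i, 3):
            A[r][c] -= m * A[i][c]
        b[r] -= m * b[i]

# back substitution
beta = [F(0)] * 3
for i in (2, 1, 0):
    beta[i] = (b[i] - sum(A[i][c] * beta[c] for c in range(i + 1, 3))) / A[i][i]

print([str(x) for x in beta])  # ['137/56', '127/112', '-85/112']
```

Note that \(274/112\) reduces to \(137/56\); the fractions agree with the estimates in the text.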
It is interesting to observe that the usual two-sample problem is actually a linear model. Let \(\beta_{1}=\mu_{1}\) and \(\beta_{2}=\mu_{2}\) and consider \(n\) pairs of \((x_{1},x_{2})\) that equal \((1,0)\) and \(m\) pairs that equal \((0,1)\). This would require each of the first \(n\) \(Y\)-variables, namely \(Y_{1},Y_{2},\cdots ,Y_{n}\), to have the mean
\[\beta_{1}\cdot 1+\beta_{2}\cdot 0=\beta_{1}=\mu_{1}\]
and the next \(m\) \(Y\) variables, namely \(Y_{n+1},Y_{n+2},\cdots Y_{n+m}\), to have the mean
\[\beta_{1}\cdot 0+\beta_{2}\cdot 1=\beta_{2}=\mu_{2}.\]
This is the background of the two-sample problem with the usual \(X_{1},X_{2},\cdots ,X_{n}\) and \(Y_{1},Y_{2},\cdots ,Y_{m}\) replaced by \(Y_{1},Y_{2},\cdots ,Y_{n}\) and \(Y_{n+1},Y_{n+2},\cdots ,Y_{n+m}\). Clearly, this two-sample problem can be extended to three or more samples. For illustration, let \(\beta_{1}=\mu_{1}\), \(\beta_{2}=\mu_{2}\), and \(\beta_{3}=\mu_{3}\) and have \(n_{1}\) triples \((x_{1},x_{2},x_{3})\) equal to \((1,0,0)\), \(n_{2}\) triples equal to \((0,1,0)\), and \(n_{3}\) triples equal to \((0,0,1)\).
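The indicator-variable construction can be illustrated with a small Python sketch; the observations below are hypothetical, made up for illustration:

```python
# The two-sample problem as a linear model: with indicator rows
# (1,0) for the first sample and (0,1) for the second, the least
# squares estimates are just the two sample means.
sample1 = [4.0, 6.0, 5.0]   # hypothetical observations with mean mu1
sample2 = [10.0, 12.0]      # hypothetical observations with mean mu2
rows = [(1, 0, y) for y in sample1] + [(0, 1, y) for y in sample2]

# The normal equations decouple because the indicator columns are
# orthogonal: X'X = diag(n, m), so each beta is a sample mean.
s11 = sum(x1 * x1 for x1, x2, y in rows)   # n
s22 = sum(x2 * x2 for x1, x2, y in rows)   # m
b1_hat = sum(x1 * y for x1, x2, y in rows) / s11
b2_hat = sum(x2 * y for x1, x2, y in rows) / s22
print(b1_hat, b2_hat)  # 5.0 11.0
```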
We consider the following general linear regression model
\[Y_{i}=\beta_{0}+\beta_{1}X_{i1}+\beta_{2}X_{i2}+\cdots +\beta_{p-1}X_{i,p-1}+\varepsilon_{i}\]
\(i=1,\cdots ,n\), where
- \(\beta_{0},\beta_{1},\cdots ,\beta_{p-1}\) are parameters;
- \(X_{i1},\cdots ,X_{i,p-1}\) are known constants;
- \(\varepsilon_{i}\) are independent \(N(0,\sigma^{2})\).
Let \({\bf C}=[Y_{ij}]_{n\times k}\) be an \(n\times k\) matrix whose elements \(Y_{ij}\) are random variables. Assuming that the \(\mathbb{E}(Y_{ij})\) are finite, \(\mathbb{E}({\bf C})\) is defined as \(\mathbb{E}({\bf C})=[\mathbb{E}(Y_{ij})]_{n\times k}\). In particular, for \({\bf Y}=(Y_{1},\cdots ,Y_{n})^{t}\), we have
\[\mathbb{E}({\bf Y})=(\mathbb{E}(Y_{1}),\cdots ,\mathbb{E}(Y_{n}))^{t},\]
and for
\[{\bf C}= ({\bf Y}-\mathbb{E}({\bf Y}))({\bf Y}-\mathbb{E}({\bf Y}))^{t},\]
we have
\[\mathbb{E}({\bf C})=\mathbb{E}[({\bf Y}-\mathbb{E}({\bf Y}))({\bf Y}-\mathbb{E}({\bf Y}))^{t}].\]
The covariance matrix of \({\bf Y}\), denoted by \(\Sigma_{\bf Y}\), is defined by \(\Sigma_{\bf Y}=\mathbb{E}({\bf C})\). More precisely, we have
\[\Sigma_{\bf Y}=\left [\begin{array}{cccc} \mathbb{V}(Y_{1}) & \mbox{Cov}(Y_{1},Y_{2}) & \cdots & \mbox{Cov}(Y_{1},Y_{n})\\
\mbox{Cov}(Y_{2},Y_{1}) & \mathbb{V}(Y_{2}) & \cdots & \mbox{Cov}(Y_{2},Y_{n})\\
\vdots & \cdots & &\vdots\\
\mbox{Cov}(Y_{n},Y_{1}) & \mbox{Cov}(Y_{n},Y_{2}) & \cdots & \mathbb{V}(Y_{n})
\end{array}\right ].\]
We define the following matrices
\[{\bf Y}=\left [\begin{array}{c}
Y_{1}\\ Y_{2}\\ \vdots\\ Y_{n} \end{array}\right ],
{\bf X}=\left [\begin{array}{ccccc}
1 & X_{11} & X_{12} & \cdots & X_{1,p-1}\\
1 & X_{21} & X_{22} & \cdots & X_{2,p-1}\\
\vdots & \vdots & \vdots & & \vdots\\
1 & X_{n1} & X_{n2} & \cdots & X_{n,p-1}
\end{array}\right ],
\beta=\left [\begin{array}{c}
\beta_{0}\\ \beta_{1}\\ \vdots\\ \beta_{p-1} \end{array}\right ]\mbox{ and }
\varepsilon=\left [\begin{array}{c}
\varepsilon_{1}\\ \varepsilon_{2}\\ \vdots\\ \varepsilon_{n} \end{array}\right ].\]
Then, the general linear model can be written in the following compact form
\[{\bf Y}={\bf X}\beta+\varepsilon.\]
Here \(\varepsilon\) is a vector of independent normal random variables with expectation \(\mathbb{E}(\varepsilon)={\bf 0}\) and covariance matrix \(\Sigma_{\varepsilon}= \sigma^{2}{\bf I}_{n}\), where \({\bf I}_{n}\) denotes the \(n\times n\) identity matrix. Consequently, the random vector \({\bf Y}\) has expectation \(\mathbb{E}({\bf Y})={\bf X}\beta\) and covariance matrix \(\Sigma_{\bf Y}=\sigma^{2}{\bf I}_{n}\).
Let us denote the vector of estimated regression coefficients \(b_{0},b_{1}, \cdots ,b_{p-1}\) as \({\bf b}=(b_{0},b_{1},\cdots,b_{p-1})^{t}\). The least squares normal equations for the general linear regression model are
\[{\bf X}^{t}{\bf Xb}={\bf X}^{t}{\bf Y},\]
and the least squares estimators are
\[{\bf b}=({\bf X}^{t}{\bf X})^{-1}{\bf X}^{t}{\bf Y}.\]
These least squares estimators are also the MLE, and they are minimum variance unbiased, consistent and sufficient.
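The matrix formula can be checked against the simple regression results of the advertising example; a Python sketch with an explicit \(2\times 2\) inverse for the design matrix \([{\bf 1},X]\):

```python
# The closed form b = (X'X)^{-1} X'Y recovers the simple-regression
# estimates b0 = 10, b1 = 2 of the advertising example, using an
# explicit 2x2 inverse for the design matrix with an intercept column.
X = [30, 20, 60, 80, 40, 50, 60, 30, 70, 60]
Y = [73, 50, 128, 170, 87, 108, 135, 69, 148, 132]
n = len(X)

# entries of X'X and X'Y
sx, sxx = sum(X), sum(x * x for x in X)
sy, sxy = sum(Y), sum(x * y for x, y in zip(X, Y))

det = n * sxx - sx * sx            # determinant of X'X
b0 = (sxx * sy - sx * sxy) / det
b1 = (n * sxy - sx * sy) / det
print(b0, b1)  # 10.0 2.0
```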
Let the vector of fitted values \(\hat{Y}_{i}\) be denoted by \(\hat{\bf Y}\), and the vector of residuals \(e_{i}=Y_{i}-\hat{Y}_{i}\) be denoted by \({\bf e}\). We have
\[\hat{{\bf Y}}={\bf Xb}={\bf HY},\]
where
\[{\bf H}={\bf X}({\bf X}^{t}{\bf X})^{-1}{\bf X}^{t}\]
and
\[{\bf e}={\bf Y}-\hat{{\bf Y}}={\bf Y}-{\bf Xb}=({\bf I}-{\bf H}){\bf Y}.\]
The covariance matrix of the residuals is given by
\[\Sigma_{{\bf e}}=\sigma^{2}({\bf I}-{\bf H}),\]
which is estimated by
\[\widehat{\Sigma}_{{\bf e}}=MSE({\bf I}-{\bf H}).\]
Let \({\bf J}\) denote the \(n\times n\) matrix whose entries are all equal to \(1\):
\[{\bf J}=\left [\begin{array}{ccc}
1 & \cdots & 1\\ \vdots && \vdots\\ 1 & \cdots & 1
\end{array}\right ].\]
Then, we have
\begin{align*} SSTO & ={\bf Y}^{t}{\bf Y}-\left (\frac{1}{n}\right ){\bf Y}^{t}{\bf JY}\\ & ={\bf Y}^{t}\left [{\bf I}-\left (\frac{1}{n}\right ){\bf J}\right ]{\bf Y}\\
SSE & ={\bf e}^{t}{\bf e}\\ & =({\bf Y}-{\bf Xb})^{t}({\bf Y}-{\bf Xb})\\ & ={\bf Y}^{t}{\bf Y}-{\bf b}^{t}{\bf X}^{t}{\bf Y}\\ & ={\bf Y}^{t}({\bf I}-{\bf H}){\bf Y}\\
SSR & ={\bf b}^{t}{\bf X}^{t}{\bf Y}-\left (\frac{1}{n}\right ){\bf Y}^{t}{\bf JY}\\ & ={\bf Y}^{t}\left [{\bf H}-\left (\frac{1}{n}\right ){\bf J}\right ]{\bf Y}.\end{align*}
The \(SSTO\) has \(n-1\) degrees of freedom associated with it. The \(SSE\) has \(n-p\) degrees of freedom associated with it, since \(p\) parameters need to be estimated in the regression function. Finally, the \(SSR\) has \(p-1\) degrees of freedom associated with it, which represents the number of \(X\) variables \(X_{1},\cdots ,X_{p-1}\). Then, the mean squares are given by
\[MSR=\frac{SSR}{p-1}\mbox{ and }MSE=\frac{SSE}{n-p}.\]
We have the following ANOVA table for general linear regression model
\[\begin{array}{lccc}\hline
\hline \mbox{Source of Variation} & SS & df & MS \\
\hline \mbox{Regression} & {\displaystyle SSR={\bf b}^{t}{\bf X}^{t}{\bf Y}-\left (\frac{1}{n}\right ){\bf Y}^{t}{\bf JY}} & p-1 & {\displaystyle MSR=\frac{SSR}{p-1}}\\
\mbox{Error} & SSE={\bf Y}^{t}{\bf Y}-{\bf b}^{t}{\bf X}^{t}{\bf Y} & n-p & {\displaystyle MSE=\frac{SSE}{n-p}}\\
\hline \mbox{Total} & {\displaystyle SSTO={\bf Y}^{t}{\bf Y}-\left (\frac{1}{n}\right ) {\bf Y}^{t}{\bf JY}} & n-1 &\\
\hline\end{array}\]
To test whether there is a regression relation between the dependent variable \(Y\) and the set of independent variables \(X_{1},\cdots ,X_{p-1}\), we consider the following hypotheses
\[\begin{array}{l}
H_{0}:\beta_{1}=\beta_{2}=\cdots =\beta_{p-1}=0\\
H_{a}:\mbox{not all \(\beta_{k}\) \((k=1,\cdots ,p-1)\) equal zero}.
\end{array}\]
We use the test statistic
\[F^{*}=\frac{MSR}{MSE}.\]
The decision rule to control the Type I error at \(\alpha\) is given by
\[\begin{array}{l}
\mbox{If }F^{*}\leq F_{1-\alpha ;p-1,n-p},\mbox{ we conclude }H_{0}\\ \mbox{If }F^{*}>F_{1-\alpha ;p-1,n-p},\mbox{ we conclude }H_{a}.
\end{array}\]
The least squares estimators in \({\bf b}\) are unbiased, i.e. \(\mathbb{E}({\bf b})=\beta\). The covariance matrix of \({\bf b}\) is
\[\Sigma_{\bf b}=\left [\begin{array}{cccc} \mathbb{V}(b_{0}) & \mbox{Cov}(b_{0},b_{1}) & \cdots & \mbox{Cov}(b_{0},b_{p-1})\\
\mbox{Cov}(b_{1},b_{0}) & \mathbb{V}(b_{1}) & \cdots & \mbox{Cov}(b_{1},b_{p-1})\\
\vdots & \vdots && \vdots\\
\mbox{Cov}(b_{p-1},b_{0}) & \mbox{Cov}(b_{p-1},b_{1}) & \cdots & \mathbb{V}(b_{p-1})
\end{array}\right ]=\sigma^{2}({\bf X}^{t}{\bf X})^{-1}.\]
The estimated covariance matrix \(\widehat{\Sigma}_{\bf b}\) is given by
\begin{align*} \widehat{\Sigma}_{\bf b} & =\left [\begin{array}{cccc} \hat{\sigma}^{2}(b_{0}) & \widehat{\mbox{Cov}}(b_{0},b_{1}) & \cdots & \widehat{\mbox{Cov}}(b_{0},b_{p-1})\\
\widehat{\mbox{Cov}}(b_{1},b_{0}) & \hat{\sigma}^{2}(b_{1}) & \cdots & \widehat{\mbox{Cov}}(b_{1},b_{p-1})\\
\vdots & \vdots && \vdots\\
\widehat{\mbox{Cov}}(b_{p-1},b_{0}) & \widehat{\mbox{Cov}}(b_{p-1},b_{1}) & \cdots & \hat{\sigma}^{2}(b_{p-1})\end{array}\right ]=MSE({\bf X}^{t}{\bf X})^{-1}.\end{align*}
From \(\widehat{\Sigma}_{\bf b}\), we can obtain \(\hat{\sigma}^{2}(b_{k})\). It can be shown that
\[\frac{b_{k}-\beta_{k}}{\hat{\sigma}(b_{k})}\]
is distributed as \(t_{n-p}\) for \(k=0,1,\cdots ,p-1\). Therefore, the \(1-\alpha\) confidence interval for \(\beta_{k}\) is given by
\[b_{k}\pm t_{1-\alpha /2;n-p}\hat{\sigma}(b_{k}).\]
To test
\[\begin{array}{l}
H_{0}:\beta_{k}=0\\ H_{a}:\beta_{k}\neq 0
\end{array}\]
we may use the test statistic
\[t^{*}=\frac{b_{k}}{\hat{\sigma}(b_{k})}.\]
The decision rule is given by
\[\begin{array}{l}
\mbox{If }|t^{*}|\leq t_{1-\alpha /2;n-p},\mbox{ we conclude }H_{0}\\ \mbox{If }|t^{*}|>t_{1-\alpha /2;n-p},\mbox{ we conclude }H_{a}.
\end{array}\]
For given values \(X_{h1},\cdots ,X_{h,p-1}\), let \({\bf X}_{h}=(1,X_{h1},X_{h2},\cdots,X_{h,p-1})^{t}\). The mean response is denoted by \(\mathbb{E}(Y_{h})\). We know \(\mathbb{E}(Y_{h})={\bf X}_{h}^{t}\beta\). The estimated mean response corresponding to \({\bf X}_{h}\), denoted by \(\hat{Y}_{h}\), is \(\hat{Y}_{h}={\bf X}_{h}^{t} {\bf b}\). This estimator is unbiased, since
\[\mathbb{E}(\hat{Y}_{h})={\bf X}_{h}^{t}\beta=\mathbb{E}(Y_{h})\]
and its variance is given by
\begin{align*} \mathbb{V}(\hat{Y}_{h}) & =\sigma^{2}{\bf X}_{h}^{t}({\bf X}^{t}{\bf X})^{-1}{\bf X}_{h}\\ & ={\bf X}_{h}^{t}\Sigma_{\bf b}{\bf X}_{h}.\end{align*}
The estimated variance \(\hat{\sigma}^{2}(\hat{Y}_{h})\) is given by
\begin{align*} \hat{\sigma}^{2}(\hat{Y}_{h}) & =MSE({\bf X}_{h}^{t}({\bf X}^{t}{\bf X})^{-1}{\bf X}_{h})\\ & ={\bf X}_{h}^{t}\widehat{\Sigma}_{\bf b}{\bf X}_{h}.\end{align*}
The \(1-\alpha\) confidence interval for \(\mathbb{E}(Y_{h})\) is given by
\[\hat{Y}_{h}\pm t_{1-\alpha /2;n-p}\hat{\sigma}(\hat{Y}_{h}).\]
Example. We have the following data
\[\begin{array}{cccc}
\hline \mbox{District} & \mbox{Sales} & \mbox{Population} & \mbox{Income}\\
i & Y_{i} & X_{i1} & X_{i2}\\
\hline 1 & 162 & 274 & 2450\\
2 & 120 & 180 & 3254\\
3 & 223 & 375 & 3802\\
4 & 131 & 205 & 2838\\
5 & 67 & 86 & 2347\\
6 & 169 & 265 & 3782\\
7 & 81 & 98 & 3008\\
8 & 192 & 330 & 2450\\
9 & 116 & 195 & 2137\\
10 & 55 & 53 & 2560\\
11 & 252 & 430 & 4020\\
12 & 232 & 372 & 4427\\
13 & 144 & 236 & 2660\\
14 & 103 & 157 & 2088\\
15 & 212 & 370 & 2605\\
\hline\end{array}\]
The regression model is
\[Y_{i}=\beta_{0}+\beta_{1}X_{i1}+\beta_{2}X_{i2}+\varepsilon_{i}.\]
We obtain
\[{\bf X}^{t}{\bf X}=\left [\begin{array}{ccc}
15 & 3626 & 44428\\ 3626 & 1067614 & 11419181\\ 44428 & 11419181 & 139063428
\end{array}\right ],
{\bf X}^{t}{\bf Y}=\left [\begin{array}{c}
2259\\ 647107\\ 7096619 \end{array}\right ]\]
and
\[({\bf X}^{t}{\bf X})^{-1}=\left [\begin{array}{ccc}
1.2463484 & 2.1296642\times 10^{-4} & -4.1567125\times 10^{-4}\\
2.1296642\times 10^{-4} & 7.7329030\times 10^{-6} & -7.0302518\times 10^{-7}\\
-4.1567125\times 10^{-4} & -7.0302518\times 10^{-7} & 1.9771851\times 10^{-7}
\end{array}\right ].\]
The estimates are
\begin{align*} {\bf b} & =({\bf X}^{t}{\bf X})^{-1}{\bf X}^{t}{\bf Y}\\ & =\left [\begin{array}{c}
3.453\\ 0.496\\ 0.0092 \end{array}\right ]\\ & =\left [\begin{array}{c}
b_{0}\\ b_{1}\\ b_{2} \end{array}\right ].\end{align*}
Therefore, the estimated regression function is given by
\[\hat{Y}=3.453+0.496X_{1}+0.0092X_{2}.\]
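As a quick numerical check of these estimates (a sketch assuming numpy is available; not part of the original computation), one can solve the normal equations directly from the data table:

```python
import numpy as np

# District sales data from the table above
Y  = np.array([162, 120, 223, 131, 67, 169, 81, 192, 116, 55,
               252, 232, 144, 103, 212], dtype=float)
X1 = np.array([274, 180, 375, 205, 86, 265, 98, 330, 195, 53,
               430, 372, 236, 157, 370], dtype=float)
X2 = np.array([2450, 3254, 3802, 2838, 2347, 3782, 3008, 2450,
               2137, 2560, 4020, 4427, 2660, 2088, 2605], dtype=float)
X = np.column_stack([np.ones(len(Y)), X1, X2])  # design matrix with intercept column

b = np.linalg.solve(X.T @ X, X.T @ Y)  # b = (X'X)^{-1} X'Y
# should reproduce b = (b0, b1, b2) ≈ (3.453, 0.496, 0.0092) as above
```

Using `np.linalg.solve` on the normal equations avoids forming \(({\bf X}^{t}{\bf X})^{-1}\) explicitly, which is numerically preferable.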
For the ANOVA approach, we have
\begin{align*}
& SSTO={\bf Y}^{t}{\bf Y}-\left (\frac{1}{n}\right ){\bf Y}^{t}{\bf JY}=394107-340205.4=53901.6\\
& SSE={\bf Y}^{t}{\bf Y}-{\bf b}^{t}{\bf X}^{t}{\bf Y}=394107-394050.116=56.884\\
& SSR=SSTO-SSE=53901.6-56.884=53844.716
\end{align*}
Then, we have the following ANOVA table
\[\begin{array}{lccc}
\hline \mbox{Source of Variation} & SS & df & MS\\
\hline \mbox{Regression} & SSR=53844.716 & 2 & MSR=26922.358\\
\mbox{Error} & SSE=56.884 & 12 & MSE=4.74\\
\hline \mbox{Total} & SSTO=53901.6 & 14 & \\
\hline\end{array}\]
To test whether sales are related to population and income, we consider the hypotheses
\[\begin{array}{l}
H_{0}:\beta_{1}=\beta_{2}=0\\ H_{a}:\mbox{not both \(\beta_{1}\) and \(\beta_{2}\) equal zero}
\end{array}\]
We use the test statistic
\begin{align*} F^{*} & =\frac{MSR}{MSE}\\ & =\frac{26922.358}{4.74}=5680.\end{align*}
Taking \(\alpha =0.05\), we require \(F_{0.95;2,12}=3.89\). Since \(F^{*}=5680>3.89\), we conclude \(H_{a}\): sales are related to population and income.
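The ANOVA quantities and the \(F\) test can be verified the same way (a numpy/scipy sketch using the data from the example; not part of the original text):

```python
import numpy as np
from scipy.stats import f

# District sales data from the example
Y  = np.array([162, 120, 223, 131, 67, 169, 81, 192, 116, 55,
               252, 232, 144, 103, 212], dtype=float)
X1 = np.array([274, 180, 375, 205, 86, 265, 98, 330, 195, 53,
               430, 372, 236, 157, 370], dtype=float)
X2 = np.array([2450, 3254, 3802, 2838, 2347, 3782, 3008, 2450,
               2137, 2560, 4020, 4427, 2660, 2088, 2605], dtype=float)
n, p = len(Y), 3
X = np.column_stack([np.ones(n), X1, X2])

b = np.linalg.solve(X.T @ X, X.T @ Y)
SSTO = Y @ Y - Y.sum() ** 2 / n             # SSTO = Y'Y - (1/n) Y'JY
SSE = Y @ Y - b @ (X.T @ Y)                 # SSE = Y'Y - b'X'Y
SSR = SSTO - SSE

F_star = (SSR / (p - 1)) / (SSE / (n - p))  # F* = MSR / MSE
F_crit = f.ppf(0.95, p - 1, n - p)          # F_{0.95; 2, 12} ≈ 3.89
# F_star > F_crit, so conclude H_a
```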
The estimated covariance matrix \(\widehat{\Sigma}_{\bf b}\) is given by
\begin{align*} \widehat{\Sigma}_{\bf b} & =MSE({\bf X}^{t}{\bf X})^{-1}\\ & =
\left [\begin{array}{ccc}
5.9081 & 1.0095\times 10^{-3} & -1.9704\times 10^{-3}\\
1.0095\times 10^{-3} & 3.6656\times 10^{-5} & -3.3326\times 10^{-6}\\
-1.9704\times 10^{-3} & -3.3326\times 10^{-6} & 9.3725\times 10^{-7}
\end{array}\right ].\end{align*}
We would like to estimate expected (mean) sales in a district with population \(X_{h1}=220\) and income \(X_{h2}=2500\). We define \({\bf X}_{h}=(1,220,2500)^{t}\). The point estimate of mean sales is given by
\begin{align*} \hat{Y}_{h} & ={\bf X}_{h}^{t}{\bf b}\\ & =\left[\begin{array}{ccc}
1 & 220 & 2500 \end{array}\right ]\left [\begin{array}{c}
3.4526\\ 0.496\\ 0.0092 \end{array}\right ]\\ & =135.57.\end{align*}
The estimated variance is given by
\[\hat{\sigma}^{2}(\hat{Y}_{h})={\bf X}_{h}^{t}\widehat{\Sigma}_{\bf b}{\bf X}_{h}=0.46638\mbox{ or }\hat{\sigma}(\hat{Y}_{h})=0.68292.\]
Assume that the confidence coefficient for the interval estimate of \(\mathbb{E}(Y_{h})\) is taken to be \(0.95\). We then need \(t_{0.975;12}=2.179\). The confidence interval for \(\mathbb{E}(Y_{h})\) is \(135.57\pm 2.179\times 0.68292\), i.e. \(134.1\leq\mathbb{E}(Y_{h})\leq 137.1\). \(\sharp\)
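The interval estimate for the mean response can also be reproduced numerically. This is again a sketch with numpy/scipy on the example data; the variable names are mine:

```python
import numpy as np
from scipy.stats import t

# District sales data from the example
Y  = np.array([162, 120, 223, 131, 67, 169, 81, 192, 116, 55,
               252, 232, 144, 103, 212], dtype=float)
X1 = np.array([274, 180, 375, 205, 86, 265, 98, 330, 195, 53,
               430, 372, 236, 157, 370], dtype=float)
X2 = np.array([2450, 3254, 3802, 2838, 2347, 3782, 3008, 2450,
               2137, 2560, 4020, 4427, 2660, 2088, 2605], dtype=float)
n, p = len(Y), 3
X = np.column_stack([np.ones(n), X1, X2])

b = np.linalg.solve(X.T @ X, X.T @ Y)
MSE = (Y @ Y - b @ (X.T @ Y)) / (n - p)

Xh = np.array([1.0, 220.0, 2500.0])        # population 220, income 2500
Yh_hat = Xh @ b                            # point estimate, ≈ 135.57
se_Yh = np.sqrt(MSE * Xh @ np.linalg.solve(X.T @ X, Xh))  # sigma-hat(Y_h)
t_crit = t.ppf(0.975, n - p)               # t_{0.975;12} ≈ 2.179
lower = Yh_hat - t_crit * se_Yh
upper = Yh_hat + t_crit * se_Yh
# roughly 134.1 <= E(Y_h) <= 137.1
```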


