This note has the following sections.
\begin{equation}{\label{a}}\tag{A}\mbox{}\end{equation}
Maximum Likelihood Estimator.
Suppose that a random experiment is repeated \(n\) times with observations \(x_{1},x_{2},\cdots ,x_{n}\). The collection of these \(n\) values, \(x_{1},x_{2},\cdots ,x_{n}\) is called a sample. There are many characteristics associated with these data. A measure of the center of the data is called the mean of the sample, or the sample mean, which is defined by
\[\bar{x}=\frac{1}{n}\sum_{i=1}^{n} x_{i}.\]
The variance of the sample, or the sample variance, is defined by
\[s^{2}=\frac{1}{n-1}\sum_{i=1}^{n} (x_{i}-\bar{x})^{2}.\]
The sample standard deviation \(s=\sqrt{s^{2}}\) gives a measure of how dispersed the data are from the sample mean.
For example, rolling a fair eight-sided die five times could result in the sample of \(n=5\) observations \(x_{1}=3\), \(x_{2}=7\), \(x_{3}=2\), \(x_{4}=5\) and \(x_{5}=3\). In this case, we have
\[\bar{x}=\frac{3+7+2+5+3}{5}=4\]
and
\[s^{2}=\frac{(3-4)^{2}+(7-4)^{2}+(2-4)^{2}+(5-4)^{2}+(3-4)^{2}}{4}=\frac{16}{4}=4.\]
It follows that \(s=\sqrt{4}=2\). The standard deviation \(s\) can be thought of, roughly, as the average distance of the \(x\)-values from the sample mean \(\bar{x}\). This is not literally the case: in this example the distances from \(\bar{x}=4\) are \(1,3,2,1,1\), with average \(1.6\). In general, \(s\) will be somewhat larger than this average distance, but
this approximation should give the reader some idea of the meaning of a standard deviation. There is an alternative way of computing \(s^{2}\), since we have
\begin{align*} \sum_{i=1}^{n} (x_{i}-\bar{x})^{2} & =\sum_{i=1}^{n}(x_{i}^{2}-2\bar{x}x_{i}+\bar{x}^{2})\\ & =\sum_{i=1}^{n} x_{i}^{2}-
2\bar{x}\sum_{i=1}^{n} x_{i}+\sum_{i=1}^{n} \bar{x}^{2}\\ & =\sum_{i=1}^{n} x_{i}^{2}-2\bar{x}(n\bar{x})+n\bar{x}^{2}\\ & =\sum_{i=1}^{n} x_{i}^{2}-n\bar{x}^{2}.\end{align*}
Therefore, we obtain
\begin{align*} s^{2} & =\frac{{\displaystyle \sum_{i=1}^{n} x_{i}^{2}-n\bar{x}^{2}}}{n-1}\\ & =\frac{{\displaystyle \sum_{i=1}^{n} x_{i}^{2}-\frac{1}{n}\left (
\sum_{i=1}^{n} x_{i}\right )^{2}}}{n-1}.\end{align*}
In this example, we have
\[s^{2}=\frac{3^{2}+7^{2}+2^{2}+5^{2}+3^{2}-5\cdot 4^{2}}{4}=\frac{16}{4}=4.\]
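As a quick numerical check (a minimal sketch, assuming NumPy is available; the data are the five die rolls from the example above), both formulas for \(s^{2}\) give the same value:

```python
import numpy as np

x = np.array([3, 7, 2, 5, 3])   # the five die rolls from the example
n = len(x)

xbar = x.mean()                                        # sample mean
s2_def = ((x - xbar) ** 2).sum() / (n - 1)             # defining formula
s2_alt = (np.sum(x ** 2) - n * xbar ** 2) / (n - 1)    # shortcut formula

print(xbar, s2_def, s2_alt)      # 4.0 4.0 4.0
print(np.var(x, ddof=1))         # NumPy's ddof=1 uses the same n-1 divisor
```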
The sample mean \(\bar{x}\) can be thought of as an estimate of the distribution mean \(\mu\), and the sample variance \(s^{2}\) can be thought of as an estimate of the distribution variance \(\sigma^{2}\).
We consider random variables for which the functional form of the p.d.f. is known, and the distribution depends on the unknown parameter \(\theta\) that may have any value in a set \(\Omega\), which is called the parameter space. For example, we consider
\[f(x;\theta )=(1/\theta )e^{-x/\theta}\]
for \(0<x<\infty\) and
\[\theta\in\Omega =\{\theta :0<\theta <\infty\}.\]
In certain instances, it might be necessary for the experimenter to select precisely one member of the family
\[\{f(x;\theta ):\theta\in\Omega\}\]
as the most likely p.d.f. of the random variable. That is, the experimenter needs a point estimate of the parameter \(\theta\), namely the value of the parameter that
corresponds to the p.d.f. selected. In estimation, we take a random sample from the distribution to elicit some information about the unknown parameter \(\theta\). In other words, we repeat the experiment \(n\) independent times, observe the sample \(X_{1},X_{2},\cdots ,X_{n}\), and try to estimate the value of \(\theta\) using the observed values \(x_{1},x_{2},\cdots , x_{n}\). The statistic \(u(X_{1},X_{2},\cdots ,X_{n})\) used to estimate \(\theta\) is called an estimator of \(\theta\), and we want the computed estimate \(u(x_{1},x_{2},\cdots ,x_{n})\) to be close to \(\theta\). Since we are estimating one member of \(\Omega\), this estimator is often called a point estimator.
Suppose that \(X\) is \(B(1,p)\) such that the p.d.f. of \(X\) is \(f(x;p)=p^{x}(1-p)^{1-x}\) for \(x=0,1\) and \(0\leq p\leq 1\). We note that
\[p\in\Omega =\{p:0\leq p\leq 1\},\]
where \(\Omega\) represents the parameter space; that is, the space of all possible values of the parameter \(p\). Given a random sample \(X_{1},X_{2},\cdots ,X_{n}\), the
problem is to find an estimate \(u(X_{1},X_{2},\cdots ,X_{n})\) such that \(u(x_{1},x_{2},\cdots ,x_{n})\) is a good point estimate of \(p\), where \(x_{1},x_{2},\cdots ,x_{n}\) are the observed values of the random sample. Now, we have
\begin{align*} \mathbb{P}(X_{1}=x_{1},\cdots ,X_{n}=x_{n}) & =\prod_{i=1}^{n} p^{x_{i}}(1-p)^{1-x_{i}}\\ & =p^{\sum x_{i}}(1-p)^{n-\sum x_{i}}\end{align*}
which is the joint p.d.f. of \(X_{1},X_{2},\cdots ,X_{n}\) evaluated at the values observed. One reasonable way to proceed toward finding a good estimate of \(p\) is to regard this joint p.d.f. as a function of \(p\) and find the value of \(p\) that maximizes it. In other words, we are going to find the \(p\) value most likely to have produced these sample values. The joint p.d.f., when regarded as a function of \(p\), is frequently called the likelihood function. Therefore, the likelihood function is given by
\begin{align*} L(p) & =L(p;x_{1},\cdots ,x_{n})\\ & =f(x_{1};p)\cdots f(x_{n};p)\\ & =p^{\sum x_{i}}(1-p)^{n-\sum x_{i}}.\end{align*}
To find the value of \(p\) that maximizes \(L(p)\), we first take its derivative for \(0<p<1\):
\[\frac{dL}{dp}=\left (\sum x_{i}\right )p^{\sum x_{i}-1}(1-p)^{n-\sum x_{i}}-\left (n-\sum x_{i}\right )p^{\sum x_{i}}(1-p)^{n-\sum x_{i}-1}.\]
Setting this first derivative equal to zero gives us
\[p^{\sum x_{i}}(1-p)^{n-\sum x_{i}}\left [\frac{\sum x_{i}}{p}-\frac{n-\sum x_{i}}{1-p}\right ]=0.\]
Since \(0<p<1\), this equals zero when
\[\frac{\sum x_{i}}{p}-\frac{n-\sum x_{i}}{1-p}=0.\]
Therefore, we obtain
\[p=\frac{\sum x_{i}}{n}=\bar{x}.\]
The corresponding statistic \(\frac{1}{n}\sum_{i=1}^{n} X_{i}=\bar{X}\) is called the maximum likelihood estimator and is denoted by \(\widehat{p}\); that is,
\[\widehat{p}=\frac{1}{n}\sum_{i=1}^{n} X_{i}=\bar{X}.\]
When finding a maximum likelihood estimator, it is often easier to find the value of the parameter that maximizes the natural logarithm of the likelihood function rather than the value of the parameter that maximizes the likelihood function itself. Since the natural logarithm function is an increasing function, the solutions will be the same. To see this in the preceding example, for \(0<p<1\), we have
\[\ln L(p)=\left (\sum_{i=1}^{n} x_{i}\right )\ln p+\left (n-\sum_{i=1}^{n} x_{i}\right )\ln (1-p).\]
To find the maximum, we set the first derivative equal to zero to obtain
\[\frac{d\ln L(p)}{dp}=\left (\sum_{i=1}^{n} x_{i}\right )\left (\frac{1}{p}\right )+\left (n-\sum_{i=1}^{n} x_{i}\right )\left (\frac{-1}{1-p}\right )=0.\]
Therefore, the solution is \(p=\bar{x}\), and the maximum likelihood estimator for \(p\) is \(\widehat{p}=\bar{X}\).
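As an illustration only (a sketch with hypothetical data, assuming NumPy), one can evaluate \(\ln L(p)\) on a grid and check that it peaks at \(p=\bar{x}\):

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 0, 1])   # hypothetical Bernoulli observations
n, s = len(x), x.sum()

def log_lik(p):
    # log-likelihood of B(1, p): sum(x) ln p + (n - sum(x)) ln(1 - p)
    return s * np.log(p) + (n - s) * np.log(1 - p)

grid = np.linspace(0.001, 0.999, 9981)
p_hat = grid[np.argmax(log_lik(grid))]
print(p_hat, x.mean())                   # both are (approximately) 0.625
```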
Motivated by the preceding illustration, we present the formal definition of maximum likelihood estimators. Let \(X_{1},X_{2},\cdots ,X_{n}\) be a random sample from a distribution that depends on one or more unknown parameters \(\theta_{1},\theta_{2},\cdots ,\theta_{m}\) with p.d.f. denoted by \(f(x;\theta_{1},\cdots ,\theta_{m})\). Suppose that \((\theta_{1}, \theta_{2},\cdots ,\theta_{m})\) is restricted to a given parameter space \(\Omega\). Then, the joint p.d.f. of \(X_{1},X_{2},\cdots ,X_{n}\)
\[L(\theta_{1},\cdots ,\theta_{m})=\prod_{i=1}^{n}f(x_{i};\theta_{1},\cdots ,\theta_{m})\mbox{ for }(\theta_{1},\cdots ,\theta_{m})\in\Omega,\]
when regarded as a function of \(\theta_{1},\theta_{2},\cdots ,\theta_{m}\), is called the likelihood function. Let
\[(u_{1}(x_{1},\cdots ,x_{n}),\cdots ,u_{m}(x_{1},\cdots ,x_{n}))\]
be the \(m\)-tuple in \(\Omega\) that maximizes \(L(\theta_{1},\cdots ,\theta_{m})\). Then
\[\widehat{\theta}_{1}=u_{1}(X_{1},\cdots ,X_{n}),\widehat{\theta}_{2}=u_{2}(X_{1},\cdots ,X_{n}),\cdots ,\widehat{\theta}_{m}=u_{m}(X_{1},\cdots ,X_{n})\]
are maximum likelihood estimators of \(\theta_{1},\theta_{2},\cdots ,\theta_{m}\), respectively. The corresponding observed values of these statistics
\[\widehat{\theta}_{1}=u_{1}(x_{1},\cdots ,x_{n}),\widehat{\theta}_{2}=u_{2}(x_{1},\cdots ,x_{n}),\cdots ,\widehat{\theta}_{m}=u_{m}(x_{1},\cdots ,x_{n})\]
are called maximum likelihood estimates. In many practical cases, these estimators and estimates are unique.
Example. Let \(X_{1},X_{2},\cdots ,X_{n}\) be a random sample (i.i.d.) from the exponential distribution with p.d.f.
\[f(x;\theta )=\frac{1}{\theta}e^{-x/\theta}\]
for \(0<x<\infty\) and
\[\theta\in\Omega =\{\theta :0<\theta <\infty\}.\]
The likelihood function is given by
\begin{align*} L(\theta ) & =L(\theta ;x_{1},\cdots ,x_{n})\\
& =\left (\frac{1}{\theta}e^{-x_{1}/\theta}\right )\left (\frac{1}{\theta}e^{-x_{2}/\theta}\right )\cdots\left (\frac{1}{\theta}e^{-x_{n}/\theta}\right )\\
& =\frac{1}{\theta^{n}}\exp\left (\frac{-\sum_{i=1}^{n} x_{i}}{\theta}\right )\end{align*}
for \(0<\theta <\infty\). The natural logarithm of \(L(\theta )\) is given by
\[\ln L(\theta )=-n\ln\theta-\frac{1}{\theta}\sum_{i=1}^{n} x_{i}\]
for \(0<\theta <\infty\). Therefore, we obtain
\[\frac{d\ln L(\theta )}{d\theta}=\frac{-n}{\theta}+\frac{\sum_{i=1}^{n}x_{i}}{\theta^{2}}=0.\]
The solution of this equation for \(\theta\) is
\[\theta =\frac{1}{n}\sum_{i=1}^{n} x_{i}=\bar{x}.\]
We also have
\begin{align*} \frac{d\ln L(\theta )}{d\theta} & =\frac{1}{\theta}\left (-n+\frac{n\bar{x}}{\theta}\right )\left\{\begin{array}{ll} >0 & \mbox{if \(\theta <\bar{x}\)}\\ =0 & \mbox{if \(\theta =\bar{x}\)}\\ <0 & \mbox{if \(\theta >\bar{x}\).}\end{array}\right .\end{align*}
Therefore \(\ln L(\theta )\) does have a maximum at \(\bar{x}\), which says that the maximum likelihood estimator for \(\theta\) is
\[\widehat{\theta}=\bar{X}=\frac{1}{n}\sum_{i=1}^{n} X_{i}.\]
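A small simulation sketch (assuming NumPy; the true \(\theta\) below is an arbitrary choice) showing that \(\widehat{\theta}=\bar{X}\) recovers \(\theta\) for a large exponential sample:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 2.5
# NumPy's "scale" parameter is exactly theta in f(x; theta) = (1/theta) e^(-x/theta)
x = rng.exponential(scale=theta_true, size=100_000)

theta_hat = x.mean()     # the MLE derived above
print(theta_hat)         # close to 2.5
```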
Example. Let \(X_{1},X_{2},\cdots, X_{n}\) be a random sample (i.i.d.) from the geometric distribution with p.d.f. \(f(x;p)=(1-p)^{x-1}p\) for \(x=1,2,3,\cdots\). The likelihood function is given by
\begin{align*} L(p) & =(1-p)^{x_{1}-1}p\cdot (1-p)^{x_{2}-1}p\cdots (1-p)^{x_{n}-1}p\\ & =p^{n}(1-p)^{\sum x_{i}-n}\end{align*}
for \(0\leq p\leq 1\). The natural logarithm of \(L(p)\) is given by
\[\ln L(p)=n\ln p+\left (\sum_{i=1}^{n} x_{i}-n\right )\ln (1-p)\]
for \(0<p<1\). Restricting \(p\) to \(0<p<1\), we can take the derivative to give
\[\frac{d\ln L(p)}{dp}=\frac{n}{p}-\frac{\sum_{i=1}^{n}x_{i}-n}{1-p}=0.\]
Solving for \(p\), we obtain
\[p=\frac{n}{\sum_{i=1}^{n} x_{i}}=\frac{1}{\bar{x}}.\]
Therefore, the maximum likelihood estimator of \(p\) is given by
\[\widehat{p}=\frac{n}{\sum_{i=1}^{n} X_{i}}=\frac{1}{\bar{X}}.\]
This estimator agrees with our intuition since, in \(n\) observations of a geometric random variable, there are \(n\) successes in the \(\sum_{i=1}^{n} x_{i}\) trials. Therefore, the estimate of \(p\) is the number of successes divided by the total number of trials. \(\sharp\)
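Again for illustration only (assuming NumPy, with an arbitrary true \(p\)): NumPy's geometric sampler counts the number of trials up to and including the first success, which matches the p.d.f. \(f(x;p)=(1-p)^{x-1}p\), so \(\widehat{p}=1/\bar{X}\) should recover \(p\):

```python
import numpy as np

rng = np.random.default_rng(1)
p_true = 0.3
x = rng.geometric(p_true, size=100_000)   # support x = 1, 2, 3, ...

p_hat = 1.0 / x.mean()                    # the MLE n / sum(x_i)
print(p_hat)                              # close to 0.3
```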
Example. Let \(X_{1},X_{2},\cdots ,X_{n}\) be a random sample from \(N(\theta_{1},\theta_{2})\), where
\[\Omega =\{(\theta_{1},\theta_{2}):-\infty <\theta_{1}<\infty ,0<\theta_{2}<\infty\}.\]
Let \(\theta_{1}=\mu\) and \(\theta_{2}=\sigma^{2}\). Then, we have
\begin{align*} L(\theta_{1},\theta_{2}) & =\prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\theta_{2}}}\exp\left [-\frac{(x_{i}-\theta_{1})^{2}}{2\theta_{2}}\right ]\\ & =
\left (\frac{1}{\sqrt{2\pi\theta_{2}}}\right )^{n}\exp\left [\frac{-\sum_{i=1}^{n} (x_{i}-\theta_{1})^{2}}{2\theta_{2}}\right ].\end{align*}
The natural logarithm of the likelihood function is given by
\[\ln L(\theta_{1},\theta_{2})=-\frac{n}{2}\ln (2\pi\theta_{2})-\frac{\sum_{i=1}^{n} (x_{i}-\theta_{1})^{2}}{2\theta_{2}}.\]
The partial derivatives with respect to \(\theta_{1}\) and \(\theta_{2}\) are given by
\[\frac{\partial\ln L}{\partial\theta_{1}}=\frac{1}{\theta_{2}}\sum_{i=1}^{n} (x_{i}-\theta_{1})\]
and
\[\frac{\partial\ln L}{\partial\theta_{2}}=-\frac{n}{2\theta_{2}}+\frac{1}{2\theta_{2}^{2}}\sum_{i=1}^{n} (x_{i}-\theta_{1})^{2}.\]
The equation \(\partial\ln L/\partial\theta_{1}=0\) has the solution \(\theta_{1}=\bar{x}\). Setting \(\partial\ln L/\partial\theta_{2}=0\) and replacing \(\theta_{1}\) by \(\bar{x}\) yields
\[\theta_{2}=\frac{1}{n}\sum_{i=1}^{n} (x_{i}-\bar{x})^{2}.\]
By considering the second partial derivatives, one can verify that these solutions do provide a maximum. Therefore, the maximum likelihood estimators of \(\mu =\theta_{1}\) and \(\sigma^{2}=\theta_{2}\) are
\[\widehat{\theta}_{1}=\bar{X}\mbox{ and }\widehat{\theta}_{2}=\frac{1}{n}\sum_{i=1}^{n} (X_{i}-\bar{X})^{2}.\]
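A brief simulation sketch (assuming NumPy, with arbitrary true values) of the normal MLEs \(\widehat{\theta}_{1}=\bar{X}\) and \(\widehat{\theta}_{2}=\frac{1}{n}\sum_{i=1}^{n} (X_{i}-\bar{X})^{2}\):

```python
import numpy as np

rng = np.random.default_rng(2)
mu_true, sigma2_true = 1.0, 4.0
x = rng.normal(mu_true, np.sqrt(sigma2_true), size=100_000)

theta1_hat = x.mean()              # MLE of mu
theta2_hat = x.var()               # MLE of sigma^2 (divisor n, i.e. ddof=0)
print(theta1_hat, theta2_hat)      # close to 1.0 and 4.0
```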
Definition. If \(E[u(X_{1},\cdots ,X_{n})]=\theta\), the statistic \(u(X_{1},\cdots ,X_{n})\) is called an unbiased estimator of \(\theta\). Otherwise, it is said to be biased.
Example. We have shown that, when sampling from \(N(\theta_{1},\theta_{2})\), the maximum likelihood estimators of \(\theta_{1}\) and \(\theta_{2}\) are
\[\widehat{\theta}_{1}=\bar{X}\mbox{ and }\widehat{\theta}_{2}=\frac{(n-1)S^{2}}{n}.\]
Since the distribution of \(\bar{X}\) is \(N(\mu ,\sigma^{2}/n)\), it follows that \(\mathbb{E}(\bar{X})=\mu\). This says that \(\bar{X}\) is an unbiased estimator of \(\mu\). We have also shown that the distribution of \((n-1)S^{2}/\sigma^{2}\) is \(\chi^{2}(n-1)\). Therefore, we have
\begin{align*} \mathbb{E}(S^{2}) & =\mathbb{E}\left [\frac{\sigma^{2}}{n-1}\cdot\frac{(n-1)S^{2}}{\sigma^{2}}\right ]\\ & =\frac{\sigma^{2}}{n-1}\cdot (n-1)=\sigma^{2},\end{align*}
which says that the sample variance
\[S^{2}=\frac{1}{n-1}\sum_{i=1}^{n} (X_{i}-\bar{X})^{2}\]
is an unbiased estimator of \(\sigma^{2}\). Since
\[\mathbb{E}(\widehat{\theta}_{2})=\frac{n-1}{n}\mathbb{E}(S^{2})=\frac{n-1}{n}\sigma^{2},\]
This says that \(\widehat{\theta}_{2}\) is a biased estimator of \(\theta_{2}=\sigma^{2}\). \(\sharp\)
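The bias can also be seen empirically. The following sketch (assuming NumPy; the sample size and number of repetitions are arbitrary) averages both estimators over many small samples: \(S^{2}\) averages to about \(\sigma^{2}\), while \(\widehat{\theta}_{2}\) averages to about \((n-1)\sigma^{2}/n\):

```python
import numpy as np

rng = np.random.default_rng(3)
mu, sigma2, n, reps = 0.0, 1.0, 5, 200_000

samples = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
s2 = samples.var(axis=1, ddof=1)      # unbiased S^2, divisor n-1
theta2_hat = samples.var(axis=1)      # MLE, divisor n

print(s2.mean())           # close to sigma^2 = 1.0
print(theta2_hat.mean())   # close to (n-1)/n * sigma^2 = 0.8
```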
Sometimes it is impossible to find the maximum likelihood estimator in a convenient closed form, and numerical methods must be used to maximize the likelihood function. For illustration, suppose that \(X_{1},X_{2},\cdots ,X_{n}\) is a random sample from a gamma distribution with parameters \(\alpha =\theta_{1}\) and \(\beta =\theta_{2}\), where \(\theta_{1}>0\) and \(\theta_{2}>0\). It is difficult to maximize
\[L(\theta_{1},\theta_{2};x_{1},\cdots ,x_{n})=\left [\frac{1}{\Gamma (\theta_{1})\theta_{2}^{\theta_{1}}}\right ]^{n}
(x_{1}x_{2}\cdots x_{n})^{\theta_{1}-1}\exp\left (-\sum_{i=1}^{n}x_{i}/\theta_{2}\right )\]
with respect to \(\theta_{1}\) and \(\theta_{2}\), owing to the presence of the gamma function \(\Gamma (\theta_{1})\). Therefore, numerical methods must be used to maximize \(L\) once \(x_{1},x_{2},\cdots ,x_{n}\) are observed.
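For instance, a numerical maximization could proceed as in the following sketch (assuming NumPy and SciPy; the data are simulated with arbitrary true parameters, and Nelder-Mead is just one of several optimizers that could be used):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

rng = np.random.default_rng(4)
x = rng.gamma(shape=3.0, scale=2.0, size=5_000)   # theta1 = 3, theta2 = 2
n = len(x)

def neg_log_lik(params):
    t1, t2 = params            # theta1 (shape) and theta2 (scale)
    if t1 <= 0 or t2 <= 0:
        return np.inf
    # -ln L = n ln Gamma(t1) + n t1 ln t2 - (t1 - 1) sum(ln x_i) + sum(x_i)/t2
    return (n * gammaln(t1) + n * t1 * np.log(t2)
            - (t1 - 1) * np.log(x).sum() + x.sum() / t2)

res = minimize(neg_log_lik, x0=[1.0, 1.0], method="Nelder-Mead")
print(res.x)                   # close to (3.0, 2.0)
```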
\begin{equation}{\label{b}}\tag{B}\mbox{}\end{equation}
Method of Moments
There are other ways to obtain point estimates of \(\theta_{1}\) and \(\theta_{2}\); one of them is the so-called method of moments, described below.
Let \(X_{1},X_{2},\cdots ,X_{n}\) be a random sample of size \(n\) from a distribution with p.d.f. \(f(x;\theta_{1},\cdots ,\theta_{m})\), \((\theta_{1},\theta_{2},\cdots ,\theta_{m})\in\Omega\). The expectation \(\mathbb{E}(X^{k})\) is frequently called the \(k\)th moment of the distribution for \(k=1,2,3,\cdots\). The sum \(M_{k}=\sum_{i=1}^{n} X_{i}^{k}/n\) is the \(k\)th moment of the sample for \(k=1,2,3,\cdots\). The method of moments can be described as follows: set \(\mathbb{E}(X^{k})\) equal to \(M_{k}\), beginning with \(k=1\) and continuing until there are enough equations to provide unique solutions for \(\theta_{1},\theta_{2},\cdots ,\theta_{m}\), say \(h_{i}(M_{1},M_{2},\cdots )\) for \(i=1,2,\cdots, m\), respectively. It should be noted that this could be done in an equivalent manner by equating \(\mu =\mathbb{E}(X)\) to \(\bar{X}\) and \(\mathbb{E}[(X-\mu )^{k}]\) to \(\sum_{i=1}^{n} (X_{i}-\bar{X})^{k}/n\) for \(k=2,3\), and
so on until unique solutions for \(\theta_{1},\theta_{2},\cdots ,\theta_{m}\) are obtained. In most practical cases, the estimator \(\widehat{\theta_{i}}=h_{i}(M_{1},M_{2},\cdots )\) of \(\theta_{i}\), found by the method of moments, in some sense gets close to that parameter when \(n\) is large, for \(i=1,2,\cdots ,m\). For illustration, in the gamma distribution, let us equate
\[\theta_{1}\theta_{2}=\mathbb{E}(X)=\bar{X}\]
and
\[\theta_{1}\theta_{2}^{2}=\mathbb{E}[(X-\mu )^{2}]=\sum_{i=1}^{n} (X_{i}-\bar{X})^{2}/n.\]
Then, we solve for \(\theta_{1}\) and \(\theta_{2}\) to obtain
\[\widehat{\theta}_{1}=\frac{n\bar{X}^{2}}{(n-1)S^{2}}\mbox{ and }\widehat{\theta}_{2}=\frac{(n-1)S^{2}}{n\bar{X}},\]
which are the method of moment estimators of \(\theta_{1}\) and \(\theta_{2}\), respectively.
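These two estimators are easy to compute directly; a minimal sketch (assuming NumPy, with arbitrary true parameters):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.gamma(shape=3.0, scale=2.0, size=5_000)   # theta1 = 3, theta2 = 2
n = len(x)

xbar = x.mean()
v = ((x - xbar) ** 2).sum() / n    # (n-1) S^2 / n, the second central sample moment

theta1_hat = xbar ** 2 / v         # = n xbar^2 / ((n-1) S^2)
theta2_hat = v / xbar              # = (n-1) S^2 / (n xbar)
print(theta1_hat, theta2_hat)      # close to (3.0, 2.0)
```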
Example. Let \(X_{1},X_{2},\cdots ,X_{n}\) be a random sample of size \(n\) from the distribution with p.d.f.
\[f(x;\theta )=\theta x^{\theta -1}\]
for \(0<x<1\) and \(0<\theta <\infty\). The expectation of \(X\) is
\[\mathbb{E}(X)=\int_{0}^{1} x\theta x^{\theta -1}dx=\frac{\theta}{\theta +1}.\]
We set the population mean equal to the sample mean and solve for \(\theta\). Therefore, we have
\[\bar{x}=\frac{\theta}{\theta +1}.\]
Solving for \(\theta\), we obtain the method of moments estimator
\[\widehat{\theta}=\frac{\bar{X}}{1-\bar{X}}.\]
Therefore, an estimate of \(\theta\) by the method of moments is \(\bar{x}/(1-\bar{x})\). \(\sharp\)
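Since \(f(x;\theta )=\theta x^{\theta -1}\) on \((0,1)\) is the Beta\((\theta ,1)\) density, the estimator can be checked by simulation (a sketch assuming NumPy, with an arbitrary true \(\theta\)):

```python
import numpy as np

rng = np.random.default_rng(6)
theta_true = 4.0
x = rng.beta(theta_true, 1.0, size=100_000)   # density theta * x^(theta - 1) on (0, 1)

xbar = x.mean()
theta_hat = xbar / (1.0 - xbar)               # method of moments estimate
print(theta_hat)                              # close to 4.0
```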
Example. Let the distribution of \(X\) be \(N(\mu ,\sigma^{2})\). Then
\[\mathbb{E}(X)=\mu\mbox{ and }\mathbb{E}(X^{2})=\sigma^{2}+\mu^{2}.\]
Given a random sample of size \(n\), the first two sample moments are given by
\[m_{1}=\frac{1}{n}\sum_{i=1}^{n} x_{i}\mbox{ and }m_{2}=\frac{1}{n}\sum_{i=1}^{n} x_{i}^{2}.\]
We set \(m_{1}=\mathbb{E}(X)\) and \(m_{2}=\mathbb{E}(X^{2})\). Solving for \(\mu\) and \(\sigma^{2}\), we obtain
\[\frac{1}{n}\sum_{i=1}^{n} x_{i}=\mu\]
and
\[\frac{1}{n}\sum_{i=1}^{n} x_{i}^{2}=\sigma^{2}+\mu^{2}.\]
The first moment equation yields \(\bar{x}\) as the estimate of \(\mu\). Replacing \(\mu^{2}\) with \(\bar{x}^{2}\) in the second equation and solving for \(\sigma^{2}\), we obtain
\[\frac{1}{n}\sum_{i=1}^{n} x_{i}^{2}-\bar{x}^{2}=\sum_{i=1}^{n}\frac{(x_{i}-\bar{x})^{2}}{n}\]
as the solution for \(\sigma^{2}\). Therefore, the method of moments estimators for \(\mu\) and \(\sigma^{2}\) are \(\widehat{\mu}=\bar{X}\) and \(\widehat{\sigma}^{2}=(n-1)S^{2}/n\). \(\sharp\)
\begin{equation}{\label{c}}\tag{C}\mbox{}\end{equation}
Sufficient Statistics and Unbiased Point Estimators.
Let \(X_{1},\cdots ,X_{n}\) be a random sample of size \(n\) from \(f(\cdot ;\theta)\), i.e., \(n\) independent random variables distributed as \(X\). For \(j=1,\cdots ,m\), let \(T_{j}\) be a (measurable) function defined on \(\mathbb{R}^{n}\) into \(\mathbb{R}\) and not depending on \(\theta\) or any other unknown quantities. Set \({\bf T}=(T_{1},\cdots ,T_{m})\). Then
\[{\bf T}(X_{1},\cdots ,X_{n})=(T_{1}(X_{1},\cdots ,X_{n}),\cdots ,T_{m}(X_{1},\cdots ,X_{n}))\]
is called an \(m\)-dimensional statistic.
Definition. Let \(X_{j}\), \(j=1,\cdots ,n\) be i.i.d. random variables with p.d.f. \(f(\cdot ;\theta)\), where \(\theta=(\theta_{1},\cdots ,\theta_{r})\in\Omega\subseteq \mathbb{R}^{r}\), and let \({\bf T}=(T_{1},\cdots ,T_{m})\), where
\[T_{j}=T_{j}(X_{1},\cdots ,X_{n}), j=1,\cdots ,m\]
are statistics. We say that \({\bf T}\) is an \(m\)-dimensional sufficient statistic for the family
\[{\cal F}=\{f(\cdot ;\theta): \theta\in\Omega\},\]
or for the parameter \(\theta\), if the conditional distribution of \((X_{1},\cdots ,X_{n})\), given \({\bf T}={\bf t}\), is independent of \(\theta\) for all values of \({\bf t}\) (actually, for almost all \({\bf t}\), that is, except perhaps for a set \(N\in {\cal B}^{m}\) of values of \({\bf t}\) satisfying \(\mathbb{P}_{\theta}({\bf T}\in N)=0\) for all \(\theta\in\Omega\), where \(\mathbb{P}_{\theta}\) denotes the probability measure associated with the p.d.f. \(f(\cdot ; \theta)\)). \(\sharp\)
Theorem. (Fisher-Neyman Factorization Theorem) Let \(X_{1},\cdots ,X_{n}\) be i.i.d. random variables with p.d.f. \(f(\cdot ;\theta)\), where \(\theta=(\theta_{1},\cdots ,\theta_{r})\in\Omega \subseteq\mathbb{R}^{r}\). An \(m\)-dimensional statistic
\[{\bf T}={\bf T}(X_{1},\cdots ,X_{n})=(T_{1}(X_{1},\cdots ,X_{n}),\cdots ,T_{m}(X_{1},\cdots ,X_{n}))\]
is sufficient for \(\theta\) if and only if the joint p.d.f. of \(X_{1},\cdots ,X_{n}\) factors as follows
\[f(x_{1},\cdots ,x_{n};\theta)=g[{\bf T}(x_{1},\cdots ,x_{n});\theta]h(x_{1},\cdots ,x_{n}),\]
where \(g\) depends on \(x_{1},\cdots ,x_{n}\) only through \({\bf T}\) and \(h\) is (entirely) independent of \(\theta\). \(\sharp\)
\begin{equation}{\label{t8}}\tag{1}\mbox{}\end{equation}
Theorem \ref{t8}. Let \(\phi :\mathbb{R}^{m}\rightarrow \mathbb{R}^{m}\) (measurable and independent of \(\theta\)) be one-to-one, so that the inverse \(\phi^{-1}\) exists. Then, if \({\bf T}\) is sufficient for \(\theta\), we have that \(\tilde{{\bf T}}=\phi({\bf T})\) is also sufficient for \(\theta\) and \({\bf T}\) is sufficient for \(\tilde{\theta}=\psi(\theta)\), where \(\psi :\mathbb{R}^{r}\rightarrow\mathbb{R}^{r}\) is one-to-one (and measurable). \(\sharp\)
Example. Let \(X_{1},\cdots ,X_{n}\) be i.i.d. random variables from \(N(\mu ,\sigma^{2})\). By setting \({\bf x}=(x_{1},\cdots ,x_{n})\), \(\mu =\theta_{1}\), \(\sigma^{2}=\theta_{2}\) and \(\theta=(\theta_{1},\theta_{2})\), we have
\[f({\bf x};\theta)=\left (\frac{1}{\sqrt{2\pi\theta_{2}}}\right )^{n}\exp\left [-\frac{1}{2\theta_{2}}\sum_{j=1}^{n} (x_{j}-\theta_{1})^{2}\right ].\]
Since
\begin{align*} \sum_{j=1}^{n} (x_{j}-\theta_{1})^{2} & =\sum_{j=1}^{n} [(x_{j}-\bar{x})+(\bar{x}-\theta_{1})]^{2}\\ & =\sum_{j=1}^{n} (x_{j}-\bar{x})^{2}+n(\bar{x}-\theta_{1})^{2},\end{align*}
we have
\[f({\bf x};\theta)=\left (\frac{1}{\sqrt{2\pi\theta_{2}}}\right )^{n}\exp\left [-\frac{1}{2\theta_{2}}\sum_{j=1}^{n} (x_{j}-\bar{x})^{2}-\frac{n}{2\theta_{2}}(\bar{x}-\theta_{1})^{2}\right ].\]
It follows that
\[\left (\bar{X},\sum_{j=1}^{n} (X_{j}-\bar{X})^{2}\right )\]
is sufficient for \(\theta\). Since
\[f({\bf x};\theta)=\left (\frac{1}{\sqrt{2\pi\theta_{2}}}\right )^{n}\exp\left (-\frac{n\theta_{1}^{2}}{2\theta_{2}}\right )\exp\left (\frac{\theta_{1}}{\theta_{2}}\sum_{j=1}^{n} x_{j}-\frac{1}{2\theta_{2}}\sum_{j=1}^{n} x_{j}^{2}\right ),\]
it follows that, if \(\theta_{2}=\sigma^{2}\) is known and \(\theta_{1}= \theta\), then \(\sum_{j=1}^{n} X_{j}\) is sufficient for \(\theta\), whereas if \(\theta_{1}=\mu\) is known and \(\theta_{2}=\theta\), then \(\sum_{j=1}^{n} (X_{j}-\mu )^{2}\) is sufficient for \(\theta\), as follows from the form of \(f({\bf x};\theta)\) at the beginning of this example. By Theorem \ref{t8}, it also follows that \((\bar{X},S^{2})\) is sufficient for \(\theta\), where
\[S^{2}=\frac{1}{n}\sum_{j=1}^{n} (X_{j}-\bar{X})^{2}\]
and
\[\frac{1}{n}\sum_{j=1}^{n} (X_{j}-\mu )^{2}\]
is sufficient for \(\theta_{2}=\theta\) if \(\theta_{1}=\mu\) is known. \(\sharp\)
Example. Let \(X_{1},\cdots ,X_{n}\) be i.i.d. random variables from \(U(\theta_{1}, \theta_{2})\). Then, by setting \({\bf x}=(x_{1},\cdots,x_{n})\) and \(\theta=(\theta_{1},\theta_{2})\), we get
\begin{align*} f({\bf x};\theta) & =\frac{1}{(\theta_{2}-\theta_{1})^{n}}I_{[\theta_{1},\infty )}(x_{(1)})I_{(-\infty ,\theta_{2}]}(x_{(n)})\\ & =
\frac{1}{(\theta_{2}-\theta_{1})^{n}}g_{1}[x_{(1)},\theta]g_{2}[x_{(n)},\theta],\end{align*}
where \(g_{1}[x_{(1)},\theta]=I_{[\theta_{1},\infty )} (x_{(1)})\), \(g_{2}[x_{(n)},\theta]= I_{(-\infty,\theta_{2}]}(x_{(n)})\). It follows that \((X_{(1)},X_{(n)})\) is sufficient for \(\theta\). In particular, if \(\theta_{1}=\alpha\) is known and \(\theta_{2}=\theta\), it follows that \(X_{(n)}\) is sufficient for \(\theta\). Similarly, if \(\theta_{2}=\beta\) is known and \(\theta_{1}=\theta\), \(X_{(1)}\) is sufficient for \(\theta\). \(\sharp\)
Let \({\bf X}\) be a \(k\)-dimensional random vector with p.d.f. \(f(\cdot ;\theta)\), where \(\theta\in\Omega \subseteq\mathbb{R}^{r}\), and let \(g:\mathbb{R}^{k}\rightarrow\mathbb{R}\) be measurable such that \(g({\bf X})\) is a random variable. Assume that \(\mathbb{E}_{\theta}g({\bf X})\) exists for all \(\theta\in\Omega\) and set
\[{\cal F}=\{f(\cdot ;\theta);\theta\in\Omega\}.\]
Definition. With the above notation, we say that the family \({\cal F}\) is complete when, for every \(g\) as above, \(\mathbb{E}_{\theta}g({\bf X})=0\) for all \(\theta\in\Omega\) implies \(g({\bf x})=0\) except possibly on a set \(N\) of \({\bf x}\)’s such that \(\mathbb{P}_{\theta}({\bf X}\in N)=0\) for all \(\theta\in \Omega\). \(\sharp\)
Example. Let \(X_{1},\cdots ,X_{n}\) be i.i.d. random variables from \(N(\mu ,\sigma^{2})\). If \(\sigma\) is known and \(\mu =\theta\), it can be shown that
\[{\cal F}=\left\{f(\cdot ;\theta ):f(x;\theta )=\frac{1}{\sigma\sqrt{2\pi}}\exp\left [-\frac{(x-\theta )^{2}}{2\sigma^{2}}\right ],\theta\in\mathbb{R}\right\}\]
is complete. If \(\mu\) is unknown and \(\sigma^{2}=\theta\), then
\[{\cal F}=\left\{f(\cdot ;\theta ):f(x;\theta )=\frac{1}{\theta\sqrt{2\pi}}\exp\left [-\frac{(x-\mu )^{2}}{2\theta}\right ],\theta\in (0,\infty )\right\}\]
is not complete. In fact, let \(g(x)=x-\mu\). Then
\[\mathbb{E}_{\theta}g({\bf X})=\mathbb{E}_{\theta}(X-\mu )=0\]
for all \(\theta\in (0,\infty )\), while \(g(x)=0\) only for \(x=\mu\). Finally, if both \(\mu\) and \(\sigma^{2}\) are unknown, it can be shown that \((\bar{X},S^{2})\) is complete. \(\sharp\).
Definition. Let \(X_{1},\cdots ,X_{n}\) be i.i.d. random variables with p.d.f. \(f(\cdot ;\theta )\), where \(\theta\in\Omega\subseteq\mathbb{R}\), and let \(U=U(X_{1},\cdots ,X_{n})\) be a statistic. Then, we say that \(U\) is an unbiased statistic for \(\theta\) when \(\mathbb{E}_{\theta}U=\theta\) for every \(\theta\in\Omega\). \(\sharp\)
Theorem. (Rao-Blackwell). Let \(X_{1},\cdots X_{n}\) be i.i.d. random variables with p.d.f. \(f(\cdot ;\theta )\), where \(\theta\in\Omega\subseteq\mathbb{R}\), and let \({\bf T}=(T_{1},\cdots ,T_{m})\), \(T_{j}=T_{j}(X_{1},\cdots ,X_{n})\) for \(j=1,\cdots ,m\) be a sufficient statistic for \(\theta\). Let \(U=U(X_{1},\cdots ,X_{n})\) be an unbiased statistic for \(\theta\) which is not a function of \({\bf T}\) alone (with probability one). Set \(\phi ({\bf t})=\mathbb{E}_{\theta} (U|{\bf T}={\bf t})\). Then, we have the following properties.
(i) The random variable \(\phi ({\bf T})\) is a function of the sufficient statistic \({\bf T}\) alone.
(ii) \(\phi ({\bf T})\) is an unbiased statistic for \(\theta\).
(iii) We have \(\sigma_{\theta}^{2}[\phi ({\bf T})]<\sigma_{\theta}^{2}(U)\) for \(\theta\in\Omega\), provided \(\mathbb{E}_{\theta} U^{2}<\infty\). \(\sharp\)
Theorem. (Uniqueness Theorem). Let \(X_{1},\cdots ,X_{n}\) be i.i.d. random variables with p.d.f. \(f(\cdot ;\theta )\), where \(\theta\in\Omega\subseteq\mathbb{R}\), and let
\[{\cal F}=\{f(\cdot ;\theta ):\theta\in\Omega\}.\]
Let \({\bf T}=(T_{1},\cdots ,T_{m})\), \(T_{j}=T_{j}(X_{1},\cdots ,X_{n})\) for \(j=1, \cdots ,m\) be a sufficient statistic for \(\theta\), and let \(g(\cdot ; \theta )\) be its p.d.f. Set
\[{\cal G}=\{g(\cdot;\theta ):\theta\in\Omega\}.\]
Assume that \({\cal G}\) is complete. Let \(U=U({\bf T})\) be an unbiased statistic for \(\theta\), and assume that \(\mathbb{E}_{\theta}U^{2}<\infty\) for all \(\theta\in\Omega\). Then \(U\) is the unique unbiased statistic for \(\theta\) with the smallest variance in the class of all unbiased statistics for \(\theta\), in the sense that, if \(V=V({\bf T})\) is another unbiased statistic for \(\theta\), then \(U({\bf t})=V({\bf t})\) except perhaps on a set \(N\) of \({\bf t}\)'s such that \(\mathbb{P}_{\theta}({\bf T}\in N)=0\) for all \(\theta\in \Omega\). \(\sharp\)
Any statistic \(U=U(X_{1},\cdots ,X_{n})\), which is used for estimating the unknown quantity \(g(\theta)\), is called an estimator of \(g(\theta)\). The value \(U(x_{1},\cdots ,x_{n})\) of \(U\) for the observed values of the \(X\)’s is called an estimate. The estimator of \(g(\theta)\) is called an unbiased estimator of \(g(\theta)\) when \(\mathbb{E}_{\theta}U(X_{1},\cdots,X_{n})=g(\theta)\) for all \(\theta\in\Omega\). Also, \(g\) is said to be estimable if it has an unbiased estimator.
Definition. Let \(g\) be estimable. An estimator \(U=U(X_{1},\cdots ,X_{n})\) is said to be a uniformly minimum variance unbiased (UMVU) estimator of \(g(\theta)\) when it is unbiased and has the smallest variance within the class of all unbiased estimators of \(g(\theta)\), for all \(\theta\in \Omega\). That is, if \(U_{1}=U_{1}(X_{1},\cdots ,X_{n})\) is any other unbiased estimator of \(g(\theta)\), then \(\sigma^{2}_{\theta}U_{1}\geq\sigma^{2}_{\theta}U\) for all \(\theta\in\Omega\). \(\sharp\)
\begin{equation}{\label{t7}}\tag{2}\mbox{}\end{equation}
Theorem \ref{t7}. Assume that there exists an unbiased estimator \(U=U(X_{1},\cdots ,X_{n})\) of \(g(\theta)\) with finite variance. Furthermore, let \({\bf T}=(T_{1},\cdots ,T_{r})\), \(T_{j}=T_{j}(X_{1},\cdots ,X_{n})\) for \(j=1,\cdots ,r\) be a sufficient statistic for \(\theta\), and suppose that it is also complete. Set \(\phi ({\bf T})=\mathbb{E}_{\theta}(U|{\bf T})\). Then \(\phi ({\bf T})\) is a UMVU estimator of \(g(\theta)\) and is essentially unique. \(\sharp\)
Example. Let \(X_{1},\cdots ,X_{n}\) be i.i.d. random variables from \(B(1,p)\). We want to find a UMVU estimator of the variance of the \(X\)’s. The variance of the \(X\)’s is equal to \(pq\). Therefore, if we set \(p=\theta\), \(\theta\in (0,1)\) and \(g(\theta)=\theta (1-\theta )\), the problem is that of finding a UMVU estimator of \(g(\theta )\). We know that, if
\[U=\frac{1}{n-1}\sum_{j=1}^{n} (X_{j}-\bar{X})^{2},\]
then \(\mathbb{E}_{\theta}[U]=g(\theta )\). Thus \(U\) is an unbiased estimator of \(g(\theta )\). Furthermore, we have
\begin{align*} \sum_{j=1}^{n} (X_{j}-\bar{X})^{2} & =\sum_{j=1}^{n} X_{j}^{2}-n\bar{X}^{2}\\ & =\sum_{j=1}^{n} X_{j}-n\left (\frac{1}{n}\sum_{j=1}^{n} X_{j}\right )^{2}\end{align*}
since \(X_{j}\) takes on the values \(0\) and \(1\) only and hence \(X_{j}^{2}=X_{j}\). By setting \(T=\sum_{j=1}^{n} X_{j}\), we have
\[\sum_{j=1}^{n} (X_{j}-\bar{X})^{2}=T-\frac{T^{2}}{n},\]
so that
\[U=\frac{1}{n-1}\left (T-\frac{T^{2}}{n}\right ).\]
Since \(T\) is a complete and sufficient statistic for \(\theta\), it follows that \(U\) is a UMVU estimator of the variance of the \(X\)’s according to Theorem \ref{t7}. Note that \(\mathbb{E}[Yg(X)|X]=g(X)\mathbb{E}[Y|X]\), i.e., \(\mathbb{E}[g(X)|X]=g(X)\). \(\sharp\)
\begin{equation}{\label{e1}}\tag{3}\mbox{}\end{equation}
Example \ref{e1}. Let \(X_{1},\cdots ,X_{n}\) be i.i.d. random variables from \(N(\mu ,\sigma^{2})\) with both \(\mu\) and \(\sigma^{2}\) unknown. We have \(\theta=(\mu ,\sigma^{2})\). Let \(g_{1}(\theta )=\mu\) and \(g_{2}(\theta )=\sigma^{2}\). By setting
\[S^{2}=\frac{1}{n}\sum_{j=1}^{n} (X_{j}-\bar{X})^{2},\]
we have that \((\bar{X},S^{2})\) is a sufficient statistic for \(\theta\) and is also complete. Let \(U_{1}=\bar{X}\) and \(U_{2}=nS^{2}/(n-1)\). Clearly, we have \(\mathbb{E}_{\theta}U_{1}=\mu\). Since \(nS^{2}/\sigma^{2}\) is \(\chi_{n-1}^{2}\), we obtain
\[\mathbb{E}_{\theta}\left (\frac{nS^{2}}{\sigma^{2}}\right )=n-1,\]
which implies
\[\mathbb{E}_{\theta}\left (\frac{nS^{2}}{n-1}\right )=\sigma^{2}.\]
Therefore \(U_{1}\) and \(U_{2}\) are unbiased estimators of \(\mu\) and \(\sigma^{2}\), respectively. Since they depend only on the complete and sufficient statistic \((\bar{X},S^{2})\), it follows that they are UMVU estimators. \(\sharp\)
Regularity conditions: Let \(X\) be a random variable with p.d.f. \(f(\cdot ; \theta )\) for \(\theta\in\Omega\subseteq\mathbb{R}\). Then, it is assumed that the following conditions are satisfied.
- \(f(x;\theta )\) is positive on a set \(S\) independent of \(\theta\in \Omega\).
- \(\Omega\) is an open interval in \(\mathbb{R}\).
- \({\displaystyle \frac{\partial}{\partial\theta} f(x;\theta )}\) exists for all \(\theta\in\Omega\) and all \(x\in S\) except possibly on a set \(N \subset S\) which is independent of \(\theta\) and such that \(\mathbb{P}_{\theta}(X\in N)=0\) for all \(\theta\in\Omega\).
- \({\displaystyle \int_{S}\cdots\int_{S} f(x_{1};\theta )\cdots f(x_{n};\theta )dx_{1}\cdots dx_{n}}\) or \({\displaystyle \sum_{S}\cdots \sum_{S} f(x_{1};\theta )\cdots f(x_{n};\theta )}\) may be differentiated under the integral or summation sign, respectively.
- \({\displaystyle I(\theta )\equiv\mathbb{E}_{\theta}\left [\frac{\partial}{\partial\theta}\ln f(X; \theta )\right ]^{2}}>0\) for all \(\theta\in \Omega\).
- \({\displaystyle \int_{S}\cdots\int_{S} U(x_{1},\cdots ,x_{n})f(x_{1};\theta )\cdots f(x_{n};\theta )dx_{1}\cdots dx_{n}}\) or \({\displaystyle \sum_{S}\cdots \sum_{S} U(x_{1},\cdots,x_{n})f(x_{1};\theta )\cdots f(x_{n};\theta )}\) may be differentiated under the integral or summation sign, respectively, where \(U(X_{1},\cdots, X_{n})\) is any unbiased estimator of \(g(\theta)\).
Theorem. (Cramer-Rao Inequality). Let \(X_{1},\cdots ,X_{n}\) be i.i.d. random variables with p.d.f. \(f(\cdot ;\theta )\). Assume that the regularity conditions are fulfilled. Then, for any unbiased estimator \(U=U(X_{1},\cdots ,X_{n})\) of \(g(\theta )\), we have
\[\sigma_{\theta}^{2} U\geq\frac{[g'(\theta )]^{2}}{nI(\theta )},\theta\in\Omega,\]
where
\[g'(\theta )=\frac{dg(\theta )}{d\theta}.\]
Example. Let \(X_{1},\cdots ,X_{n}\) be i.i.d. random variables from \(N(\mu , \sigma^{2})\). Assume that \(\sigma^{2}\) is known and set \(\mu =\theta\). Then, we have
\[f(x;\theta )=\frac{1}{\sigma\sqrt{2\pi}}\exp\left [-\frac{(x-\theta )^{2}}{2\sigma^{2}}\right ]\]
and hence
\[\ln f(x;\theta )=\ln\left (\frac{1}{\sigma\sqrt{2\pi}}\right )-\frac{(x-\theta )^{2}}{2\sigma^{2}}.\]
Since
\[\frac{\partial}{\partial\theta}\ln f(x;\theta )=\frac{1}{\sigma}\frac{x-\theta}{\sigma},\]
we also have
\[\left [\frac{\partial}{\partial\theta}\ln f(x;\theta )\right ]^{2}=\frac{1}{\sigma^{2}}\left (\frac{x-\theta}{\sigma}\right )^{2}.\]
It follows
\[\mathbb{E}_{\theta}\left [\frac{\partial}{\partial\theta}\ln f(X;\theta )\right ]^{2}=\frac{1}{\sigma^{2}},\]
since \((X-\theta )/\sigma\) is \(N(0,1)\) and hence \([(X-\theta)/\sigma]^{2}\) has a \(\chi_{1}^{2}\) distribution. This says
\[\mathbb{E}_{\theta}\left (\frac{X-\theta}{\sigma}\right )^{2}=1.\]
Therefore, the Cramer-Rao bound is \(\sigma^{2}/n\), which says that \(\bar{X}\) is a UMVU estimator since \(Var(\bar{X})=\sigma^{2}/n\). Suppose now that \(\mu\) is known and set \(\sigma^{2}=\theta\). Then, we have
\[f(x;\theta )=\frac{1}{\sqrt{2\pi\theta}}\exp\left [-\frac{(x-\mu )^{2}}{2\theta}\right ]\]
so that
\[\ln f(x;\theta )=\ln\left (\frac{1}{\sqrt{2\pi\theta}}\right )-\frac{(x-\mu )^{2}}{2\theta}\]
and
\[\frac{\partial}{\partial\theta}\ln f(x;\theta )=-\frac{1}{2\theta}+\frac{(x-\mu )^{2}}{2\theta^{2}}.\]
Then, we also have
\[\left [\frac{\partial}{\partial\theta}\ln f(x;\theta )\right ]^{2}=\frac{1}{4\theta^{2}}-\frac{1}{2\theta^{2}}\left (\frac{x-\mu }{\sqrt
{\theta}}\right )^{2}+\frac{1}{4\theta^{2}}\left (\frac{x-\mu}{\sqrt{\theta}}\right )^{4}\]
Since \((X-\mu )/\sqrt{\theta}\) is \(N(0,1)\), we obtain
\[\mathbb{E}_{\theta}\left (\frac{X-\mu}{\sqrt{\theta}}\right )^{2}=1\mbox{ and }\mathbb{E}_{\theta}\left (\frac{X-\mu}{\sqrt{\theta}}\right )^{4}=3.\]
Let \(X\) be \(N(0,1)\). Then \(\mathbb{E}(X^{2n+1})=0\) and \(\mathbb{E}(X^{2n})=(2n)!/2^{n}n!\). Therefore, we obtain
\[\mathbb{E}_{\theta}\left [\frac{\partial}{\partial\theta}\ln f(X;\theta )\right ]^{2}=\frac{1}{2\theta^{2}}\]
and the Cramer-Rao bound is \(2\theta^{2}/n\). Next, since
\[\sum_{j=1}^{n} \left (\frac{X_{j}-\mu}{\sqrt{\theta}}\right )^{2}\]
is \(\chi_{n}^{2}\), it follows
\[\mathbb{E}_{\theta}\left [\sum_{j=1}^{n} \left (\frac{X_{j}-\mu}{\sqrt{\theta}}\right )^{2}\right ]=n\]
and
\[\sigma_{\theta}^{2}\left [\sum_{j=1}^{n} \left (\frac{X_{j}-\mu}{\sqrt{\theta}}\right )^{2}\right ]=2n.\]
Therefore \((1/n)\sum_{j=1}^{n} (X_{j}-\mu )^{2}\) is an unbiased estimator of \(\theta\). Now, since
\begin{align*} Var\left [\frac{1}{n}\sum_{j=1}^{n} (X_{j}-\mu )^{2}\right ] & =
\frac{1}{n^{2}}Var\left [\sum_{j=1}^{n} (X_{j}-\mu )^{2}\right ]\\ & =
\frac{1}{n^{2}}\cdot 2n\theta^{2}\\ & =\frac{2\theta^{2}}{n}\end{align*}
equals the Cramer-Rao bound, it follows that \((1/n)\sum_{j=1}^{n} (X_{j}-\mu)^{2}\) is a UMVU estimator of \(\theta\).
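Both variance statements can be checked by simulation (a sketch assuming NumPy; the values of \(\mu\), \(\sigma^{2}\), \(n\) and the number of repetitions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
mu, sigma2, n, reps = 0.0, 4.0, 10, 200_000

samples = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))

# sigma^2 known: Var(X-bar) attains the Cramer-Rao bound sigma^2 / n
print(samples.mean(axis=1).var())             # close to 4.0 / 10 = 0.4

# mu known, theta = sigma^2: (1/n) sum (X_j - mu)^2 attains the bound 2 theta^2 / n
est = ((samples - mu) ** 2).mean(axis=1)
print(est.var())                              # close to 2 * 16 / 10 = 3.2
```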
Now, we assume that both \(\mu\) and \(\sigma^{2}\) are unknown and set \(\mu =\theta_{1}\), \(\sigma^{2}=\theta_{2}\). It has been seen in Example \ref{e1} that
\[\frac{1}{n-1}\sum_{j=1}^{n} (X_{j}-\bar{X})^{2}\]
is a UMVU estimator of \(\theta_{2}\). Since
\[\sum_{j=1}^{n} \left (\frac{X_{j}-\bar{X}}{\sqrt{\theta_{2}}}\right )^{2}\]
is \(\chi_{n-1}^{2}\), it follows that
\[\sigma_{\theta}^{2}\left [\frac{1}{n-1}\sum_{j=1}^{n} (X_{j}-\bar{X})^{2}\right ]=\frac{2\theta_{2}^{2}}{n-1}>\frac{2\theta_{2}^{2}}{n},\]
where \(2\theta_{2}^{2}/n\) is the Cramer-Rao bound. This is an example of a case where a UMVU estimator exists but its variance is larger than the Cramer-Rao bound; that is, the bound is not attained. \(\sharp\)
Let \(X_{1},\cdots ,X_{n}\) be i.i.d. random variables with p.d.f. \(f(\cdot ;\theta)\) for \(\theta\in\Omega \subseteq\mathbb{R}^{r}\). Consider the joint p.d.f. of the \(X\)’s, \(f(x_{1};\theta)\cdots f(x_{n};\theta)\). Treating the \(x\)’s as if they were constants and looking at this joint p.d.f. as a function of \(\theta\), we denote it by \(L(\theta|x_{1},\cdots ,x_{n})\) and call it the likelihood function.
Definition. The estimate \(\hat{\theta}=\hat{\theta}(x_{1},\cdots ,x_{n})\) is called a maximum likelihood estimate (MLE) of \(\theta\) if
\[L(\hat{\theta}|x_{1},\cdots ,x_{n})=\max [L(\theta|x_{1},\cdots ,x_{n});\theta\in\Omega ].\]
The statistic \(\hat{\theta}(X_{1},\cdots ,X_{n})\) is called a maximum likelihood estimator (MLE) of \(\theta\). \(\sharp\)
Since the function \(y=\ln x\) for \(x>0\) is strictly increasing, in order to maximize \(L(\theta|x_{1},\cdots ,x_{n})\) with respect to \(\theta\), it suffices to maximize \(\ln L(\theta|x_{1},\cdots ,x_{n})\). \(\sharp\)
Example. Let \(X_{1},\cdots ,X_{r}\) be multinomially distributed random variables with parameter \(\theta=(p_{1},\cdots,p_{r})\in\Omega\), where \(\Omega\) is the \((r-1)\)-dimensional hyperplane in \(\mathbb{R}^{r}\) defined by
\[\Omega =\left\{\theta=(p_{1},\cdots ,p_{r})\in\mathbb{R}^{r}:p_{j}>0, j=1,\cdots ,r\mbox{ and }\sum_{j=1}^{r} p_{j}=1\right\}.\]
Then, we have
\begin{align*} L(\theta|x_{1},\cdots ,x_{r}) & =\frac{n!}{\prod_{j=1}^{r}x_{j}!} p_{1}^{x_{1}}\cdots p_{r}^{x_{r}}\\
& =\frac{n!}{\prod_{j=1}^{r}x_{j}!} p_{1}^{x_{1}}\cdots p_{r-1}^{x_{r-1}}(1-p_{1}-\cdots -p_{r-1})^{x_{r}},\end{align*}
where \(n=\sum_{j=1}^{r} x_{j}\). We also have
\[\ln L(\theta|x_{1},\cdots ,x_{r})=\ln\frac{n!}{\prod_{j=1}^{r} x_{j}!}+x_{1}\ln p_{1}+\cdots+x_{r-1}\ln p_{r-1}+x_{r}\ln (1-p_{1}-\cdots -p_{r-1}).\]
Differentiating with respect to \(p_{j}\) for \(j=1,\cdots r-1\) and equating the resulting expressions to zero, we get
\[x_{j}\frac{1}{p_{j}}-x_{r}\frac{1}{p_{r}}=0,\]
which is equivalent to
\[\frac{x_{j}}{p_{j}}=\frac{x_{r}}{p_{r}}\]
for \(j=1,\cdots ,r-1\). Therefore, we obtain
\[\frac{x_{1}}{p_{1}}=\cdots =\frac{x_{r-1}}{p_{r-1}}=\frac{x_{r}}{p_{r}}.\]
This common value is equal to
\[\frac{x_{1}+\cdots +x_{r-1}+x_{r}}{p_{1}+\cdots +p_{r-1}+p_{r}}=\frac{n}{1}.\]
Hence \(x_{j}/p_{j}=n\) and \(p_{j}=x_{j}/n\) for \(j=1,\cdots ,r\). It can be seen that these values of the \(p\)’s actually maximize the likelihood function, and therefore \(\hat{p}_{j}=x_{j}/n\) for \(j=1,\cdots ,r\) are the MLE’s of the \(p\)’s. \(\sharp\)
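A one-draw simulation sketch (assuming NumPy, with arbitrary cell probabilities) of the MLE \(\hat{p}_{j}=x_{j}/n\):

```python
import numpy as np

rng = np.random.default_rng(8)
p_true = np.array([0.2, 0.3, 0.5])
n = 10_000

x = rng.multinomial(n, p_true)   # one multinomial observation: counts x_1, ..., x_r
p_hat = x / n                    # MLE derived above
print(p_hat)                     # close to [0.2, 0.3, 0.5]
```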
Example. Let \(X_{1},\cdots ,X_{n}\) be i.i.d. random variables from \(U(\alpha ,\beta )\). Here \(\theta=(\alpha ,\beta)\in \Omega\). Then, we have
\[L(\theta|x_{1},\cdots ,x_{n})=\frac{1}{(\beta -\alpha)^{n}}I_{[\alpha ,\infty )}(x_{(1)})I_{(-\infty ,\beta ]}(x_{(n)}).\]
The likelihood function is not differentiable with respect to \(\alpha\) and \(\beta\), but it is maximized when \(\beta-\alpha\) is minimum, subject to the conditions \(\alpha\leq x_{(1)}\) and \(\beta\geq x_{(n)}\). This happens when \(\hat{\alpha}=x_{(1)}\) and \(\hat{\beta}=x_{(n)}\). Therefore \(\hat{\alpha}=x_{(1)}\) and \(\hat{\beta}=x_{(n)}\) are the MLE’s of \(\alpha\) and \(\beta\), respectively. In particular, if \(\alpha =\theta -c\) and \(\beta =\theta +c\), where \(c\) is a given positive constant, then
\[L(\theta |x_{1},\cdots ,x_{n})=\frac{1}{(2c)^{n}}I_{[\theta -c,\infty )}(x_{(1)})I_{(-\infty ,\theta +c]}(x_{(n)}).\]
The likelihood function is maximized with maximum \(1/(2c)^{n}\) for any \(\theta\) satisfying \(\theta -c\leq x_{(1)}\) and \(\theta +c\geq x_{(n)}\). This shows that any statistic that lies between \(X_{(n)}-c\) and \(X_{(1)}+c\) is a MLE of \(\theta\). For example, \(\frac{1}{2}[X_{(1)}+X_{(n)}]\) is such a statistic and hence a MLE of \(\theta\). If \(\beta\) is known and \(\alpha = \theta\), or if \(\alpha\) is known and \(\beta =\theta\), then \(x_{(1)}\) and \(x_{(n)}\) are the MLE’s of \(\alpha\) and \(\beta\), respectively. \(\sharp\)
The MLE need not be UMVU. There may be more than one MLE as in the above example.
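For the uniform examples, a short sketch (assuming NumPy; the true endpoints and the constant \(c\) are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(9)

# U(alpha, beta): the MLEs are the extreme order statistics
x = rng.uniform(2.0, 7.0, size=100_000)
print(x.min(), x.max())                    # close to (2.0, 7.0)

# U(theta - c, theta + c): any value in [x_(n) - c, x_(1) + c] is a MLE of theta;
# the midrange is one convenient choice
c, theta_true = 1.0, 5.0
y = rng.uniform(theta_true - c, theta_true + c, size=100_000)
print(0.5 * (y.min() + y.max()))           # close to 5.0
```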
Theorem. Let \(X_{1},\cdots ,X_{n}\) be i.i.d. random variables with p.d.f. \(f(\cdot ; \theta)\) for \(\theta\in\Omega\subseteq\mathbb{R}^{r}\), and let \({\bf T}=(T_{1},\cdots ,T_{r})\), \(T_{j}=T_{j} (X_{1},\cdots ,X_{n})\) for \(j=1,\cdots ,r\) be a sufficient statistic for \(\theta=(\theta_{1},\cdots ,\theta_{r})\). Then if \(\hat{\theta}=(\hat{\theta}_{1},\cdots ,\hat{\theta}_{r} )\) is the unique MLE of \(\theta\), it follows that \(\hat{\theta}\) is a function of \({\bf T}\).
Theorem. Let \(X_{1},\cdots ,X_{n}\) be i.i.d. random variables with p.d.f. \(f(\cdot ; \theta)\) for \(\theta\in\Omega\subseteq\mathbb{R}^{r}\), and let \(\phi\) be defined on \(\Omega\) into \(\Omega^{*} \subseteq\mathbb{R}^{m}\) and let it be one-to-one. Suppose \(\hat{\theta}\) is a MLE of \(\theta\). Then \(\phi (\hat{\theta})\) is a MLE of \(\phi (\theta)\). That is, a MLE is invariant under one-to-one transformations.
Definition. (Factorization). Let \(X_{1},X_{2},\cdots ,X_{n}\) denote random variables with joint p.d.f. \(f(x_{1},\cdots ,x_{n};\theta )\), which depends on the parameter \(\theta\). The statistic \(Y=u(X_{1},\cdots ,X_{n})\) is sufficient for \(\theta\) when
\[f(x_{1},\cdots ,x_{n};\theta )=\phi (u(x_{1},\cdots ,x_{n});\theta )h(x_{1},\cdots ,x_{n}),\]
where \(\phi\) depends on \(x_{1},x_{2},\cdots ,x_{n}\) only through \(u(x_{1},\cdots ,x_{n})\) and the function \(h(x_{1},\cdots ,x_{n})\) does not depend on \(\theta\). \(\sharp\)
Note that the random variables \(X_{1},X_{2},\cdots ,X_{n}\) will often be those of a random sample, and hence their joint p.d.f. will be of the form
\[f(x_{1},\cdots ,x_{n};\theta )=f(x_{1};\theta )f(x_{2};\theta )\cdots f(x_{n};\theta ).\]
Example. Let \(X_{1},X_{2},\cdots ,X_{n}\) denote a random sample from a Poisson distribution with parameter \(\lambda >0\). Then
\begin{align*} f(x_{1};\lambda )\cdots f(x_{n};\lambda ) & =\frac{\lambda^{\sum x_{i}}\cdot e^{-n\lambda}}{x_{1}!x_{2}!\cdots x_{n}!}\\ & =\lambda^{n\bar{x}}\cdot e^{-n\lambda}\left (\frac{1}{x_{1}!x_{2}!\cdots x_{n}!}\right ),\end{align*}
where \(\bar{x}=(1/n)\sum x_{i}\). It is clear that the sample mean \(\bar{X}\) is a sufficient statistic for \(\lambda\). It can easily be shown that the maximum likelihood estimator for \(\lambda\) is also \(\bar{X}\). It is quite obvious that the sum \(\sum X_{i}\) is also a sufficient statistic for \(\lambda\). \(\sharp\)
Example. Let \(X_{1},X_{2},\cdots ,X_{n}\) be a random sample from \(N(\mu ,1)\) for \(-\infty <\mu <\infty\). The joint p.d.f. of these random variables is
\begin{align*}
\frac{1}{(2\pi )^{n/2}}\exp\left [-\frac{1}{2}\sum_{i=1}^{n} (x_{i}-\mu )^{2}\right ] & =\frac{1}{(2\pi )^{n/2}}\exp\left [-\frac{1}{2}
\sum_{i=1}^{n}[(x_{i}-\bar{x})+(\bar{x}-\mu )]^{2}\right ]\\
& =\exp\left [-\frac{n}{2}(\bar{x}-\mu )^{2}\right ]\cdot\frac{1}{(2\pi )^{n/2}}\cdot\exp\left [-\frac{1}{2}\sum_{i=1}^{n} (x_{i}-\bar{x})^{2}\right ].
\end{align*}
We see that \(\bar{X}\) is a sufficient statistic for \(\mu\). Now \(\bar{X}^{3}\) is also sufficient for \(\mu\), since knowing \(\bar{X}^{3}\) is equivalent to having knowledge of the value of \(\bar{X}\). However, \(\bar{X}^{2}\) does not have this property, and it is not a sufficient statistic for \(\mu\). \(\sharp\)
In general, we see that if \(Y\) is sufficient for a parameter \(\theta\), then every single-valued function of \(Y\), not involving \(\theta\) but with
a single-valued inverse, is also a sufficient statistic for \(\theta\). Again the reason is that knowing either \(Y\) or that function of \(Y\), we know the other. More formally, if \(W=v(Y)=v(u(X_{1},\cdots X_{n}))\) is that function and \(Y=v^{-1}(W)\) is the single-valued inverse, then the display of the factorization theorem can be written as
\[f(x_{1},\cdots ,x_{n};\theta )=\phi (v^{-1}(v(u(x_{1},\cdots ,x_{n})));\theta )h(x_{1},\cdots ,x_{n}).\]
The first factor of the right-hand member of this equation depends on \(x_{1},x_{2},\cdots ,x_{n}\) through \(v(u(x_{1},\cdots x_{n}))\), so
\(W=v(u(X_{1},\cdots ,X_{n}))\) is a sufficient statistic for \(\theta\). One consequence of the sufficiency of a statistic \(Y\) is that the conditional probability of any given event \(A\) in the support of \(X_{1},X_{2},\cdots ,X_{n}\), given \(Y=y\), does not depend on \(\theta\). This is sometimes used as the definition of sufficiency.
Example. Let \(X_{1},X_{2},\cdots ,X_{n}\) be a random sample from a distribution with p.d.f.
\[f(x;p)=p^{x}(1-p)^{1-x}, x=0,1,\]
where the parameter \(p\) is \(0<p<1\). We know that
\[Y=X_{1}+X_{2}+\cdots +X_{n}\]
is \(B(n,p)\) and \(Y\) is sufficient for \(p\), since the joint p.d.f. of \(X_{1},X_{2},\cdots ,X_{n}\) is
\[p^{x_{1}}(1-p)^{1-x_{1}}\cdots p^{x_{n}}(1-p)^{1-x_{n}}=
\left (p^{\sum x_{i}}(1-p)^{n-\sum x_{i}}\right )\cdot (1),\]
where \(\phi (y;p)=p^{y}(1-p)^{n-y}\) and \(h(x_{1},\cdots ,x_{n})=1\). What then is the conditional probability
\[\mathbb{P}(X_{1}=x_{1},X_{2}=x_{2},\cdots ,X_{n}=x_{n}|Y=y),\]
where \(y=0,1,2,\cdots ,n-1\), or \(n\)? Unless the sum of nonnegative integers \(x_{1},x_{2},\cdots ,x_{n}\) equals \(y\), this conditional probability is obviously equal to zero, which does not depend on \(p\). Hence it is only interesting to consider the solution when \(y=x_{1}+\cdots +x_{n}\). From the definition of conditional probability we have
\begin{align*}
\mathbb{P}(X_{1}=x_{1},X_{2}=x_{2},\cdots ,X_{n}=x_{n}|Y=y) & =\frac{\mathbb{P}(X_{1}=x_{1},\cdots ,X_{n}=x_{n})}{\mathbb{P}(Y=y)}\\
& =\frac{p^{x_{1}}(1-p)^{1-x_{1}}\cdots p^{x_{n}}(1-p)^{1-x_{n}}}{C_{y}^{n}p^{y}(1-p)^{n-y}}=\frac{1}{C^{n}_{y}},
\end{align*}
where \(y=x_{1}+\cdots +x_{n}\). Since \(y\) equals the number of ones in the collection \(x_{1},x_{2},\cdots ,x_{n}\), this answer is just the probability of selecting a particular arrangement, namely \(x_{1},x_{2},\cdots ,x_{n}\), of \(y\) ones and \(n-y\) zeros, and it does not depend on the parameter \(p\). That is, given that the sufficient statistic \(Y=y\), the conditional probability of \(X_{1}=x_{1},X_{2}=x_{2},\cdots ,X_{n}=x_{n}\) does not depend on the parameter \(p\). \(\sharp\)
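This independence from \(p\) can be illustrated by simulation (a sketch assuming NumPy; the sample size, the value of \(y\), and the two values of \(p\) are arbitrary): conditioning on \(Y=y\), every arrangement of \(y\) ones appears with relative frequency about \(1/C_{y}^{n}\), whatever \(p\) is.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(10)
n, y = 4, 2                     # condition on Y = x_1 + ... + x_4 = 2

for p in (0.3, 0.7):
    samples = rng.binomial(1, p, size=(200_000, n))
    kept = samples[samples.sum(axis=1) == y]          # keep only samples with Y = y
    counts = Counter(map(tuple, kept))
    total = sum(counts.values())
    # each of the C(4,2) = 6 arrangements should have relative frequency about 1/6
    print(p, {k: round(v / total, 3) for k, v in sorted(counts.items())})
```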
Theorem. Let \(X_{1},X_{2},\cdots ,X_{n}\) be a random sample from a distribution with a p.d.f. of the exponential form
\[f(x;\theta )=\exp\left [K(x)p(\theta )+S(x)+q(\theta )\right ]\]
on a support free of \(\theta\). The statistic \(\sum_{i=1}^{n} K(X_{i})\) is sufficient for \(\theta\).
Proof. The joint p.d.f. of \(X_{1},X_{2},\cdots ,X_{n}\) is given by
\[\exp\left [p(\theta )\sum_{i=1}^{n} K(x_{i})+\sum_{i=1}^{n} S(x_{i})+nq(\theta )\right ]=\exp\left [p(\theta )\sum_{i=1}^{n} K(x_{i})+nq(\theta )\right ]\cdot\exp\left [\sum_{i=1}^{n} S(x_{i})\right ].\]
In accordance with the factorization, the statistic \(\sum_{i=1}^{n} K(X_{i})\) is sufficient for \(\theta\). \(\blacksquare\)
From the above examples, we have, respectively,
\begin{align*}
\frac{e^{-\lambda}\lambda^{x}}{x!} & =\exp (x\ln\lambda -\ln x!-\lambda )\mbox{ for }x=0,1,2,\cdots ,\\
\frac{1}{\sqrt{2\pi}}\exp\left (-\frac{(x-\mu )^{2}}{2}\right ) & =\exp\left [x\mu -\frac{x^{2}}{2}-\frac{\mu^{2}}{2}-\frac{1}{2}\ln (2\pi )\right ]\mbox{ for } -\infty <x<\infty,\\
p^{x}(1-p)^{1-x} & =\exp\left [x\ln\left (\frac{p}{1-p}\right )+\ln (1-p)\right ]\mbox{ for }x=0,1.
\end{align*}
In each of these examples, the sum \(\sum X_{i}\) of the observations of the random sample was the sufficient statistic for the parameter.
Example. Let \(X_{1},X_{2},\cdots ,X_{n}\) be a random sample from an exponential distribution with p.d.f.
\begin{align*} f(x;\theta ) & =\frac{1}{\theta}e^{-x/\theta}\\ & =\exp\left [x\left (-\frac{1}{\theta}\right )-\ln\theta\right ]\end{align*}
for \(0<x<\infty\), provided \(0<\theta <\infty\). Since \(K(x)=x\), the sum \(\sum_{i=1}^{n} X_{i}\) is sufficient for \(\theta\). It follows that \(\bar{X}=(1/n)\sum_{i=1}^{n} X_{i}\) is also sufficient for \(\theta\). \(\sharp\)


