Statistical Validation Techniques




Goodness of Fit Tests.

One often begins a probabilistic analysis of a given phenomenon by hypothesizing that certain of its random elements have a particular probability distribution. Such hypotheses can be statistically tested by observing data and then seeing whether the assumption of a particular probability distribution is consistent with the data. These statistical tests are called goodness of fit tests.

The Chi-Square Goodness of Fit Test for Discrete Data.

Suppose that \(n\) independent random variables \(Y_{1},\cdots ,Y_{n}\), each taking on one of the values \(1,2,\cdots ,k\), are to be observed, and that we are interested in testing the hypothesis that \(\{p_{i}:i=1,\cdots ,k\}\) is the probability mass function of these random variables. That is, with \(Y\) representing any of the \(Y_{j}\), the hypothesis to be tested, which we denote by \(H_{0}\) and refer to as the null hypothesis, is \(H_{0}:P\{Y=i\}=p_{i}\) for \(i=1,\cdots ,k\). To test the foregoing hypothesis, let \(N_{i}\), \(i=1,\cdots ,k\), denote the number of the \(Y_{j}\)’s that equal \(i\). Because each \(Y_{j}\) independently equals \(i\) with probability \(P\{Y=i\}=p_{i}\), it follows that, under \(H_{0}\), \(N_{i}\) is binomial with parameters \(n\) and \(p_{i}\). Hence, when \(H_{0}\) is true, we have \(\mathbb{E}(N_{i})=np_{i}\), and so \((N_{i}-np_{i})^{2}\) indicates how far the observed count \(N_{i}\) is from what would be expected if \(p_{i}\) were indeed the probability that \(Y=i\). When this quantity is large, say, in relation to \(np_{i}\), it is an indication that \(H_{0}\) is not correct. Indeed, such reasoning leads us to consider the quantity

\[T=\sum_{i=1}^{k} \frac{(N_{i}-np_{i})^{2}}{np_{i}}\]

and to reject the null hypothesis when \(T\) is large. Whereas small values of the test quantity \(T\) are evidence in favor of the hypothesis \(H_{0}\), large ones are indicative of its falsity. Suppose now that the actual data result in the test quantity \(T\) taking on the value \(t\). To see how unlikely such a large outcome would have been if the null hypothesis had been true, we define the so-called \(p\)-value by

\[p\mbox{-value}=\mathbb{P}_{H_{0}}\{T\geq t\}\]

where we have used the notation \(\mathbb{P}_{H_{0}}\) to indicate that the probability is to be computed under the assumption that \(H_{0}\) is correct. Therefore, the \(p\)-value gives the probability that a value of \(T\) as large as the one observed would have occurred if the null hypothesis were true. It is typical to reject the null hypothesis, saying that it appears to be inconsistent with the data, when a small \(p\)-value results (a value less than \(0.05\), or more conservatively, \(0.01\) is usually taken to be critical) and to accept the null hypothesis, saying that it appears to be consistent with the data, otherwise. After observing the value, call it \(t\), of the test quantity, it thus remains to determine the probability

\[p\mbox{-value}=\mathbb{P}_{H_{0}}\{T\geq t\}.\]

A reasonably good approximation to this probability can be obtained by using the classical result that, for large values of \(n\), \(T\) has approximately a chi-square distribution with \(k-1\) degrees of freedom when \(H_{0}\) is true. Hence

\begin{equation}{\label{sch5eq1}}\tag{1}
p\mbox{-value}\approx \mathbb{P}\{\chi^{2}_{k-1}\geq t\}
\end{equation}

where \(\chi^{2}_{k-1}\) is a chi-square random variable with \(k-1\) degrees of freedom.

Example. Consider a random quantity which can take on any of the possible values \(1,2,3,4,5\), and suppose we want to test the hypothesis that these values are equally likely to occur. That is, we want to test \(H_{0}:p_{i}=0.2\) for \(i=1,2,3,4,5\). If a sample of size \(50\) yields the counts \(N_{i}=12,5,19,7,7\), then the approximate \(p\)-value is obtained as follows. The value of the test statistic \(T\) is given by

\[T=\frac{4+25+81+9+9}{10}=12.8\]

Now we have

\[p\mbox{-value}\approx \mathbb{P}\{\chi^{2}_{4}>12.8\}=0.0122\]

For such a low \(p\)-value the hypothesis that all outcomes are equally likely would be rejected. \(\sharp\)
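Equation (\ref{sch5eq1}) needs a chi-square tail probability; when the number of degrees of freedom is even, as in this example, it has a simple closed form, since the chi-square tail equals a Poisson lower-tail probability. The figure above can therefore be checked without statistical tables. A minimal Python sketch (the function name is illustrative):

```python
import math

def chi2_tail_even_df(t, df):
    """P{chi^2_df >= t} for even df, via the identity
    P{chi^2_{2k} >= t} = e^{-t/2} * sum_{j=0}^{k-1} (t/2)^j / j!
    (the chi-square tail equals a Poisson(t/2) lower tail)."""
    if df % 2 != 0:
        raise ValueError("closed form shown only for even df")
    half = t / 2.0
    return math.exp(-half) * sum(half ** j / math.factorial(j)
                                 for j in range(df // 2))
```

For the data above, \texttt{chi2\_tail\_even\_df(12.8, 4)} evaluates to about \(0.0123\), agreeing with the quoted \(p\)-value up to rounding.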

If the \(p\)-value approximation given by (\ref{sch5eq1}) is not too small, say, of the order of \(0.15\) or larger, then it is clear that the null hypothesis is not going to be rejected, and so there is no need to look for a better approximation. However, when the \(p\)-value is closer to a critical value (such as \(0.05\) or \(0.01\)) we would probably want a more accurate estimate of it than the one given by the chi-square approximation. Fortunately, a more accurate estimate can be obtained via a simulation study. The simulation approach to estimating the \(p\)-value of the outcome \(T=t\) is as follows. To determine the probability that \(T\) would have been at least as large as \(t\) when \(H_{0}\) is true, we generate \(n\) independent random variables \(Y_{11},\cdots ,Y_{1n}\), each having the probability mass function \(\{p_{i}:i=1,\cdots ,k\}\), that is, \(\mathbb{P}\{Y_{1j}=i\}=p_{i}\) for \(i=1,\cdots ,k\) and \(j=1,\cdots ,n\). Let

\[N_{1i}=\mbox{number \(j\): \(Y_{1j}=i\)}\]

and set

\[T_{1}=\sum_{i=1}^{k} \frac{(N_{1i}-np_{i})^{2}}{np_{i}}.\]

Now repeat this procedure by simulating a second set, independent of the first set, of \(n\) independent random variables \(Y_{21},\cdots ,Y_{2n}\) each having probability mass function \(\{p_{i}:i=1,\cdots ,k\}\) and then, as for the first set, determining \(T_{2}\). Repeating this a large number of times, say \(r\), yields \(r\) independent random variables \(T_{1},\cdots , T_{r}\), each of which has the same distribution as the test statistic \(T\) when \(H_{0}\) is true. Hence, by the law of large numbers, the proportion of the \(T_{i}\) that are as large as \(t\) will be very nearly equal to the probability that \(T\) is as large as \(t\) when \(H_{0}\) is true, that is,

\[\frac{\mbox{number \(l\): \(T_{l}\geq t\)}}{r}\approx \mathbb{P}_{H_{0}}\{T\geq t\}.\]
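The simulation procedure just described can be sketched in a few lines of pure standard-library Python (all function names are illustrative, not from any particular package):

```python
import random

def chi2_stat(counts, probs, n):
    """T = sum over i of (N_i - n p_i)^2 / (n p_i)."""
    return sum((N - n * p) ** 2 / (n * p) for N, p in zip(counts, probs))

def simulated_p_value(counts, probs, reps=10_000, rng=None):
    """Estimate P_{H0}{T >= t}: simulate reps samples of size n from the
    hypothesized pmf {p_i} and count how often the simulated statistic
    is at least the observed value t."""
    rng = rng or random.Random(0)
    n, k = sum(counts), len(probs)
    t = chi2_stat(counts, probs, n)
    exceed = 0
    for _ in range(reps):
        draws = rng.choices(range(k), weights=probs, k=n)
        sim_counts = [draws.count(i) for i in range(k)]
        if chi2_stat(sim_counts, probs, n) >= t:
            exceed += 1
    return exceed / reps
```

Applied to the example above (\(N_{i}=12,5,19,7,7\) with \(p_{i}=0.2\)), the simulated \(p\)-value comes out close to the chi-square approximation.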

The Kolmogorov-Smirnov Test for Continuous Data.

We consider the situation where \(Y_{1},\cdots ,Y_{n}\) are independent random variables, and we are interested in testing the null hypothesis \(H_{0}\) that they have the common distribution function \(F\), where \(F\) is a given continuous distribution function. One approach to testing \(H_{0}\) is to break up the set of possible values of the \(Y_{j}\) into \(k\) distinct intervals, say

\[(y_{0},y_{1}), (y_{1},y_{2}),\cdots ,(y_{k-1},y_{k}),\]

where \(y_{0}=-\infty\) and \(y_{k}=+\infty\).  Then, we consider the discretized random variables \(Y_{j}^{D}\) for \(j=1,\cdots ,n\), defined by \(Y_{j}^{D}=i\) if \(Y_{j}\) lies in the interval \((y_{i-1},y_{i})\). The null hypothesis then implies

\begin{align*} \mathbb{P}\{Y_{j}^{D}=i\} & =\mathbb{P}\{y_{i-1}<Y_{j}<y_{i}\}\\ & =F(y_{i})-F(y_{i-1})\end{align*}

for \(i=1,\cdots ,k\).  This can be tested by the chi-square goodness of fit test already presented in the above discussion.

There is another way of testing that the \(Y_{j}\) come from the continuous distribution function \(F\) which is generally more efficient than discretization. It works as follows. After observing \(Y_{1},\cdots ,Y_{n}\), let \(F_{e}\) be the empirical distribution function defined by

\[F_{e}(x)=\frac{\mbox{number \(i\): \(Y_{i}\leq x\)}}{n}.\]

That is, \(F_{e}(x)\) is the proportion of the observed values that are less than or equal to \(x\). Because \(F_{e}(x)\) is a natural estimator of the probability that an observation is less than or equal to \(x\), it follows that, if the null hypothesis that \(F\) is the underlying distribution is correct, \(F_{e}(x)\) should be close to \(F(x)\). Since this is so for all \(x\), a natural quantity on which to base a test of \(H_{0}\) is the test quantity \(D=\max_{x}|F_{e}(x)-F(x)|\), where the maximum (the more mathematically sophisticated readers will recognize that technically we should have written supremum rather than maximum) is over all values of \(x\) from \(-\infty\) to \(+\infty\). This quantity \(D\) is called the Kolmogorov-Smirnov test statistic. To compute the value of \(D\) for a given data set \(Y_{j}=y_{j}\) for \(j=1,\cdots ,n\), let \(y_{(1)},y_{(2)},\cdots ,y_{(n)}\) denote the values of the \(y_{j}\) in increasing order. That is,

\[y_{(j)}=\mbox{$j$th smallest of \(y_{1},\cdots ,y_{n}\)}\]

For example, if \(n=3\) and \(y_{1}=3\), \(y_{2}=5\), \(y_{3}=1\), then \(y_{(1)}=1\), \(y_{(2)}=3\), \(y_{(3)}=5\). Since \(F_{e}(x)\) can be written as

\[F_{e}(x)=\left\{\begin{array}{ll}
0 & \mbox{if \(x<y_{(1)}\)}\\
1/n & \mbox{if \(y_{(1)}\leq x<y_{(2)}\)}\\
\vdots &\\
j/n & \mbox{if \(y_{(j)}\leq x<y_{(j+1)}\)}\\
\vdots &\\
1 & \mbox{if \(y_{(n)}\leq x\)}
\end{array}\right .\]

we see that \(F_{e}(x)\) is constant within the intervals \((y_{(j-1)},y_{(j)})\) and then jumps by \(1/n\) at the points \(y_{(1)},y_{(2)},\cdots ,y_{(n)}\). Since \(F(x)\) is an increasing function of \(x\) which is bounded by \(1\), it follows that the maximum (supremum) value of \(F_{e}(x)-F(x)\) is nonnegative and occurs at one of the points \(y_{(j)}\), \(j=1,\cdots ,n\). That is, we have

\begin{equation}{\label{sch5eq2}}\tag{2}
\max_{x}(F_{e}(x)-F(x))=\max_{j=1,\cdots ,n}\left (\frac{j}{n}-F(y_{(j)})\right )
\end{equation}

Similarly, the maximum (supremum) value of \(F(x)-F_{e}(x)\) is also nonnegative and occurs immediately before one of the jump points \(y_{(j)}\), which says

\begin{equation}{\label{sch5eq3}}\tag{3}
\max_{x}(F(x)-F_{e}(x))=\max_{j=1,\cdots ,n}\left (F(y_{(j)})-\frac{j-1}{n}\right ).
\end{equation}

From equations (\ref{sch5eq2}) and (\ref{sch5eq3}), we have

\begin{align}
D & =\max_{x}|F_{e}(x)-F(x)|\nonumber\\
& =\max\{\max_{x}(F_{e}(x)-F(x)),\max_{x}(F(x)-F_{e}(x))\}\nonumber\\
& =\max\left\{\frac{j}{n}-F(y_{(j)}),F(y_{(j)})-\frac{j-1}{n}:j=1,\cdots ,n\right\}\label{sch5eq5}\tag{4}
\end{align}

Suppose that the \(Y_{j}\) are observed and their values are such that \(D=d\). Since a large value of \(D\) would appear to be inconsistent with the null hypothesis that \(F\) is the underlying distribution, it follows that the \(p\)-value for this data is given by

\[p\mbox{-value}=\mathbb{P}_{F}\{D\geq d\}\]

where we have written \(\mathbb{P}_{F}\) to make explicit that this probability is to be computed under the assumption that \(H_{0}\) is correct (and so \(F\) is the underlying distribution). The above \(p\)-value can be approximated by a simulation that is made easier by the following proposition, which shows that \(\mathbb{P}_{F}\{D\geq d\}\) does not depend on the underlying distribution \(F\). This result enables us to estimate the \(p\)-value by doing the simulation with any continuous distribution \(F\) we choose (thus allowing us to use the uniform \((0,1)\) distribution).

Proposition. \(\mathbb{P}_{F}\{D\geq d\}\) is the same for any continuous distribution \(F\).

Proof. We have

\begin{align*}
\mathbb{P}_{F}\{D\geq d\} & =\mathbb{P}_{F}\left\{\max_{x}\left |\frac{\mbox{number \(i\): \(Y_{i}\leq x\)}}{n}-F(x)\right |\geq d\right\}\\
& =\mathbb{P}_{F}\left\{\max_{x}\left |\frac{\mbox{number \(i\): \(F(Y_{i})\leq F(x)\)}}{n}-F(x)\right |\geq d\right\}\\ & \quad\mbox{(since \(F\) is an increasing function)}\\
& =\mathbb{P}\left\{\max_{x}\left |\frac{\mbox{number \(i\): \(U_{i}\leq F(x)\)}}{n}-F(x)\right |\geq d\right\}\\
& \quad\mbox{(if \(Y\) has the continuous distribution \(F\) then \(F(Y)\) is uniform on \((0,1)\))}\\
& =\mathbb{P}\left\{\max_{0\leq y\leq 1}\left |\frac{\mbox{number \(i\): \(U_{i}\leq y\)}}{n}-y\right |\geq d\right\}\\
& \quad\mbox{(since \(0\leq y=F(x)\leq 1\))}
\end{align*}

which shows that the distribution of \(D\), when \(H_{0}\) is true, does not depend on the actual distribution \(F\). \(\blacksquare\)

It follows from the preceding proposition that after the value of \(D\) is determined from the data, say, \(D=d\), the \(p\)-value can be obtained by doing a simulation with the uniform \((0,1)\) distribution. That is, we generate a set of \(n\) random numbers \(U_{1},\cdots ,U_{n}\) and then check whether or not the inequality

\begin{equation}{\label{sch5eq4}}\tag{5}
\max_{0\leq y\leq 1}\left |\frac{\mbox{number \(i\): \(U_{i}\leq y\)}}{n}-y\right |\geq d
\end{equation}

is valid. This is repeated many times and the proportion of times that it is valid is our estimate of the \(p\)-value of the data set. As noted earlier, the left side of the inequality (\ref{sch5eq4}) can be computed by ordering the random numbers and then using the identity

\[\max_{0\leq y\leq 1}\left |\frac{\mbox{number \(i\): \(U_{i}\leq y\)}}{n}-y\right |=\max\left\{\frac{j}{n}-U_{(j)},U_{(j)}-\frac{j-1}{n}:j=1,\cdots ,n\right\}\]

where \(U_{(j)}\) is the \(j\)th smallest value of \(U_{1},\cdots ,U_{n}\). For example, if \(n=3\) and \(U_{1}=0.7\), \(U_{2}=0.6\), \(U_{3}=0.4\), then \(U_{(1)}=0.4\), \(U_{(2)}=0.6\), \(U_{(3)}=0.7\) and the value of \(D\) for this data set is

\[D=\max\left\{\frac{1}{3}-0.4,\frac{2}{3}-0.6,1-0.7,0.4,0.6-\frac{1}{3},0.7-\frac{2}{3}\right\}=0.4.\]

Example. We want to test the hypothesis that a given population distribution is exponential with mean \(100\); that is, \(F(x)=1-e^{-x/100}\). Suppose the ordered values from a sample of size \(10\) from this distribution are \(66,72,81,94,112,116,124,140,145,155\). What conclusion can be drawn? To answer this, we first employ equation (\ref{sch5eq5}) to compute the value of the Kolmogorov-Smirnov test quantity \(D\). After some computation this gives the result \(D=0.4831487\). Then, using simulation, we obtain the approximate \(p\)-value \(0.012\). Because the \(p\)-value is so low (it is extremely unlikely that the smallest of a set of \(10\) values from the exponential distribution with mean \(100\) would be as large as \(66\)), the hypothesis would be rejected. \(\sharp\)
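Both steps of this example, evaluating \(D\) through the order-statistic identity (\ref{sch5eq5}) and estimating the \(p\)-value with uniform samples as justified by the proposition above, can be sketched in Python (pure standard library; function names are illustrative):

```python
import math
import random

def ks_statistic(sample, cdf):
    """D = max over j of max( j/n - F(y_(j)), F(y_(j)) - (j-1)/n ),
    the order-statistic identity for the Kolmogorov-Smirnov statistic."""
    ys = sorted(sample)
    n = len(ys)
    return max(max((j + 1) / n - cdf(y), cdf(y) - j / n)
               for j, y in enumerate(ys))

def ks_p_value(d, n, reps=10_000, rng=None):
    """Estimate P{D >= d} under H0; by the distribution-free proposition,
    simulating uniform(0,1) samples suffices for any continuous F."""
    rng = rng or random.Random(0)
    exceed = 0
    for _ in range(reps):
        us = [rng.random() for _ in range(n)]
        if ks_statistic(us, lambda y: y) >= d:
            exceed += 1
    return exceed / reps
```

For the exponential example just given, \texttt{ks\_statistic} with \texttt{cdf = lambda x: 1 - math.exp(-x/100)} reproduces \(D\approx 0.4831\), and \texttt{ks\_p\_value} then yields a small estimated \(p\)-value.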

Goodness of Fit Tests when Some Parameters are Unspecified.

The Discrete Data Case.

We can also perform a goodness of fit test of a null hypothesis that does not completely specify the probabilities \(\{p_{i}:i=1,\cdots ,k\}\). For example, suppose we are interested in testing whether the daily number of customers in a certain company has a Poisson distribution with some unspecified mean. To test this hypothesis, suppose that data are obtained over \(n\) days and let \(Y_{i}\) represent the number of customers in day \(i\), for \(i=1,\cdots ,n\). To determine whether these data are consistent with the assumption of an underlying Poisson distribution, we must first address the difficulty that, if the Poisson assumption is correct, the \(Y_{i}\) can assume an infinite number of possible values. This is accomplished by breaking up the set of possible values into a finite number of, say \(k\), regions and then seeing in which of the regions the \(n\) data points lie. For instance, if the company of interest is small, and so there are not too many customers in a day, we might say that the number of customers in a given day falls in region \(i\), for \(i=1,2,3,4,5\), when there are \(i-1\) customers on that day, and in region \(6\) when there are \(5\) or more customers. Hence, if the underlying distribution is indeed Poisson with mean \(\lambda\), then

\begin{equation}{\label{sch5eq6}}\tag{6}
p_{i}=\mathbb{P}\{Y=i-1\}=\frac{e^{-\lambda}\lambda^{i-1}}{(i-1)!}\mbox{ for }i=1,2,3,4,5
\mbox{ and }p_{6}=1-\sum_{j=0}^{4}\frac{e^{-\lambda}\lambda^{j}}{j!}.
\end{equation}

Another difficulty we face in obtaining a goodness of fit test of the hypothesis that the underlying distribution is Poisson is that the mean value \(\lambda\) is not specified. The intuitive thing to do when \(\lambda\) is unspecified is clearly to estimate its value from the data, calling the estimate \(\widehat{\lambda}\), and then compute the value of the test statistic

\[T=\sum_{i=1}^{k} \frac{(N_{i}-n\widehat{p}_{i})^{2}}{n\widehat{p}_{i}}\]

where \(N_{i}\) is the number of the \(Y_{j}\) that fall in region \(i\), and where \(\widehat{p}_{i}\) is the estimated probability, under \(H_{0}\), that \(Y_{j}\) falls in region \(i\), \(i=1,\cdots ,k\), which is obtained by substituting \(\widehat{\lambda}\) for \(\lambda\) in the expression (\ref{sch5eq6}). The above approach can be used whenever there are unspecified parameters in the null hypothesis that are needed to compute the quantities \(p_{i}\) for \(i=1,\cdots ,k\). Suppose now that there are \(m\) such unspecified parameters. It can be proved that, for reasonable estimators of these parameters, when \(n\) is large the test quantity \(T\) has, when \(H_{0}\) is true, approximately a chi-square distribution with \(k-1-m\) degrees of freedom. In other words, one degree of freedom is lost for each parameter that needs to be estimated. If the test quantity takes on the value, say \(T=t\), then using the above, the \(p\)-value can be approximated by

\[p\mbox{-value}\approx \mathbb{P}\{\chi^{2}_{k-1-m}\geq t\}\]

where \(\chi^{2}_{k-1-m}\) is a chi-square random variable with \(k-1-m\) degrees of freedom.

\begin{equation}{\label{sch5ex10}}\tag{7}\mbox{}\end{equation}

Example \ref{sch5ex10}. Suppose that over a \(30\)-day period there are \(6\) days in which no customers occurred, \(2\) days in which \(1\) customer occurred, \(1\) day in which \(2\) customers occurred, \(9\) days in which \(3\) customers occurred, \(7\) days in which \(4\) customers occurred, \(4\) days in which \(5\) customers occurred, and \(1\) day in which \(8\) customers occurred. To test whether these data are consistent with the hypothesis of an underlying Poisson distribution, note first that since there are a total of \(87\) customers, the estimate of the mean of the Poisson distribution is \(\widehat{\lambda}=\frac{87}{30}=2.9\). Since the estimate of \(P\{Y=i\}\) is thus \(e^{-2.9}(2.9)^{i}/i!\), we obtain that with the six regions as given at the beginning of this section

\[\begin{array}{l}
\widehat{p}_{1}=0.0550, \widehat{p}_{2}=0.1596, \widehat{p}_{3}=0.2312\\
\widehat{p}_{4}=0.2237, \widehat{p}_{5}=0.1622, \widehat{p}_{6}=0.1682
\end{array}\]

Using the data values \(N_{1}=6\), \(N_{2}=2\), \(N_{3}=1\), \(N_{4}=9\), \(N_{5}=7\), \(N_{6}=4+1=5\), we see that the value of the test statistic is

\[T=\sum_{i=1}^{6}\frac{(N_{i}-30\widehat{p}_{i})^{2}}{30\widehat{p}_{i}}=19.887.\]
Then, we obtain the approximate \(p\)-value

\[p\mbox{-value}\approx \mathbb{P}\{\chi^{2}_{4}>19.887\}=0.0005,\]

which says that the hypothesis of an underlying Poisson distribution is rejected. \(\sharp\)

We can also use simulation to estimate the \(p\)-value. However, since the null hypothesis no longer completely specifies the probability model, the use of simulation to determine the \(p\)-value of the test statistic is somewhat trickier than before. The way it should be done is as follows.

(i) The Model. Suppose that the null hypothesis is that the data values \(Y_{1},\cdots ,Y_{n}\) constitute a random sample from a distribution that is specified up to a set of unknown parameters \(\theta_{1},\cdots ,\theta_{m}\). Suppose also that when this hypothesis is true, the possible values of the \(Y_{i}\) are \(1,\cdots ,k\).

(ii) The Initial Step. Use the data to estimate the unknown parameters. Specifically, let \(\widehat{\theta}_{j}\) denote the value of the estimator of \(\theta_{j}\), \(j=1,\cdots ,m\). Now, we compute the value of the test statistic

\[T=\sum_{i=1}^{k} \frac{(N_{i}-n\widehat{p}_{i})^{2}}{n\widehat{p}_{i}}\]

where \(N_{i}\) is the number of the data values that are equal to \(i\) for \(i=1,\cdots ,k\), and \(\widehat{p}_{i}\) is the estimate of \(p_{i}\) that results when \(\widehat{\theta}_{j}\) is substituted for \(\theta_{j}\) for \(j=1,\cdots ,m\). Let \(t\) denote the value of the test quantity \(T\).

(iii) The Simulation Step. We now do a series of simulations to estimate the \(p\)-value of the data. First note that all simulations are to be obtained by using the population distribution that results when the null hypothesis is true and \(\theta_{j}\) is equal to the estimate \(\widehat{\theta}_{j}\) for \(j=1,\cdots ,m\), determined in Step (ii). Simulate a sample of size \(n\) from the aforementioned population distribution and let \(\widehat{\theta}_{j}^{sim}\) denote the estimate of \(\theta_{j}\) for \(j=1,\cdots ,m\), based on the simulated data. Now, we determine the value of

\[T_{sim}=\sum_{i=1}^{k}\frac{(N_{i}-n\widehat{p}_{i}^{sim})^{2}}{n\widehat{p}_{i}^{sim}}\]

where \(N_{i}\) is the number of the simulated data values equal to \(i\) for \(i=1,\cdots ,k\), and \(\widehat{p}_{i}^{sim}\) is the estimated value of \(p_{i}\) when \(\theta_{j}\) is substituted by \(\widehat{\theta}_{j}^{sim}\) for \(j=1,\cdots ,m\).

The simulation step should then be repeated many times. The estimate of the \(p\)-value is then equal to the proportion of the values of \(T_{sim}\) that are at least as large as \(t\).

Example. Let us reconsider Example \ref{sch5ex10}. The data presented in this example resulted in the estimate \(\widehat{\lambda}=2.9\) and the test quantity value \(T=19.887\). The simulation step now consists of generating \(30\) independent Poisson random variables each having mean \(2.9\) and then computing the value of

\[T^{*}=\sum_{i=1}^{6}\frac{(N_{i}-30p_{i}^{*})^{2}}{30p_{i}^{*}}\]

where \(N_{i}\) is the number of the \(30\) values that fall into region \(i\), and \(p_{i}^{*}\) is the probability that a Poisson random variable with a mean equal to the average of the \(30\) generated values would fall into region \(i\). This simulation step should be repeated many times, and the estimated \(p\)-value is the proportion of times it results in a \(T^{*}\) at least as large as \(19.887\). \(\sharp\)
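For this example the whole procedure, steps (i) through (iii) specialized to the Poisson case, can be sketched in pure standard-library Python. The Poisson sampler uses the standard product-of-uniforms method; all names are illustrative, and the sketch assumes the simulated sample mean is nonzero (essentially certain here):

```python
import math
import random

def poisson_sample(lam, rng):
    """Product-of-uniforms method: count uniforms until the running
    product drops below e^{-lam}."""
    limit = math.exp(-lam)
    k, prod = 0, 1.0
    while True:
        prod *= rng.random()
        if prod <= limit:
            return k
        k += 1

def region_probs(lam):
    """p_i for regions 1..6: region i holds the value i-1 for i<=5,
    region 6 holds 5 or more."""
    ps = [math.exp(-lam) * lam ** j / math.factorial(j) for j in range(5)]
    return ps + [1.0 - sum(ps)]

def bin_counts(data):
    counts = [0] * 6
    for y in data:
        counts[min(y, 5)] += 1
    return counts

def chi2_stat(counts, probs, n):
    return sum((N - n * p) ** 2 / (n * p) for N, p in zip(counts, probs))

def poisson_bootstrap(data, reps=2000, rng=None):
    """Return (t, p): the observed statistic and the simulated p-value,
    re-estimating lambda on every simulated sample as in step (iii)."""
    rng = rng or random.Random(0)
    n = len(data)
    lam_hat = sum(data) / n
    t = chi2_stat(bin_counts(data), region_probs(lam_hat), n)
    exceed = 0
    for _ in range(reps):
        sim = [poisson_sample(lam_hat, rng) for _ in range(n)]
        lam_sim = sum(sim) / n   # assumed nonzero in this sketch
        if chi2_stat(bin_counts(sim), region_probs(lam_sim), n) >= t:
            exceed += 1
    return t, exceed / reps
```

With the \(30\) daily counts of the example (\(\widehat{\lambda}=2.9\)), the observed statistic reproduces \(T\approx 19.887\), and the simulated \(p\)-value comes out very small, consistent with rejecting the Poisson hypothesis.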

The Continuous Data Case.

We consider the situation where we want to test the hypothesis that the random variables \(Y_{1},\cdots ,Y_{n}\) have the continuous distribution function \(F_{\theta}\), where \(\theta=(\theta_{1},\cdots ,\theta_{m})\) is a vector of unknown parameters. For example, we might be interested in testing that the \(Y_{j}\) come from a normally distributed population. To employ the Kolmogorov-Smirnov test, we first use the data to estimate the parameter vector \(\theta\), say, by the vector of estimators \(\widehat{\theta}\). The value of the test statistic \(D\) is now computed by \(D=\max_{x}|F_{e}(x)-F_{\widehat{\theta}}(x)|\), where \(F_{\widehat{\theta}}(x)\) is the distribution function obtained from \(F_{\theta}\) when \(\theta\) is replaced by its estimate \(\widehat{\theta}\). If the value of the test quantity is \(D=d\), then the \(p\)-value can be roughly approximated by

\[\mathbb{P}_{F_{\widehat{\theta}}}\{D\geq d\}=\mathbb{P}_{U}\{D\geq d\}.\]

That is, after determining the value of \(D\), a rough approximation, which actually overestimates the \(p\)-value, is obtained. If this does not result in a small estimate for the \(p\)-value, then, as the hypothesis is not going to be rejected, we might as well stop. However, if this estimated \(p\)-value is small, then a more accurate simulation-based estimate of the true \(p\)-value is necessary. We now describe how this can be done.

(i) Use the data to estimate \(\theta\), say, by \(\widehat{\theta}\). Compute the value of \(D\) as described previously.

(ii) All simulations are to be done using the distribution \(F_{\widehat{\theta}}(x)\). Generate a sample of size \(n\) from this distribution and let \(\widehat{\theta}^{sim}\) be the estimate of \(\theta\) based on this simulation run. Compute the value of

\[\max_{x}|F_{e,sim}(x)-F_{\widehat{\theta}^{sim}}(x)|\]

where \(F_{e,sim}\) is the empirical distribution function of the simulated data; and note whether it is at least as large as \(d\). Repeat this many times and use the proportion of times that this test quantity is at least as large as \(d\) as the estimate of the \(p\)-value.
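As an illustration, here is a sketch of this procedure for the common case of testing normality with both the mean and the standard deviation unspecified (pure standard-library Python; all function names are illustrative):

```python
import math
import random

def norm_cdf(x, mu, sigma):
    """Normal(mu, sigma) distribution function, via math.erf."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_stat(sample, cdf):
    """Kolmogorov-Smirnov statistic via the order-statistic identity."""
    ys = sorted(sample)
    n = len(ys)
    return max(max((j + 1) / n - cdf(y), cdf(y) - j / n)
               for j, y in enumerate(ys))

def ks_normal_p_value(data, reps=1000, rng=None):
    """Simulation estimate of the p-value for H0: the data are normal with
    unspecified mean and standard deviation; both parameters are
    re-estimated on each simulated sample, exactly as on the real data."""
    rng = rng or random.Random(0)
    n = len(data)

    def fit(xs):
        mu = sum(xs) / n
        sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
        return mu, sigma

    mu, sigma = fit(data)
    d = ks_stat(data, lambda x: norm_cdf(x, mu, sigma))
    exceed = 0
    for _ in range(reps):
        sim = [rng.gauss(mu, sigma) for _ in range(n)]
        ms, ss = fit(sim)
        if ks_stat(sim, lambda x: norm_cdf(x, ms, ss)) >= d:
            exceed += 1
    return exceed / reps
```

Note the key design point: the simulated samples are drawn from \(F_{\widehat{\theta}}\), yet each simulated \(D\) is computed against the parameters re-fitted to that simulated sample, mirroring what was done with the real data.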


The Two-Sample Problem.

Suppose that we have formulated a mathematical model for a service system which clears all its customers at the end of a day; also suppose that our model assumes that each day is probabilistically alike in that the probability laws for each day are identical and independent. Some of the individual assumptions of the model, such as, for example, that the service times are all independent with the common distribution \(G\), or that the arrivals of customers constitute a Poisson process, can be individually tested by using the results of the above sections. Suppose that none of these individual tests results in a particularly small \(p\)-value and so all the parts of the model, taken individually, do not appear to be inconsistent with the real data we have about the system. (We must be careful here in what we mean by a small \(p\)-value because, even if the model is correct, if we perform a large number of tests then, by chance, some of the resulting \(p\)-values may be small. For example, if we perform \(r\) separate tests on independent data, then the probability that at least one of the resulting \(p\)-values is as small as \(\alpha\) is \(1-(1-\alpha )^{r}\), which even for small \(\alpha\) will become large as \(r\) increases.) At this stage, however, we are still not justified in asserting that our model is correct and has been validated by the real data; for the totality of the model, including not only all the individual parts but also our assumptions about the ways in which these parts interact, may still be inaccurate. One way of testing the model in its entirety is to consider some random quantity that is a complicated function of the entire model. For example, we could consider the total amount of waiting time of all customers that enter the system in a given day. Suppose that we have observed the real system for \(m\) days and let \(Y_{i}\), \(i=1,\cdots ,m\), denote the sum of these waiting times for day \(i\).
If we now simulate the proposed mathematical model for \(n\) days, we can let \(X_{i}\), \(i=1,\cdots , n\), be the sum of the waiting times of all customers arriving on the (simulated) day \(i\). Since the mathematical model supposes that all days are probabilistically alike and independent, it follows that all the random variables \(X_{1},\cdots ,X_{n}\) have some common distribution \(F\). That is, if the mathematical model is an accurate representation of the real system, then it should not be possible to tell the simulated data \(X_{1},\cdots ,X_{n}\) apart from the real data \(Y_{1},\cdots ,Y_{m}\). From this it follows that one way of testing the accuracy of the model in its entirety is to test the null hypothesis \(H_{0}\) that \(X_{1},\cdots ,X_{n},Y_{1},\cdots ,Y_{m}\) are independent random variables having a common distribution. We now show how such a hypothesis can be tested.

Suppose we have two sets of data, \(X_{1},\cdots ,X_{n}\) and \(Y_{1},\cdots ,Y_{m}\), and we want to test the hypothesis \(H_{0}\) that these \(n+m\) random variables are all independent and identically distributed. This statistical hypothesis testing problem is called the two-sample problem. To test \(H_{0}\), order the \(n+m\) values \(X_{1},\cdots ,X_{n},Y_{1},\cdots ,Y_{m}\) and suppose for the time being that all \(n+m\) values are distinct and so the ordering is unique. Now for \(i=1,\cdots ,n\), let \(R_{i}\) denote the rank of \(X_{i}\) among the \(n+m\) data values; that is, \(R_{i}=j\) if \(X_{i}\) is the \(j\)th smallest among the \(n+m\) values. The quantity \(R=\sum_{i=1}^{n} R_{i}\), equal to the sum of the ranks of the first data set, is used as our test quantity. Either of the two data sets can be considered as the “first” set. If \(R\) is either very large (indicating that the first data set tends to be larger than the second) or very small (indicating the reverse), this would be strong evidence against the null hypothesis. Specifically, if \(R=r\), we reject the null hypothesis if either \(\mathbb{P}_{H_{0}}\{R\leq r\}\) or \(\mathbb{P}_{H_{0}}\{R\geq r\}\) is very low. Indeed, the \(p\)-value of the test data which results in \(R=r\) is given by

\begin{equation}{\label{sch5eq12}}\tag{8}
p\mbox{-value}=2\min\{\mathbb{P}_{H_{0}}\{R\leq r\},\mathbb{P}_{H_{0}}\{R\geq r\}\}
\end{equation}

It is twice the minimum of the probabilities because we reject either if \(R\) is too large or too small. For example, suppose \(r_{*}\) and \(r^{*}\) were such that the probability, under \(H_{0}\), of obtaining a value less (greater) than or equal to \(r_{*}\) (\(r^{*}\)) is \(0.05\). Since the probability of either event occurring is, under \(H_{0}\), \(0.1\), it follows that if the outcome is \(r_{*}\) (or \(r^{*}\)) the \(p\)-value is \(0.1\). The hypothesis test resulting from the above \(p\)-value, that is, the test that calls for rejection of the null hypothesis when the \(p\)-value is sufficiently small, is called the {\bf two-sample rank sum test}. Other names that have also been used to designate this test are the Wilcoxon two-sample test and the Mann-Whitney two-sample test.

Example. Suppose that direct observation of a system over \(5\) days has yielded that a certain quantity has taken on the successive values \(342,448,504,361,453\) whereas a \(10\)-day simulation of a mathematical model proposed for the system has resulted in the following values \(186,220,225,456,276,199,371,426,242,311\). Because the five data values from the first set have ranks \(8,12,15,9,13\), it follows that the value of the test quantity is \(R=57\). \(\sharp\)

We can explicitly compute the \(p\)-value given in (\ref{sch5eq12}) when \(n\) and \(m\) are not too large and all the data values are distinct. To do so, let

\[P_{n,m}(r)=\mathbb{P}_{H_{0}}\{R\leq r\}\]

Hence \(P_{n,m}(r)\) is the probability that from two identically distributed data sets of sizes \(n\) and \(m\), the sum of the ranks of the data values from the first set is less than or equal to \(r\). We can obtain a recursive equation for these probabilities by conditioning on whether the largest data value comes from the first or the second set. If the largest value is indeed contained in the first data set, the sum of the ranks of this set equals \(n+m\) (the rank of the largest value) plus the sum of the ranks of the other \(n-1\) values from this set when considered along with the \(m\) values from the other set. Hence, when the largest is contained in the first data set, the sum of the ranks of that set is less than or equal to \(r\) if the sum of the ranks of the remaining \(n-1\) elements is less than or equal to \(r-n-m\), and this is true with probability \(P_{n-1,m}(r-n-m)\). By a similar argument we can show that if the largest value is contained in the second set, the sum of the ranks of the first set is less than or equal to \(r\) with probability \(P_{n,m-1}(r)\). Finally, since the largest value is equally likely to be any of the \(n+m\) values, it follows that it is a member of the first set with probability \(n/(n+m)\). Putting this together yields the following recursive equation

\begin{equation}{\label{sch5eq14}}\tag{9}
P_{n,m}(r)=\frac{n}{n+m}P_{n-1,m}(r-n-m)+\frac{m}{n+m}P_{n,m-1}(r)
\end{equation}

Starting with the boundary conditions

\[P_{1,0}(k)=\left\{\begin{array}{ll} 0 & k<1\\ 1 & k\geq 1\end{array}\right .
\mbox{ and }P_{0,1}(k)=\left\{\begin{array}{ll} 0 & k<0\\ 1 & k\geq 0
\end{array}\right .\]

equation (\ref{sch5eq14}) can be recursively solved to obtain

\[P_{n,m}(r)=\mathbb{P}_{H_{0}}\{R\leq r\}\mbox{ and }P_{n,m}(r-1)=1-\mathbb{P}_{H_{0}}\{R\geq r\}.\]

\begin{equation}{\label{sch5ex6}}\tag{10}\mbox{}\end{equation}

Example \ref{sch5ex6}. Five days of observation of a system yielded the following values of a certain quantity of interest: \(132,104,162,171,129\). A \(10\)-day simulation of a proposed model of this system yielded the values \(107,94,136,99,114,122,108,130,106,88\). Suppose the formulated model implies that these daily values should be independent and have a common distribution. To determine the \(p\)-value that results from the above data, note first that \(R\), the sum of the ranks of the first sample, is

\[R=12+4+14+15+10=55\]

Since \(55\) exceeds \(\mathbb{E}_{H_{0}}[R]=40\), the \(p\)-value equals \(2\,\mathbb{P}_{H_{0}}\{R\geq 55\}\), which the recursive formula gives as \(0.0752579\). \(\sharp\)
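The rank sum in this example can be checked mechanically. Since all \(15\) values are distinct, the rank of a value is simply its \(1\)-based position in the sorted pooled sample (variable names are ours):

```python
first = [132, 104, 162, 171, 129]
second = [107, 94, 136, 99, 114, 122, 108, 130, 106, 88]

pooled = sorted(first + second)
# all values are distinct, so list.index gives the 0-based sorted position
R = sum(pooled.index(x) + 1 for x in first)
print(R)  # 55
```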

The difficulty with employing the recursion (\ref{sch5eq14}) to compute the \(p\)-value is that the amount of computation needed grows enormously as the sample sizes increase. For example, if \(n=m=20\), even if we choose the test quantity to be the smaller sum of ranks, then since the sum of all the ranks is \(1+2+\cdots +40=820\), the test statistic could have a value as large as \(410\). Hence, as many as \(20\times 20\times 410=164000\) values of \(P_{n,m}(r)\) might have to be computed to determine the \(p\)-value. Thus, for large samples, the use of the recursion provided by (\ref{sch5eq14}) may not be viable. Two different approximation methods that can be used in such cases are a classical approach, based on approximating the distribution of \(R\) via the central limit theorem, and a simulation-based approach.

To use the classical approach for approximating the \(p\)-value we make use of the fact that under \(H_{0}\) all possible orderings of the \(n+m\) values are equally likely. Using this fact it is easy to show that

\[\mathbb{E}_{H_{0}}[R]=\frac{n(n+m+1)}{2}\mbox{ and }\mbox{Var}_{H_{0}}(R)=\frac{nm(n+m+1)}{12}.\]
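As a sketch of the mean computation: under \(H_{0}\) the rank of each first-sample value is, marginally, uniformly distributed on \(\{1,\cdots ,n+m\}\), so by linearity of expectation

\[\mathbb{E}_{H_{0}}[R]=\sum_{j=1}^{n}\mathbb{E}[R_{(j)}]=n\cdot\frac{1+(n+m)}{2}=\frac{n(n+m+1)}{2},\]

where \(R_{(j)}\) denotes the rank of the \(j\)th first-sample value (our notation). The variance formula follows similarly, using the covariances of ranks sampled without replacement.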

Now, it can be shown that, under \(H_{0}\), when \(n\) and \(m\) are large, \(R\) is approximately normally distributed. Hence, when \(H_{0}\) is true,

\[\frac{R-n(n+m+1)/2}{\sqrt{nm(n+m+1)/12}}\mbox{ is approximately }N(0,1).\]

Since, for a normal random variable \(W\), the minimum of \(\mathbb{P}\{W\leq r\}\) and \(\mathbb{P}\{W\geq r\}\) is the former when \(r\leq\mathbb{E}(W)\), and the latter otherwise, it follows that when \(n\) and \(m\) are not too small (both being greater than \(7\) should suffice), we can approximate the \(p\)-value of the test result \(R=r\) by

\begin{equation}{\label{sch5eq15}}\tag{11}
p\mbox{-value}\approx\left\{\begin{array}{ll}
2\mathbb{P}\{Z<r^{*}\} & \mbox{if \({\displaystyle r\leq\frac{n(n+m+1)}{2}}\)}\\
2\mathbb{P}\{Z>r^{*}\} & \mbox{otherwise}
\end{array}\right .
\end{equation}

where

\[r^{*}=\frac{r-\frac{n(n+m+1)}{2}}{\sqrt{\frac{nm(n+m+1)}{12}}}\]

and \(Z\) is \(N(0,1)\).

Example. Let us see how well the classical approximation works for the data of Example \ref{sch5ex6}. In this case, since \(n=5\) and \(m=10\), we have

\begin{align*} p\mbox{-value} & =2\mathbb{P}_{H_{0}}\{R\geq 55\}\\ & \approx 2\mathbb{P}\left\{Z\geq\frac{55-40}
{\sqrt{50\times 16/12}}\right\}\\ & =2\mathbb{P}\{Z\geq 1.8371\}=0.066\end{align*}

which should be compared with the exact answer \(0.075\). \(\sharp\)
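The approximation (\ref{sch5eq15}) is straightforward to code. The following Python sketch (the function name is ours) obtains the standard normal CDF from the error function via \(\Phi(z)=\frac{1}{2}(1+\operatorname{erf}(z/\sqrt{2}))\):

```python
from math import erf, sqrt

def rank_sum_pvalue_normal(n, m, r):
    """Two-sided p-value of the rank-sum test via the normal approximation (11)."""
    mean = n * (n + m + 1) / 2
    sd = sqrt(n * m * (n + m + 1) / 12)
    z = (r - mean) / sd
    phi = 0.5 * (1 + erf(z / sqrt(2)))  # standard normal CDF at z
    return 2 * phi if r <= mean else 2 * (1 - phi)
```

For \(n=5\), \(m=10\), \(r=55\) this returns approximately \(0.066\), matching the example above.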

The \(p\)-value of the two-sample rank test can also be approximated by simulation. To see how this is accomplished, recall that if the observed value of the test quantity \(R\) is \(R=r\), then the \(p\)-value is given by

\[p\mbox{-value}=2\min\{\mathbb{P}_{H_{0}}\{R\geq r\},\mathbb{P}_{H_{0}}\{R\leq r\}\}\]

Now, under \(H_{0}\), provided that all the \(n+m\) data values are distinct, all orderings among these data values are equally likely, and thus the ranks of the first data set of size \(n\) have the same distribution as a random selection of \(n\) of the values \(1,2,\cdots ,n+m\). Thus, under \(H_{0}\), the probability distribution of \(R\) can be approximated by repeatedly simulating a random subset of \(n\) of the integers \(1,2,\cdots ,n+m\) and determining the sum of the elements in the subset. The value of \(\mathbb{P}_{H_{0}}\{R\leq r\}\) can be approximated by the proportion of simulations that result in a sum less than or equal to \(r\), and the value of \(\mathbb{P}_{H_{0}}\{R\geq r\}\) by the proportion of simulations that result in a sum greater than or equal to \(r\). Using the data in Example \ref{sch5ex6} with \(1000\) runs, the approximated \(p\)-value is \(0.07\).
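This simulation procedure can be sketched as follows (the function name, seed, and default run count are ours):

```python
import random

def rank_sum_pvalue_sim(n, m, r, nruns=10000, seed=0):
    """Approximate the two-sided p-value by repeatedly sampling a random
    n-subset of the ranks 1, ..., n+m and summing it."""
    rng = random.Random(seed)
    le = ge = 0
    for _ in range(nruns):
        s = sum(rng.sample(range(1, n + m + 1), n))
        le += s <= r   # proportion of sums <= r estimates P{R <= r}
        ge += s >= r   # proportion of sums >= r estimates P{R >= r}
    return 2 * min(le, ge) / nruns
```

With the data of Example \ref{sch5ex6} and a large number of runs, the estimate settles near the exact value \(0.075\).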

The above analysis supposes that the \(n+m\) data values are all distinct. When some of the values are equal, one should take as the rank of a data value the average of the ranks of all values equal to it. For example, if the first data set is \(2,3,4\) and the second is \(3,5,7\), then the sum of the ranks of the first set is \(1+2.5+4=7.5\). In the presence of ties, the \(p\)-value should be approximated by using the normal approximation via (\ref{sch5eq15}).
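The averaging rule for ties can be sketched as follows, using the small example above (the helper name is ours):

```python
def midranks(data):
    """Rank each value; tied values receive the average of the ranks they occupy."""
    pooled = sorted(data)
    # equal values occupy consecutive positions i0+1, ..., i0+count in the
    # sorted list, whose average is (2*i0 + count + 1) / 2
    return [(2 * pooled.index(v) + pooled.count(v) + 1) / 2 for v in data]

# first set 2, 3, 4 pooled with second set 3, 5, 7:
ranks = midranks([2, 3, 4, 3, 5, 7])
# the ranks of the first set sum to 1 + 2.5 + 4 = 7.5
```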

A generalization of the two-sample problem is the multisample problem, where one has the following \(m\) data sets

\[\begin{array}{cccc}
X_{11}, & X_{12}, & \cdots , & X_{1n_{1}}\\
X_{21}, & X_{22}, & \cdots , & X_{2n_{2}}\\
\vdots & \vdots & \vdots & \vdots\\
X_{m1}, & X_{m2}, & \cdots , &X_{mn_{m}}
\end{array}\]

and we are interested in testing the null hypothesis \(H_{0}\) that all the \(n=\sum_{i=1}^{m} n_{i}\) random variables are independent and have a common distribution. A generalization of the two-sample rank test, called the multisample rank test and often referred to as the Kruskal-Wallis test, is obtained by first ranking all the \(n\) data values. Then let \(R_{i}\), for \(i=1,\cdots ,m\), denote the sum of the ranks of the \(n_{i}\) data values from the \(i\)th set. Since, under \(H_{0}\), all orderings are equally likely, provided all the data values are distinct, it follows exactly as before that \(\mathbb{E}[R_{i}]=\frac{n_{i}(n+1)}{2}\). Using the above, the multisample rank sum test is based on the test quantity

\[R=\frac{12}{n(n+1)}\sum_{i=1}^{m}\frac{(R_{i}-n_{i}(n+1)/2)^{2}}{n_{i}}.\]

Since small values of \(R\) indicate a good fit to \(H_{0}\), the test based on the quantity \(R\) rejects \(H_{0}\) for sufficiently large values of \(R\). Indeed, if the observed value of \(R\) is \(R=r\), the \(p\)-value of this result is given by

\[p\mbox{-value}=\mathbb{P}_{H_{0}}\{R\geq r\}.\]

This value can be approximated by using the result that for large values of \(n_{1},\cdots ,n_{m}\), \(R\) has approximately a chi-square distribution with \(m-1\) degrees of freedom (this latter result being the reason why we include the term \(12/(n(n+1))\) in the definition of \(R\)). Hence, if \(R=r\),

\[p\mbox{-value}\approx\mathbb{P}\{\chi^{2}_{m-1}\geq r\}.\]

Simulation can also be used to evaluate the \(p\)-value. Even when the data values are not all distinct, the above approximation for the \(p\)-value should be used. In computing the value of \(R\) the rank of an individual data value should be, as before, the average of all the ranks of the data equal to it.
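A sketch of computing the Kruskal-Wallis test quantity, here for \(m=3\) samples, in which case the chi-square tail with \(2\) degrees of freedom reduces to \(e^{-r/2}\) (all names are ours and the data are purely illustrative):

```python
from math import exp

def kruskal_wallis_stat(samples):
    """Kruskal-Wallis test quantity R, assuming all pooled values are distinct."""
    n = sum(len(s) for s in samples)
    pooled = sorted(v for s in samples for v in s)
    rank = {v: i + 1 for i, v in enumerate(pooled)}
    total = 0.0
    for s in samples:
        Ri = sum(rank[v] for v in s)                   # rank sum of this sample
        total += (Ri - len(s) * (n + 1) / 2) ** 2 / len(s)
    return 12 / (n * (n + 1)) * total

r = kruskal_wallis_stat([[10, 20], [30, 40], [50, 60]])
p_value = exp(-r / 2)   # chi-square tail probability with m - 1 = 2 degrees of freedom
```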

 
