Variance Reduction Techniques


We discuss four techniques: the use of antithetic variables, the use of common random numbers, the use of control variates, and variance reduction by conditioning.

In a typical scenario for a simulation study, one is interested in determining \(\theta\), a parameter connected with some stochastic model. To estimate \(\theta\), the model is simulated to obtain, among other things, the output datum \(X\) which is such that \(\theta =\mathbb{E}[X]\). Repeated simulation runs, the \(i\)th one yielding the output variable \(X_{i}\), are performed. The simulation study is then terminated when \(n\) runs have been performed and the estimate of \(\theta\) is given by \(\bar{X}= \sum_{i=1}^{n} X_{i}/n\). Because this results in an unbiased estimate of \(\theta\), it follows that its mean square error is equal to its variance. That is,

\begin{align*} \mbox{MSE} & =\mathbb{E}[(\bar{X}-\theta )^{2}]\\ & =\mbox{Var}(\bar{X})=\frac{\mbox{Var}(X)}{n}.\end{align*}

Hence, if we can obtain a different unbiased estimate of \(\theta\) having a smaller variance than does \(\bar{X}\), we would obtain an improved estimator.

\begin{equation}{\label{a}}\tag{A}\mbox{}\end{equation}

The Use of Antithetic Variables.

Suppose that we are interested in using simulation to estimate \(\theta =\mathbb{E}[X]\), and suppose that we have generated \(X_{1}\) and \(X_{2}\) that are identically distributed random variables having mean \(\theta\). Then, we have

\[\mbox{Var}\left (\frac{X_{1}+X_{2}}{2}\right )=\frac{1}{4}(\mbox{Var}(X_{1})+\mbox{Var}(X_{2})+2\mbox{Cov}(X_{1},X_{2})).\]

Therefore, it would be advantageous, in the sense that the variance would be reduced, if \(X_{1}\) and \(X_{2}\) rather than being independent were negatively correlated. To see how we might arrange for \(X_{1}\) and \(X_{2}\) to be negatively correlated, suppose that \(X_{1}\) is a function of \(n\) \(U(0,1)\) random variables; that is, suppose that

\[X_{1}=h(U_{1},U_{2},\cdots ,U_{n}),\]

where \(U_{1},U_{2},\cdots ,U_{n}\) are \(n\) independent \(U(0,1)\) random variables. Now if \(U\) is \(U(0,1)\), then so is \(1-U\). Hence the random variable

\[X_{2}=h(1-U_{1},1-U_{2},\cdots ,1-U_{n})\]

has the same distribution as \(X_{1}\). In addition, since \(1-U\) is clearly negatively correlated with \(U\), we might hope that \(X_{2}\) is negatively correlated with \(X_{1}\). The following proposition and its corollary make this precise.

\begin{equation}{\label{sch4p1}}\tag{1}\mbox{}\end{equation}

Proposition \ref{sch4p1}. If \(X_{1},X_{2},\cdots ,X_{n}\) are independent, then for any increasing functions \(f\) and \(g\) of \(n\) variables

\[\mathbb{E}[f({\bf X})g({\bf X})]\geq \mathbb{E}[f({\bf X})]\mathbb{E}[g({\bf X})]\]

where \({\bf X}=(X_{1},X_{2},\cdots ,X_{n})\). In other words, we have

\[\mbox{Cov}(f({\bf X}),g({\bf X}))\geq 0.\]

Corollary. If \(h(x_{1},x_{2},\cdots ,x_{n})\) is a monotone function of each of its arguments, then, for a set \(U_{1},U_{2},\cdots ,U_{n}\) of independent \(U(0,1)\) random variables, we have

\[\mbox{Cov}(h(U_{1},U_{2},\cdots ,U_{n}),h(1-U_{1},1-U_{2},\cdots ,1-U_{n}))\leq 0.\]

In this case, after we have generated \(u_{1},u_{2},\cdots ,u_{n}\) so as to compute \(x_{1}\), rather than generating a new independent set of \(n\) random numbers, we do better by just using the set \(1-u_{1},1-u_{2},\cdots ,1-u_{n}\) to compute \(x_{2}\). In addition, it should be noted that we obtain a double benefit: not only does our resulting estimator have smaller variance (at least when \(h\) is a monotone function), but we are also saved the time of generating a second set of random numbers.
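To illustrate the procedure, the following Python sketch implements the antithetic pairing for a hypothetical monotone function \(h\) (the function and parameter choices are ours, purely for illustration). For a linear \(h\), such as the sum of the coordinates used below, the two runs cancel exactly and the pair average is constant.

```python
import random

def antithetic_estimate(h, n_dims, n_pairs, seed=0):
    """Estimate E[h(U_1,...,U_n)] by averaging the antithetic pairs
    (h(u) + h(1-u))/2, reusing each set of random numbers."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_pairs):
        u = [rng.random() for _ in range(n_dims)]
        x1 = h(u)                        # first run uses u_1,...,u_n
        x2 = h([1.0 - ui for ui in u])   # second run reuses them as 1 - u_i
        total += (x1 + x2) / 2.0
    return total / n_pairs

# hypothetical monotone h: the sum of the coordinates, with known mean n_dims/2
est = antithetic_estimate(lambda u: sum(u), n_dims=3, n_pairs=1000)
```

The same routine applies to any monotone \(h\); for a nonlinear \(h\) the variance is reduced rather than eliminated.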

\begin{equation}{\label{sch4ex1}}\tag{2}\mbox{}\end{equation}

Example \ref{sch4ex1}. (Simulating the Reliability Function). Consider a system of \(n\) components, each of which is either functioning or failed. Let

\[s_{i}=\left\{\begin{array}{ll}
1 & \mbox{if component \(i\) works}\\
0 & \mbox{otherwise}
\end{array}\right .\]

We call \(s=(s_{1},s_{2},\cdots ,s_{n})\) the state vector. Suppose also that there is a nondecreasing function \(\phi (s_{1},s_{2},\cdots ,s_{n})\) satisfying

\[\phi (s_{1},s_{2},\cdots ,s_{n})=\left\{\begin{array}{ll}
1 & \mbox{if the system works under state vector \((s_{1},\cdots ,s_{n})\)}\\
0 & \mbox{otherwise}
\end{array}\right .\]

The function \(\phi (s_{1},s_{2},\cdots ,s_{n})\) is called the structure function. Some common structure functions are presented below.

(i) The series structure. For the series structure

\[\phi (s_{1},s_{2},\cdots ,s_{n})=\min_{i} s_{i}.\]

The series system works only if all its components function.

(ii) The parallel structure. For the parallel structure

\[\phi (s_{1},s_{2},\cdots ,s_{n})=\max_{i} s_{i}\]

Hence, the parallel system works if at least one of its components works.

(iii) The \(k\)-of-\(n\) system. The structure function

\[\phi (s_{1},s_{2},\cdots ,s_{n})=\left\{\begin{array}{ll}
1 & \mbox{if \(\sum_{i=1}^{n} s_{i}\geq k\)}\\
0 & \mbox{otherwise}
\end{array}\right .\]

is called a \(k\)-of-\(n\) structure function. Since \(\sum_{i=1}^{n} s_{i}\) represents the number of functioning components, a \(k\)-of-\(n\) system works if at least \(k\) of the \(n\) components are working. It should be noted that a series system is an \(n\)-of-\(n\) system, whereas a parallel system is a \(1\)-of-\(n\) system.

(iv) The bridge structure. A five-component system for which

\[\phi (s_{1},s_{2},s_{3},s_{4},s_{5})=\max\{s_{1}s_{3}s_{5},s_{2}s_{3}s_{4},s_{1}s_{4},s_{2}s_{5}\}\]
is said to have a bridge structure.

Suppose now that the states of the components, called \(S_{i}\) for \(i=1,2,\cdots ,n\), are independent random variables satisfying

\[\mathbb{P}\{S_{i}=1\}=p_{i}=1-\mathbb{P}\{S_{i}=0\}\]

for \(i=1,2,\cdots ,n\). Let

\[r(p_{1},\cdots ,p_{n})=\mathbb{P}\{\phi (S_{1},\cdots ,S_{n})=1\}=\mathbb{E}[\phi (S_{1},\cdots ,S_{n})].\]

The function \(r(p_{1},\cdots ,p_{n})\) is called the reliability function. It represents the probability that the system will work when the components are independent with component \(i\) functioning with probability \(p_{i}\) for \(i=1,2,\cdots ,n\). For a series system

\begin{align*} r(p_{1},\cdots ,p_{n}) & =\mathbb{P}\{S_{i}=1\mbox{ for all }i\}\\ & =\prod_{i=1}^{n}\mathbb{P}\{S_{i}=1\}=\prod_{i=1}^{n} p_{i}\end{align*}

and, for a parallel system

\begin{align*}
r(p_{1},\cdots p_{n}) & =\mathbb{P}\{S_{i}=1\mbox{ for at least one }i\}\\
& =1-\mathbb{P}\{S_{i}=0\mbox{ for all }i\}\\
& =1-\prod_{i=1}^{n} \mathbb{P}\{S_{i}=0\}=1-\prod_{i=1}^{n} (1-p_{i})
\end{align*}
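The two closed-form reliability functions just derived can be mirrored directly in code (a small Python sketch; the function names are ours):

```python
def series_reliability(ps):
    """r(p_1,...,p_n) for a series system: the product of the p_i."""
    r = 1.0
    for p in ps:
        r *= p
    return r

def parallel_reliability(ps):
    """r(p_1,...,p_n) for a parallel system: 1 - prod(1 - p_i)."""
    prod = 1.0
    for p in ps:
        prod *= (1.0 - p)
    return 1.0 - prod

ps = [0.9, 0.8, 0.7]
r_series = series_reliability(ps)      # 0.9 * 0.8 * 0.7
r_parallel = parallel_reliability(ps)  # 1 - 0.1 * 0.2 * 0.3
```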

However, for most systems it remains a formidable problem to compute the reliability function (even for such small systems as a \(5\)-of-\(10\) system or the bridge system it can be quite tedious to compute). Suppose that, for a given nondecreasing structure function \(\phi\) and given probabilities \(p_{1},p_{2},\cdots ,p_{n}\), we are interested in using simulation to estimate

\[r(p_{1},\cdots ,p_{n})=\mathbb{E}[\phi (S_{1},\cdots ,S_{n})]\]

Now, we can simulate the \(S_{i}\) by generating uniform random numbers \(u_{1},\cdots ,u_{n}\) and then setting

\[S_{i}=\left\{\begin{array}{ll}
1 & \mbox{if \(u_{i}<p_{i}\)}\\
0 & \mbox{otherwise}
\end{array}\right .\]

Therefore, we have

\[\phi (S_{1},\cdots ,S_{n})=h(u_{1},\cdots ,u_{n})\]

where \(h\) is a decreasing function of each of \(u_{1},\cdots ,u_{n}\). Hence, by the corollary above, we also have

\[\mbox{Cov}(h({\bf U}), h({\bf 1}-{\bf U}))\leq 0.\]

The antithetic variable approach of using \(U_{1},\cdots ,U_{n}\) to generate both \(h({\bf U})\) and \(h({\bf 1}-{\bf U})\) results in a smaller variance than if an independent set of random numbers was used to generate the second value of \(h\). \(\sharp\)
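As a concrete sketch of the antithetic approach for this example, the following Python code estimates the reliability of the bridge structure from (iv), with all \(p_{i}=0.9\) (the parameter values and names are ours). Conditioning on component \(3\) gives the exact value \(p[1-(1-p)^{2}]^{2}+(1-p)[1-(1-p^{2})^{2}]\approx 0.9785\), against which the estimate can be checked.

```python
import random

def phi_bridge(s):
    """Bridge structure function phi(s_1,...,s_5)."""
    s1, s2, s3, s4, s5 = s
    return max(s1 * s3 * s5, s2 * s3 * s4, s1 * s4, s2 * s5)

def reliability_antithetic(p, n_pairs=20000, seed=1):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_pairs):
        u = [rng.random() for _ in p]
        # h(u) and h(1-u): component i works when its random number is < p_i
        x1 = phi_bridge([1 if ui < pi else 0 for ui, pi in zip(u, p)])
        x2 = phi_bridge([1 if 1 - ui < pi else 0 for ui, pi in zip(u, p)])
        total += (x1 + x2) / 2.0
    return total / n_pairs

r_hat = reliability_antithetic([0.9] * 5)
```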

Oftentimes, the relevant output of a simulation is a function of the input random variables \(Y_{1},Y_{2},\cdots ,Y_{n}\). That is, the relevant output is \(X=h(Y_{1},\cdots ,Y_{n})\). Suppose that \(Y_{i}\) has distribution \(F_{i}\) for \(i=1,2,\cdots, n\). If these input variables are generated by the inverse transform technique, we can write

\[x=h(F_{1}^{-1}(u_{1}),\cdots ,F_{n}^{-1}(u_{n}))\]

where \(u_{1},\cdots ,u_{n}\) are independent random numbers. Since a distribution function is increasing, its inverse is also increasing; thus if \(h(y_{1},\cdots ,y_{n})\) is a monotone function of its coordinates, then \(h(F_{1}^{-1}(u_{1}),\cdots , F_{n}^{-1}(u_{n}))\) is a monotone function of the \(u_{i}\). Therefore, the method of antithetic variables, which would first generate \(u_{1},\cdots ,u_{n}\) to compute \(x_{1}\) and then use \(1-u_{1},\cdots ,1-u_{n}\) to compute \(x_{2}\), would result in an estimator having a smaller variance than would have been obtained if a new set of random numbers were used for \(X_{2}\).

Example. (Simulating a Queuing System). Consider a given queuing system, let \(D_{i}\) denote the delay in queue of the \(i\)th arriving customer, and suppose we are interested in simulating the system so as to estimate \(\theta =\mathbb{E}[X]\), where

\[X=D_{1}+\cdots +D_{n}\]

is the sum of the delays in queue of the first \(n\) arrivals. Let \(I_{1},I_{2},\cdots ,I_{n}\) denote the first \(n\) inter-arrival times (i.e. \(I_{j}\) is the time between the arrivals of customers \(j-1\) and \(j\)), and let \(S_{1},\cdots ,S_{n}\) denote the first \(n\) service times of this system, and suppose that these random variables are all independent. Now, in many systems \(X\) is a function of the \(2n\) random variables \(I_{1},\cdots ,I_{n},S_{1},\cdots ,S_{n}\), say

\[X=h(I_{1},\cdots ,I_{n},S_{1},\cdots ,S_{n}).\]

Also, as the delay in queue of a given customer usually increases (depending of course on the specifics of the model) as the service times of the other customers increase and usually decreases as the times between arrivals increase, it follows that, for many models, \(h\) is a monotone function of its coordinates. Hence, if the inverse transform method is used to generate the random variables \(I_{1},\cdots ,I_{n},S_{1},\cdots ,S_{n}\), then the antithetic variable approach results in a smaller variance. That is, if we initially use the \(2n\) random numbers \(u_{i}\), \(i=1,\cdots ,2n\), to generate the inter-arrival and service times by setting \(I_{i}=F_{i}^{-1} (u_{i})\), \(S_{i}=G_{i}^{-1}(u_{n+i})\), where \(F_{i}\) and \(G_{i}\) are, respectively, the distribution functions of \(I_{i}\) and \(S_{i}\), then the second simulation run should be done in the same fashion, but using the random numbers \(1-u_{i}\) for \(i=1,\cdots ,2n\). This results in a smaller variance than if a new set of \(2n\) random numbers were generated for the second run. \(\sharp\)
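A minimal sketch of this example, assuming a single-server FIFO queue with exponential inter-arrival and service times, so that the delays satisfy Lindley's recursion \(D_{i}=\max (0,D_{i-1}+S_{i-1}-I_{i})\); the rates and sample sizes are arbitrary choices of ours:

```python
import math
import random

def total_delay(us, lam=1.0, mu=1.5, n=10):
    """h(u_1,...,u_2n): total delay of the first n customers, with
    inter-arrival and service times produced by inverse transform."""
    inter = [-math.log(1.0 - u) / lam for u in us[:n]]  # I_i = F_i^{-1}(u_i)
    serv = [-math.log(1.0 - u) / mu for u in us[n:]]    # S_i = G_i^{-1}(u_{n+i})
    d = 0.0       # delay of the current customer (D_1 = 0)
    total = 0.0
    for i in range(n):
        if i > 0:
            # Lindley's recursion: D_i = max(0, D_{i-1} + S_{i-1} - I_i)
            d = max(0.0, d + serv[i - 1] - inter[i])
        total += d
    return total

rng = random.Random(2)
estimates = []
for _ in range(5000):
    us = [rng.random() for _ in range(20)]
    x1 = total_delay(us)
    x2 = total_delay([1.0 - u for u in us])  # antithetic run, same numbers
    estimates.append((x1 + x2) / 2.0)
theta_hat = sum(estimates) / len(estimates)
```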

\begin{equation}{\label{sch4ex2}}\tag{3}\mbox{}\end{equation}

Example \ref{sch4ex2}. Suppose we were interested in using simulation to estimate

\[\theta =\mathbb{E}[e^{U}]=\int_{0}^{1} e^{x}dx\]

We know that \(\theta =e-1\). Since the function \(h(u)=e^{u}\) is clearly a monotone function, the antithetic variable approach leads to a variance reduction, whose value we now determine. To begin, note that

\begin{align*} \mbox{Cov}(e^{U},e^{1-U}) & =\mathbb{E}[e^{U}e^{1-U}]-\mathbb{E}[e^{U}]\mathbb{E}[e^{1-U}]\\ & =e-(e-1)^{2}=-0.2342.\end{align*}

Also, since

\begin{align*} \mbox{Var}(e^{U}) & =\mathbb{E}[e^{2U}]-(E[e^{U}])^{2}\\ & =\frac{e^{2}-1}{2}-(e-1)^{2}=0.2420,\end{align*}

we see that the use of independent random numbers results in a variance of

\[\mbox{Var}\left (\frac{e^{U_{1}}+e^{U_{2}}}{2}\right )=\frac{\mbox{Var}(e^{U})}{2}=0.1210\]

whereas the use of the antithetic variables \(U\) and \(1-U\) gives a variance of

\[\mbox{Var}\left (\frac{e^{U}+e^{1-U}}{2}\right )=\frac{\mbox{Var}(e^{U})}{2}+\frac{\mbox{Cov}(e^{U},e^{1-U})}{2}=0.0039\]

a variance reduction of \(96.7\) percent. \(\sharp\)
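The numbers quoted in this example follow from the displayed moments and can be checked directly:

```python
import math

e = math.e
var_eU = (e**2 - 1) / 2 - (e - 1)**2   # Var(e^U)
cov = e - (e - 1)**2                   # Cov(e^U, e^(1-U))
var_indep = var_eU / 2                 # average of two independent runs
var_anti = var_eU / 2 + cov / 2        # average of the antithetic pair
reduction = 1 - var_anti / var_indep   # fraction of variance removed
```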

In the case of a normal random variable having mean \(\mu\) and variance \(\sigma^{2}\), we can use the antithetic variable approach by first generating such a random variable \(Y\) and then taking as the antithetic variable \(2\mu -Y\), which is also normal with mean \(\mu\) and variance \(\sigma^{2}\) and is clearly negatively correlated with \(Y\). If we were using simulation to compute \(\mathbb{E}[h(Y_{1},\cdots ,Y_{n})]\), where the \(Y_{i}\) are independent normal random variables and \(h\) is a monotone function of its coordinates, then the antithetic approach of first generating the \(n\) normals \(Y_{1},\cdots ,Y_{n}\) to compute \(h(Y_{1},\cdots ,Y_{n})\) and then using the antithetic variables \(2\mu_{i}-Y_{i}\), \(i=1,2,\cdots ,n\), to compute the next simulated value of \(h\) would lead to a reduction in variance as compared with generating a second set of \(n\) normal random variables.

\begin{equation}{\label{b}}\tag{B}\mbox{}\end{equation}

Using Common Random Numbers.

Suppose that a set of \(n\) jobs is to be processed on a pair of identical machines. Let \(T_{i}\) denote the processing time for job \(i\), \(i=1,\cdots ,n\). We are interested in comparing the time it takes to complete the processing of all the jobs under two different policies for deciding the order in which to process jobs. Whenever a machine becomes free, the first policy, called longest job first, always chooses the remaining job having the longest processing time, whereas the second policy, called shortest job first, always selects the one having the shortest processing time. For example, if \(n=3\) and \(T_{1}=2\), \(T_{2}=5\), and \(T_{3}=3\), then longest job first would complete processing at time \(5\), whereas shortest job first would not get done until time \(7\). We would like to use simulation to compare the expected difference in the completion times under these two policies when the times to process the jobs, \(T_{1},\cdots ,T_{n}\), are random variables having a given distribution \(F\). In other words, if \(g(t_{1},\cdots ,t_{n})\) is the time it takes to process the \(n\) jobs having processing times \(t_{1},\cdots ,t_{n}\) when we use the longest job first policy and \(h(t_{1},\cdots ,t_{n})\) is the time when we use the shortest job first policy, then we are interested in using simulation to estimate \(\theta =\theta_{1}-\theta_{2}\), where \(\theta_{1}=\mathbb{E}[g({\bf T})]\), \(\theta_{2}=\mathbb{E}[h({\bf T})]\) and \({\bf T}=(T_{1},\cdots ,T_{n})\). If we now generate the vector \({\bf T}\) to compute \(g({\bf T})\), the question arises whether we should use the same generated values to compute \(h({\bf T})\) or whether it is more efficient to generate an independent set to estimate \(\theta_{2}\). To answer this question, suppose that we used \({\bf T}^{*}=(T_{1}^{*},\cdots ,T_{n}^{*})\), having the same distribution as \({\bf T}\), to estimate \(\theta_{2}\).
Then the variance of the estimator \(g({\bf T})-h({\bf T}^{*})\) of \(\theta\) is

\begin{align}
\mbox{Var}(g({\bf T})-h({\bf T}^{*})) & =\mbox{Var}(g({\bf T}))+\mbox{Var}(h({\bf T}^{*}))-2\mbox{Cov}(g({\bf T}),h({\bf T}^{*}))\nonumber\\
& =\mbox{Var}(g({\bf T}))+\mbox{Var}(h({\bf T}))-2\mbox{Cov}(g({\bf T}),h({\bf T}^{*}))\label{sch4eq11}\tag{4}
\end{align}

If \(g({\bf T})\) and \(h({\bf T})\) are positively correlated, then the variance of the estimator of \(\theta\) is smaller if we use the same set of generated random values \({\bf T}\) to compute both \(g({\bf T})\) and \(h({\bf T})\) than it would be if we used an independent set \({\bf T}^{*}\) to compute \(h({\bf T}^{*})\) (in this latter case the covariance in (\ref{sch4eq11}) would be \(0\)). Since both \(g\) and \(h\) are increasing functions of their arguments, it follows from Proposition \ref{sch4p1} that \(g({\bf T})\) and \(h({\bf T})\) are positively correlated, and hence it is more efficient to compare the policies by always using the same set of generated job times for both.

As a general rule of thumb, when comparing different operating policies in a randomly determined environment, once an environmental state has been simulated one should then evaluate all the policies for this environment. That is, if the environment is determined by the vector \({\bf T}\) and \(g_{i}({\bf T})\) is the return from policy \(i\) under the environmental state \({\bf T}\), then after simulating the value of the random vector \({\bf T}\) one should then evaluate, for that value of \({\bf T}\), all the returns \(g_{i}({\bf T})\).
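A sketch of the two-machine comparison using common random numbers, assuming list scheduling and exponentially distributed job times (the modeling choices and parameters are ours):

```python
import random

def makespan(times, longest_first):
    """Completion time of all jobs on two identical machines when, each
    time a machine frees up, the chosen policy picks the next job."""
    jobs = sorted(times, reverse=longest_first)
    m = [0.0, 0.0]                 # time at which each machine becomes free
    for t in jobs:
        i = 0 if m[0] <= m[1] else 1
        m[i] += t
    return max(m)

rng = random.Random(3)
diffs = []
for _ in range(4000):
    T = [rng.expovariate(1.0) for _ in range(10)]         # common job times
    diffs.append(makespan(T, True) - makespan(T, False))  # g(T) - h(T)
theta_hat = sum(diffs) / len(diffs)
```

On the three-job instance from the text, `makespan([2, 5, 3], True)` is \(5\) and `makespan([2, 5, 3], False)` is \(7\), matching the example.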

\begin{equation}{\label{c}}\tag{C}\mbox{}\end{equation}

The Use of Control Variates.

Suppose that we want to use simulation to estimate \(\theta =\mathbb{E}[X]\), where \(X\) is the output of a simulation. Now suppose that for some other output variable \(Y\), the expected value of \(Y\) is known, say \(\mathbb{E}[Y]=\mu_{Y}\). Then, for any constant \(c\), the quantity

\begin{equation}{\label {sch4eq1}}\tag{5}
X+c(Y-\mu_{Y})
\end{equation}

is also an unbiased estimator of \(\theta\). To determine the best value of \(c\), we note that

\[\mbox{Var}(X+c(Y-\mu_{Y}))=\mbox{Var}(X+cY)=\mbox{Var}(X)+c^{2}\mbox{Var}(Y)+2c\cdot\mbox{Cov}(X,Y).\]

Simple calculus now shows that the above is minimized when \(c=c^{*}\), where

\[c^{*}=-\frac{\mbox{Cov}(X,Y)}{\mbox{Var}(Y)}\]

and for this value the variance of the estimator is

\begin{equation}{\label{sch4eq2}}\tag{6}
\mbox{Var}(X+c^{*}(Y-\mu_{Y}))=\mbox{Var}(X)-\frac{\mbox{Cov}^{2}(X,Y)}{\mbox{Var}(Y)}.
\end{equation}

The quantity \(Y\) is called a control variate for the simulation estimator \(X\). To see why it works, note that \(c^{*}\) is negative (positive) when \(X\) and \(Y\) are positively (negatively) correlated. Suppose that \(X\) and \(Y\) are positively correlated, meaning, roughly, that \(X\) is large when \(Y\) is large and vice versa. Hence, if a simulation run results in a large (small) value of \(Y\), as indicated by \(Y\) being larger (smaller) than its known mean \(\mu_{Y}\), then it is probably true that \(X\) is also larger (smaller) than its mean \(\theta\), and so we would like to correct for this by lowering (raising) the value of the estimator; this is exactly what happens since \(c^{*}\) is negative (see equation (\ref{sch4eq1})). A similar argument holds when \(X\) and \(Y\) are negatively correlated. Upon dividing equation (\ref{sch4eq2}) by \(\mbox{Var}(X)\), we obtain

\[\frac{\mbox{Var}(X+c^{*}(Y-\mu_{Y}))}{\mbox{Var}(X)}=1-\rho^{2}(X,Y),\]

where

\[\rho (X,Y)=\frac{\mbox{Cov}(X,Y)}{\sqrt{\mbox{Var}(X)\mbox{Var}(Y)}}\]

is the correlation of \(X\) and \(Y\). Hence the variance reduction obtained in using the control variate \(Y\) is \(100\rho^{2}(X,Y)\) percent. The quantities \(\mbox{Cov}(X,Y)\) and \(\mbox{Var}(Y)\) are usually not known in advance and must be estimated from the simulated data. If \(n\) simulation runs are performed, and the output data \(X_{i},Y_{i}\), \(i=1,\cdots ,n\), result, then using the estimators

\[\widehat{\mbox{Cov}}(X,Y)=\sum_{i=1}^{n} (X_{i}-\bar{X})(Y_{i}-\bar{Y})/(n-1)\]

and

\[\widehat{\mbox{Var}}(Y)=\sum_{i=1}^{n} (Y_{i}-\bar{Y})^{2}/(n-1),\]

we can approximate \(c^{*}\) by \(\widehat{c^{*}}\), where

\[\widehat{c^{*}}=-\frac{\sum_{i=1}^{n} (X_{i}-\bar{X})(Y_{i}-\bar{Y})}{\sum_{i=1}^{n} (Y_{i}-\bar{Y})^{2}}.\]

The variance of the control estimator

\[\mbox{Var}(\bar{X}+c^{*}(\bar{Y}-\mu_{Y}))=\frac{1}{n}\left (\mbox{Var}(X)-\frac{\mbox{Cov}^{2}(X,Y)}{\mbox{Var}(Y)}\right )\]

can then be estimated by using the estimator of \(\mbox{Cov}(X,Y)\) along with the sample variance estimators of \(\mbox{Var}(X)\) and \(\mbox{Var}(Y)\).
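The whole procedure can be sketched on a toy problem of our own choosing: estimating \(\theta =\mathbb{E}[U^{2}]=1/3\) with the random number \(U\) itself (known mean \(1/2\)) as the control variate.

```python
import random

def control_variate_estimate(xs, ys, mu_y):
    """Controlled estimate xbar + c*(ybar - mu_y), with c estimated
    from the simulated data as in the formulas above."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    cov = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)
    var_y = sum((y - ybar) ** 2 for y in ys) / (n - 1)
    c_hat = -cov / var_y
    return xbar + c_hat * (ybar - mu_y)

rng = random.Random(4)
ys = [rng.random() for _ in range(10000)]  # Y = U, with known mean 1/2
xs = [y * y for y in ys]                   # X = U^2, so theta = 1/3
est = control_variate_estimate(xs, ys, 0.5)
```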

Another way of doing the computation is to make use of a standard computer package for simple linear regression models. If we consider the simple linear regression model

\[X=a+bY+\epsilon\]

where \(\epsilon\) is a random variable with mean \(0\) and variance \(\sigma^{2}\), then \(\widehat{a}\) and \(\widehat{b}\), the least squares estimators of \(a\) and \(b\) based on the data \(X_{i},Y_{i}\), \(i=1,\cdots ,n\), are

\[\widehat{b}=\frac{\sum_{i=1}^{n} (X_{i}-\bar{X})(Y_{i}-\bar{Y})}{\sum_{i=1}^{n} (Y_{i}-\bar{Y})^{2}}\]

and

\[\widehat{a}=\bar{X}-\widehat{b}\bar{Y}.\]

Therefore, we have \(\widehat{b}=-\widehat{c^{*}}\). In addition, since

\begin{align*} \bar{X}+\widehat{c^{*}}(\bar{Y}-\mu_{Y}) & =\bar{X}-\widehat{b}(\bar{Y}-\mu_{Y})\\ & =\widehat{a}+\widehat{b}\mu_{Y}\end{align*}

it follows that the control variate estimate is the evaluation of the estimated regression line at the value \(Y=\mu_{Y}\). Also, because \(\widehat{\sigma^{2}}\), the regression estimate of \(\sigma^{2}\), is the estimate of

\[\mbox{Var}(X-\widehat{b}Y)=\mbox{Var}(X+\widehat{c^{*}}Y),\]

it follows that the estimated variance of the control variate estimator

\[\bar{X}+\widehat{c^{*}}(\bar{Y}-\mu_{Y})\]

is \(\widehat{\sigma^{2}}/n\).

Example. As in Example \ref{sch4ex1}, suppose that we want to use simulation to estimate the reliability function

\[r(p_{1},\cdots ,p_{n})=\mathbb{E}[\phi (S_{1},\cdots ,S_{n})],\]

where

\[S_{i}=\left\{\begin{array}{ll}
1 & \mbox{if \(u_{i}<p_{i}\)}\\
0 & \mbox{otherwise}
\end{array}\right .\]

Since \(\mathbb{E}(S_{i})=p_{i}\), it follows

\[\mathbb{E}\left [\sum_{i=1}^{n} S_{i}\right ]=\sum_{i=1}^{n} p_{i}\]

Therefore, we can use the number of working components, \(Y=\sum_{i=1}^{n} S_{i}\), as a control variate of the estimator \(X=\phi (S_{1},\cdots ,S_{n})\). Since \(\sum_{i=1}^{n} S_{i}\) and \(\phi (S_{1},\cdots ,S_{n})\) are both increasing functions of \(S_{i}\), they are positively correlated. Thus, the sign of \(c^{*}\) is negative. \(\sharp\)

Example. Consider a queuing system in which customers arrive in accordance with a non-homogeneous Poisson process with intensity function \(\lambda (t)\). Suppose that the service times are independent random variables having distribution \(G\) and are also independent of the arrival times. Suppose that we are interested in estimating the total time spent in the system by all customers arriving before time \(t\). That is, letting \(W_{i}\) denote the amount of time that the \(i\)th entering customer spends in the system, we are interested in \(\theta =\mathbb{E}[X]\), where \(X=\sum_{i=1}^{N(t)}W_{i}\) and \(N(t)\) is the number of arrivals by time \(t\). A natural quantity to use as a control in this situation is the total of the service times of all these customers. That is, let \(S_{i}\) denote the service time of the \(i\)th customer and set \(Y=\sum_{i=1}^{N(t)} S_{i}\). Since the service times are independent of \(N(t)\), it follows that \(\mathbb{E}(Y)=\mathbb{E}(S)\mathbb{E}(N(t))\), where \(\mathbb{E}(S)\), the mean service time, and \(\mathbb{E}(N(t))\), the mean number of arrivals by time \(t\), are both known quantities. \(\sharp\)

Example. As in Example \ref{sch4ex2}, suppose that we are interested in using simulation to compute \(\theta =\mathbb{E}[e^{U}]\). Here, a natural variate to use as a control is the random number \(U\). To see what sort of improvement over the raw estimator is possible, we note that

\begin{align*} \mbox{Cov}(e^{U},U) & =\mathbb{E}(Ue^{U})-\mathbb{E}(U)\mathbb{E}(e^{U})\\ & =\int_{0}^{1}xe^{x}dx-\frac{(e-1)}{2}=0.14086.\end{align*}

Since \(\mbox{Var}(U)=1/12\),  using (\ref{sch4eq2}), we have

\begin{align*} \mbox{Var}\left (e^{U}+c^{*}\left (U-\frac{1}{2}\right )\right ) & =\mbox{Var}(e^{U})-12\cdot 0.14086^{2}\\ & =0.2420-0.2381=0.0039\end{align*}

where the above used, from Example \ref{sch4ex2}, that \(\mbox{Var}(e^{U})=0.2420\). Hence, in this case, the use of the control variate \(U\) can lead to a variance reduction of up to \(98.4\) percent. \(\sharp\)
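These figures can be reproduced from the closed-form moments (using \(\int_{0}^{1}xe^{x}dx=1\)):

```python
import math

e = math.e
cov_xu = 1.0 - (e - 1) / 2            # Cov(e^U, U) = 1 - E[U] E[e^U]
var_u = 1.0 / 12                      # Var(U)
var_x = (e**2 - 1) / 2 - (e - 1)**2   # Var(e^U)
var_controlled = var_x - cov_xu**2 / var_u   # equation (6) with Y = U
reduction = 1 - var_controlled / var_x
```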

One can use more than a single variable as a control. For example, if a simulation results in output variables \(Y_{i}\) for \(i=1,\cdots ,k\), and \(\mathbb{E}(Y_{i})=\mu_{i}\) is known, then for any constants \(c_{i}\) for \(i=1,\cdots ,k\), we may use

\[X+\sum_{i=1}^{k} c_{i}(Y_{i}-\mu_{i})\]

as an unbiased estimator of \(\mathbb{E}[X]\). When multiple control variates are used, the computations can be performed by using a computer program for the multiple linear regression model

\[X=a+\sum_{i=1}^{k} b_{i}Y_{i}+\epsilon\]

where \(\epsilon\) is a random variable with mean \(0\) and variance \(\sigma^{2}\). Let \(\widehat{c^{*}_{i}}\) be the estimate of the best \(c_{i}\) for \(i=1,\cdots ,k\). Then, we have

\[\widehat{c_{i}^{*}}=-\widehat{b_{i}}\]

for \(i=1,\cdots ,k\), where \(\widehat{b_{i}}\) for \(i=1,\cdots ,k\) are the least squares regression estimates of \(b_{i}\) for \(i=1,\cdots ,k\). The value of the controlled estimate can be obtained from

\[\bar{X}+\sum_{i=1}^{k} \widehat{c_{i}^{*}}(\bar{Y}_{i}-\mu_{i})=\widehat{a}+\sum_{i=1}^{k} \widehat{b_{i}}\mu_{i}.\]

That is, the controlled estimate is just the estimated multiple regression line evaluated at the point \((\mu_{1},\cdots ,\mu_{k})\). The variance of the controlled estimate can be obtained by dividing the regression estimate of \(\sigma^{2}\) by the number of simulation runs.

\begin{equation}{\label{d}}\tag{D}\mbox{}\end{equation}

Variance Reduction by Conditioning.

Recall the conditional variance formula

\[\mbox{Var}(X)=\mathbb{E}[\mbox{Var}(X|Y)]+\mbox{Var}(\mathbb{E}[X|Y]).\]

Since both terms on the right are nonnegative, we have \(\mbox{Var}(X)\geq \mbox{Var}(\mathbb{E}[X|Y])\). Suppose we are interested in performing a simulation study so as to ascertain the value of \(\theta =\mathbb{E}[X]\), where \(X\) is an output variable of a simulation run. Let \(Y\) be a second variable such that \(\mathbb{E}[X|Y]\) is known and takes on a value that can be determined from the simulation run. Since \(\mathbb{E}[\mathbb{E}[X|Y]]=\mathbb{E}[X]=\theta\), it follows that \(\mathbb{E}[X|Y]\) is also an unbiased estimator of \(\theta\). Therefore, as an estimator of \(\theta\), the conditional expectation \(\mathbb{E}[X|Y]\) is superior to the raw estimator \(X\). One might consider further improvements by using an estimator of the type \(\alpha X+(1-\alpha )\mathbb{E}[X|Y]\). We shall show that the best such \(\alpha\) is zero. First of all, suppose that the values of \(X\) and \(W\) are both determined by the simulation, and that \(\mathbb{E}[X]=\mathbb{E}[W]=\theta\). Then, we may consider any unbiased estimator of the form \(\alpha X+(1-\alpha )W\). The best such estimator, which is obtained by choosing \(\alpha\) to minimize the variance, is given by letting \(\alpha =\alpha^{*}\), where

\[\alpha^{*}=\frac{\mbox{Var}(W)-\mbox{Cov}(X,W)}{\mbox{Var}(X)+\mbox{Var}(W)-2\mbox{Cov}(X,W)}.\]

Therefore, for the estimator \(\alpha X+(1-\alpha )\mathbb{E}[X|Y]\), the best estimator of this type has

\[\alpha =\alpha^{*}=\frac{\mbox{Var}(\mathbb{E}[X|Y])-\mbox{Cov}(X,\mathbb{E}[X|Y])}{\mbox{Var}(X)+\mbox{Var}(\mathbb{E}[X|Y])-2\mbox{Cov}(X,\mathbb{E}[X|Y])}\]

We now show that \(\alpha^{*}=0\), which means that combining the estimators \(X\) and \(\mathbb{E}[X|Y]\) does not improve on just using \(\mathbb{E}[X|Y]\). We note that

\begin{equation}{\label{sch4eq3}}\tag{7}
\mbox{Var}(\mathbb{E}[X|Y])=\mathbb{E}[\mathbb{E}^{2}[X|Y]]-(\mathbb{E}[\mathbb{E}[X|Y]])^{2}=\mathbb{E}[\mathbb{E}^{2}[X|Y]]-(\mathbb{E}[X])^{2}.
\end{equation}

On the other hand, we have

\begin{align*}
\mbox{Cov}(X,\mathbb{E}[X|Y]) & =\mathbb{E}[X\mathbb{E}[X|Y]]-\mathbb{E}[X]\mathbb{E}[\mathbb{E}[X|Y]]\\
& =\mathbb{E}[X\mathbb{E}[X|Y]]-(\mathbb{E}[X])^{2}\\
& =\mathbb{E}[\mathbb{E}[X\mathbb{E}[X|Y]|Y]]-(\mathbb{E}[X])^{2}\mbox{ (conditioning on \(Y\))}\\
& =\mathbb{E}[\mathbb{E}[X|Y]\mathbb{E}[X|Y]]-(\mathbb{E}[X])^{2}\mbox{ (since given \(Y\), \(\mathbb{E}[X|Y]\) is a constant)}\\
& =\mbox{Var}(\mathbb{E}[X|Y])\mbox{ (by (\ref{sch4eq3}))}
\end{align*}

Since \(\mbox{Cov}(X,\mathbb{E}[X|Y])=\mbox{Var}(\mathbb{E}[X|Y])\), the numerator in the expression for \(\alpha^{*}\) is zero; hence \(\alpha^{*}=0\), and we see that no additional variance reduction is possible by combining the estimators \(X\) and \(\mathbb{E}[X|Y]\).

\begin{equation}{\label{sch1ex5}}\tag{8}\mbox{}\end{equation}

Example \ref{sch1ex5}. (The Estimation of \(\pi\)) Suppose that the random vector \((X,Y)\) is uniformly distributed in the square of area \(4\) centered at the origin, where \(-1\leq X\leq 1\) and \(-1\leq Y\leq 1\). Let us consider the probability that this random point in the square is contained within the inscribed circle of radius \(1\). Note that since \((X,Y)\) is uniformly distributed in the square it follows that

\begin{align*} \mathbb{P}\{(X,Y)\mbox{ is in the circle}\} & =\mathbb{P}\{X^{2}+Y^{2}\leq 1\}\\ & =
\frac{\mbox{Area of the circle}}{\mbox{Area of the square}}=\frac{\pi}{4}.\end{align*}

Therefore, if we generate a large number of random points in the square, the proportion of points that fall within the circle will be approximately \(\pi /4\). Note that if \(X\) and \(Y\) are independent and both are uniformly distributed over \((-1,1)\), their joint density would be

\begin{align*} f(x,y) & =f_{X}(x)f_{Y}(y)\\ & =\frac{1}{2}\cdot\frac{1}{2}=\frac{1}{4}\end{align*}

for \(-1\leq x\leq 1\) and \(-1\leq y\leq 1\). Since the density function of \((X,Y)\) is constant in the square, it follows that \((X,Y)\) is uniformly distributed in the square. Now if \(U\) is uniformly distributed on \((0,1)\) then \(2U\) is uniformly distributed on \((0,2)\), and so \(2U-1\) is uniformly distributed on \((-1,1)\). Therefore, if we generate random numbers \(u_{1}\) and \(u_{2}\), set \(x=2u_{1}-1\) and \(y=2u_{2}-1\), and define

\[I=\left\{\begin{array}{ll}
1 & \mbox{if \(X^{2}+Y^{2}=(2U_{1}-1)^{2}+(2U_{2}-1)^{2}\leq 1\)}\\
0 & \mbox{otherwise}
\end{array}\right .\]

then

\[\mathbb{E}[I]=\mathbb{P}\{X^{2}+Y^{2}\leq 1\}=\frac{\pi}{4}.\]

Therefore, we can estimate \(\pi /4\) by generating a large number of pairs of random numbers \(u_{1}\), \(u_{2}\) and estimating \(\pi /4\) by the fraction
of pairs for which

\[(2u_{1}-1)^{2}+(2u_{2}-1)^{2}\leq 1\]

(The unbiased estimator of \(\mathbb{E}[I]\) is \(\bar{I}=\frac{1}{n}\sum_{i=1}^{n} I_{i}\).) \(\sharp\)
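The hit-or-miss scheme reads directly as code (the sample size is an arbitrary choice of ours):

```python
import random

rng = random.Random(6)
n = 100000
inside = 0
for _ in range(n):
    x = 2 * rng.random() - 1   # X uniform on (-1, 1)
    y = 2 * rng.random() - 1   # Y uniform on (-1, 1)
    if x * x + y * y <= 1:
        inside += 1            # the point fell inside the inscribed circle
pi_hat = 4 * inside / n        # 4 times the estimate of pi/4
```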

\begin{equation}{\label{sch4ex11}}\tag{9}\mbox{}\end{equation}

Example \ref{sch4ex11}. Let us reconsider our use of simulation to estimate \(\pi\). In Example \ref{sch1ex5}, we showed how we can estimate \(\pi\) by determining how often a randomly chosen point in the square of area \(4\) centered at the origin falls within the inscribed circle of radius \(1\). Let \(V_{i}=2U_{i}-1\), where \(U_{i}\), \(i=1,2\), are random numbers, and set

\[I=\left\{\begin{array}{ll}
1 & \mbox{if \(V_{1}^{2}+V_{2}^{2}\leq 1\)}\\
0 & \mbox{otherwise}
\end{array}\right .\]

Then \(\mathbb{E}[I]=\pi /4\). The use of the average of successive values of \(I\) to estimate \(\pi /4\) can be improved upon by using \(\mathbb{E}[I|V_{1}]\) rather than \(I\). Now, we have

\begin{align*}
\mathbb{E}[I|V_{1}=v] & =\mathbb{P}\{V_{1}^{2}+V_{2}^{2}\leq 1|V_{1}=v\}\\ & =\mathbb{P}\{v^{2}+V_{2}^{2}\leq 1|V_{1}=v\}\\
& =\mathbb{P}\{V_{2}^{2}\leq 1-v^{2}\}\mbox{ (by the independence of \(V_{1}\) and \(V_{2}\))}\\
& =\mathbb{P}\{-\sqrt{1-v^{2}}\leq V_{2}\leq\sqrt{1-v^{2}}\}\\
& =\int_{-\sqrt{1-v^{2}}}^{\sqrt{1-v^{2}}}\frac{1}{2}dx\mbox{ (since \(V_{2}\) is uniform over \((-1,1)\))}\\
& =\sqrt{1-v^{2}}
\end{align*}

Therefore, we obtain \(\mathbb{E}[I|V_{1}]=\sqrt{1-V_{1}^{2}}\). The estimator \(\sqrt{1-V_{1}^{2}}\) also has mean \(\pi /4\) and has a smaller variance than \(I\). Since

\begin{align*}
\mathbb{E}\left [\sqrt{1-V_{1}^{2}}\right ] & =\int_{-1}^{1} \sqrt{1-x^{2}}\cdot\frac{1}{2}dx\\
& =\int_{0}^{1} \sqrt{1-x^{2}}dx=\mathbb{E}\left [\sqrt{1-U^{2}}\right ]
\end{align*}

we can simplify somewhat by using the estimator \(\sqrt{1-U^{2}}\), where \(U\) is a random number. The improvement in variance obtained by using the estimator \(\sqrt{1-U^{2}}\) over the estimator \(I\) is easily determined

\begin{align*} \mbox{Var}\left [\sqrt{1-U^{2}}\right ] & =\mathbb{E}[1-U^{2}]-\left (\frac{\pi}{4}\right )^{2}\\ & =\frac{2}{3}-\left (\frac{\pi}{4}\right )^{2}=0.0498.\end{align*}

On the other hand, because \(I\) is a Bernoulli random variable having mean \(\pi /4\), we have

\[\mbox{Var}(I)=\left (\frac{\pi}{4}\right )\left (1-\frac{\pi}{4}\right )=0.1686.\]

Therefore, conditioning results in a \(70.44\) percent reduction in variance. Since the function \(\sqrt{1-u^{2}}\) is clearly a monotone decreasing function of \(u\) in the region \(0<u<1\), it follows that the estimator \(\sqrt{1-U^{2}}\) can be improved upon by using antithetic variables. That is, the estimator

\[\frac{1}{2}\left (\sqrt{1-U^{2}}+\sqrt{1-(1-U)^{2}}\right )\]

has smaller variance than

\[\frac{1}{2}\left (\sqrt{1-U_{1}^{2}}+\sqrt{1-U_{2}^{2}}\right ).\]
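These variance comparisons can be checked numerically with a short Monte Carlo sketch; the sample size, seed, and helper names below are illustrative assumptions:

```python
import random

# Compare three unbiased estimators of pi/4: the raw indicator I, the
# conditional estimator sqrt(1 - U^2), and its antithetic pairing.
# (Assumed parameters: n = 100_000 samples, seed 1.)
random.seed(1)
n = 100_000

raw, cond, anti = [], [], []
for _ in range(n):
    v1, v2 = random.uniform(-1, 1), random.uniform(-1, 1)
    raw.append(1.0 if v1 * v1 + v2 * v2 <= 1 else 0.0)
    u = random.random()
    cond.append((1 - u * u) ** 0.5)
    # antithetic pair: average the estimator at u and at 1 - u
    anti.append(0.5 * ((1 - u * u) ** 0.5 + (1 - (1 - u) ** 2) ** 0.5))

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

print(var(raw), var(cond), var(anti))  # roughly 0.169, 0.050, smaller still
```

The sample variances should come out near the theoretical values \(0.1686\) and \(0.0498\), with the antithetic estimator smaller yet.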

Another way of improving the estimator \(\sqrt{1-U^{2}}\) is by using a control variate. A natural control variable in this case is \(U^{2}\). Since \(\mathbb{E}[U^{2}]=1/3\), we could use an estimator of the type

\[\sqrt{1-U^{2}}+c\left (U^{2}-\frac{1}{3}\right ).\]

The best \(c\), namely,

\[c^{*}=-\mbox{Cov}\left (\sqrt{1-U^{2}},U^{2}\right )/\mbox{Var}(U^{2}),\]

can be estimated by using the simulation to estimate the covariance term. \(\sharp\)
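A minimal sketch of this control-variate scheme follows; the sample size and seed are assumptions, and for simplicity \(c^{*}\) is estimated from the same samples rather than from a separate pilot run:

```python
import random

# Control variate U^2 (known mean 1/3) applied to the estimator
# sqrt(1 - U^2) of pi/4. (Assumed: n = 100_000 samples, seed 2.)
random.seed(2)
n = 100_000
us = [random.random() for _ in range(n)]
f = [(1 - u * u) ** 0.5 for u in us]
ctrl = [u * u for u in us]

mf = sum(f) / n
mc = sum(ctrl) / n
cov_fc = sum((x - mf) * (y - mc) for x, y in zip(f, ctrl)) / (n - 1)
var_c = sum((y - mc) ** 2 for y in ctrl) / (n - 1)
c_star = -cov_fc / var_c   # theoretical value (pi/48)/(4/45) ~ 0.736

controlled = [x + c_star * (y - 1.0 / 3.0) for x, y in zip(f, ctrl)]
print(sum(controlled) / n)  # still estimates pi/4, with reduced variance
```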

Example. (The Reliability Function Revisited). Suppose that \(S_{j}\) for \(j=1,\cdots ,n\) are independent with

\[\mathbb{P}\{S_{j}=1\}=1-\mathbb{P}\{S_{j}=0\}=p_{j}\]

We are interested in estimating the reliability function \(\mathbb{E}[\phi (S_{1},\cdots ,S_{n})]\), where \(\phi\) is an increasing binary function, i.e. its only possible values are \(0\) and \(1\). If we generate the values of \(S_{1},\cdots ,S_{n}\), then an improvement over the raw estimator \(\phi (S_{1},\cdots ,S_{n})\) is to take its conditional expectation given all of the \(S_{j}\) except one. That is, for fixed \(i\),

\[\mathbb{E}[\phi (S_{1},\cdots ,S_{n})|S_{1},\cdots ,S_{i-1},S_{i+1},\cdots ,S_{n}]\]

will be an improved estimator. To use the above estimator we would, on any given simulation run, simulate all the component states \(S_{j}\) except for \(S_{i}\). The estimate for that simulation run will take on one of three possible values: it will equal \(1\) if, given the states \(S_{j}\), \(j\neq i\), the system would function even if component \(i\) did not function; it will equal \(0\) if the system would not function even if component \(i\) did function; or it will equal \(p_{i}\) if the system would function when component \(i\) functions and would fail otherwise. The above estimator can also be improved upon either by using antithetic variables, since the estimator is still a monotone function, or by controlling on \(\sum_{j\neq i}S_{j}\), the number of the simulated components that work. \(\sharp\)

Example. We want to estimate the expected sum of the times in the system of the first \(n\) customers in a queuing system. That is, if \(W_{i}\) is the time that the \(i\)th customer spends in the system, we are interested in estimating \(\theta =\mathbb{E} [\sum_{i=1}^{n} W_{i}]\). Let \(S_{i}\) denote the “state of the system” at the moment that the \(i\)th customer arrives, and consider the estimator \(\sum_{i=1}^{n}\mathbb{E}[W_{i}|S_{i}]\). Since

\[\mathbb{E}\left [\sum_{i=1}^{n} \mathbb{E}[W_{i}|S_{i}]\right ]=\sum_{i=1}^{n} \mathbb{E}[\mathbb{E}[W_{i}|S_{i}]]=\sum_{i=1}^{n}\mathbb{E}[W_{i}]=\theta,\]

it follows that this is an unbiased estimator of \(\theta\). It can be shown that, in a wide class of models, this estimator has a smaller variance than the raw simulation estimator \(\sum_{i=1}^{n} W_{i}\). (It should be noted that whereas it is immediate that \(\mathbb{E}[W_{i}|S_{i}]\) has smaller variance than \(W_{i}\), this does not imply, because of the covariance term, that \(\sum_{i=1}^{n} \mathbb{E}[W_{i}|S_{i}]\) has smaller variance than \(\sum_{i=1}^{n} W_{i}\).) The quantity \(S_{i}\), which refers to the state of the system as seen by the \(i\)th customer upon its arrival, is supposed to represent the least amount of information that enables us to compute the conditional expected time that the customer spends in the system. For example, if there is a single server and the service times are all exponential with mean \(\mu\), then \(S_{i}\) would refer to \(N_{i}\), the number of customers in the system encountered by the \(i\)th arrival. In this case, we have

\[\mathbb{E}[W_{i}|S_{i}]=\mathbb{E}[W_{i}|N_{i}]=(N_{i}+1)\mu\]

which follows since the \(i\)th arrival must wait for \(N_{i}\) service times, each having mean \(\mu\) (one of these is the remaining service time of the customer being served when customer \(i\) arrives, but, by the memoryless property of the exponential, that remaining time is also exponential with mean \(\mu\)), and must then add its own service time. Thus, the estimator that takes the average value, over all simulation runs, of the quantity \(\sum_{i=1}^{n} (N_{i}+1)\mu\) is a better estimator than the average value of \(\sum_{i=1}^{n} W_{i}\). \(\sharp\)
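The comparison can be sketched for a single-server FIFO queue with exponential services. The arrival rate, mean service time, number of customers, run count, and seed below are illustrative assumptions, not values from the text:

```python
import random

# Both estimators are unbiased for theta = E[sum_i W_i]; the conditional
# one replaces each W_i by (N_i + 1) * (mean service time).
# (Assumed: arrival rate 0.5, mean service 1, 10 customers/run, 2000 runs.)
random.seed(3)
LAM, MEAN_SVC, N_CUST, RUNS = 0.5, 1.0, 10, 2000

raw_runs, cond_runs = [], []
for _ in range(RUNS):
    t = 0.0
    departures = []          # departure times of earlier customers
    prev_dep = 0.0
    raw = cond = 0.0
    for i in range(N_CUST):
        t += random.expovariate(LAM)                    # arrival time
        n_in_sys = sum(1 for d in departures if d > t)  # N_i at arrival
        dep = max(t, prev_dep) + random.expovariate(1.0 / MEAN_SVC)
        departures.append(dep)
        prev_dep = dep
        raw += dep - t                                  # W_i, time in system
        cond += (n_in_sys + 1) * MEAN_SVC               # E[W_i | N_i]
    raw_runs.append(raw)
    cond_runs.append(cond)

mean_raw = sum(raw_runs) / RUNS
mean_cond = sum(cond_runs) / RUNS
print(mean_raw, mean_cond)  # both estimate theta
```

With these parameters the two run averages should agree closely, with the conditional estimator showing the smaller sample variance.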

\begin{equation}{\label{e}}\tag{E}\mbox{}\end{equation}

Stratified Sampling.

Consider a finite capacity queuing system which has the property that it empties at the end of each day and then begins anew at the beginning of the next day. Customers arrive daily in accordance with a Poisson process, but suppose that the rate of that process can change each day. Specifically, suppose that for any given day the Poisson process arrival rate, independently of what occurred on prior days, is either \(\lambda_{1}\), \(\lambda_{2}\), or \(\lambda_{3}\), each with probability \(1/3\). We are interested in simulating this system so as to estimate the average number of customers that are lost, because they arrived when the system is at capacity, during a day. To simulate the above we begin each run by generating a random variable \(I\) which is equally likely to be \(1\), \(2\), or \(3\). We then simulate one day of the above system under the assumption that customers arrive in accordance with a Poisson process having rate \(\lambda_{I}\). Intuitively, however, since we know that each day is equally likely to have Poisson arrivals at rate \(\lambda_{i}\), \(i=1,2,3\), it seems that it might be preferable to perform exactly one-third of our simulation runs using each of these three rates. To verify the above intuition, let, for \(i=1,2,3\), \(L_{i}^{*}\) denote the number of lost customers in a day given that the arrivals during that day are Poisson with mean \(\lambda_{i}\). Also, let \(L\) denote the number of lost customers in a day. Then upon conditioning upon what the daily arrival rate is, we see that

\begin{equation}{\label{sch4eq5}}\tag{10}
\mathbb{E}[L]=\frac{1}{3}(\mathbb{E}[L_{1}^{*}]+\mathbb{E}[L_{2}^{*}]+\mathbb{E}[L_{3}^{*}]).
\end{equation}

Let \(L_{i}\), having the same distribution as \(L\), denote the number of lost customers on day \(i\). We now argue that

\begin{equation}{\label{sch4eq6}}\tag{11}
\mbox{Var}(L_{1}^{*}+L_{2}^{*}+L_{3}^{*})\leq\mbox{Var}(L_{1}+L_{2}+L_{3})
\end{equation}

To prove the inequality (\ref{sch4eq6}) we make use of the conditional variance formula

\[\mbox{Var}(X)=\mathbb{E}[\mbox{Var}(X|Y)]+\mbox{Var}(\mathbb{E}[X|Y]).\]

We then see that \(\mbox{Var}(X)\geq\mathbb{E}[\mbox{Var}(X|Y)]\). Hence, with \(L\) representing the number of lost
customers in a day and \(I\) being the index of the Poisson arrival rate for that day, we see that

\begin{align*}
\mbox{Var}(L) & \geq\mathbb{E}[\mbox{Var}(L|I)]=\sum_{i=1}^{3}\mbox{Var}(L|I=i)\cdot\mathbb{P}\{I=i\}\\
& =\frac{1}{3}\sum_{i=1}^{3}\mbox{Var}(L_{i}^{*})=\frac{1}{3}\mbox{Var}(L_{1}^{*}+L_{2}^{*}+L_{3}^{*})
\end{align*}

Since \(\mbox{Var}(L_{1}+L_{2}+L_{3})=3\mbox{Var}(L)\), this establishes (\ref{sch4eq6}). Because, from (\ref{sch4eq5}), \((L_{1}^{*}+L_{2}^{*}+L_{3}^{*})/3\) is an unbiased estimator of \(\mathbb{E}[L]\), it follows from (\ref{sch4eq6}) that using \(\lambda_{i}\) in exactly one-third of the simulation runs (as opposed to choosing the rate at random at the beginning of each run) results in an
unbiased estimator having smaller variance.

\begin{equation}{\label{sch4ex8}}\tag{12}\mbox{}\end{equation}

Example \ref{sch4ex8}. On good days customers arrive at an infinite server queue according to a Poisson process with rate \(12\) per hour, whereas on the other days they arrive according to a Poisson process with rate \(4\) per hour. The service times, on all days, are exponentially distributed with rate \(1\) per hour. Every day at time \(10\) hours the system is shut down and all those presently in service are forced to leave without completing service. Suppose that each day is, independently, a good day with probability \(0.5\), and that we want to use simulation to estimate \(\theta\), the mean number of customers per day that do not have their services completed. Let \(L\) denote the number of customers whose service is not completed on a randomly selected day, and let \(L_{g}\) and \(L_{o}\) denote the number on a good and on an ordinary day respectively. Now, it can be shown that \(L_{g}\) and \(L_{o}\) are both Poisson random variables with respective means \(12(1-e^{-10})\) and \(4(1-e^{-10})\). Since the variance of a Poisson random variable is equal to its mean, we have \(\mbox{Var}(L_{g})\approx 12\) and \(\mbox{Var}(L_{o})\approx 4\). Also, since \(L\) is equal to \(L_{g}\) with probability \(1/2\) and is equal to \(L_{o}\) with probability \(1/2\), it follows that

\[\theta =\mathbb{E}[L]=\frac{1}{2}\mathbb{E}[L_{g}]+\frac{1}{2}\mathbb{E}[L_{o}]\approx 8\]

and

\begin{align*} \mathbb{E}[L^{2}] & =\frac{1}{2}\mathbb{E}[L_{g}^{2}]+\frac{1}{2}\mathbb{E}[L_{o}^{2}]\\ & \approx\frac{12+144+4+16}{2}=88,\end{align*}

where the preceding used the identity \(\mathbb{E}[X^{2}]=\mbox{Var}(X)+(\mathbb{E}[X])^{2}\). Therefore, we obtain \(\mbox{Var}(L)\approx 88-64=24\). If we first simulate \(L_{g}\) and \(L_{o}\), then the variance of the estimator \((L_{g}+L_{o})/2\) is

\[\mbox{Var}\left (\frac{L_{g}+L_{o}}{2}\right )=\frac{\mbox{Var}(L_{g})+\mbox{Var}(L_{o})}{4}\approx 4.\]

On the other hand, if we simulate two iterations of \(L\), then the variance of the estimator \((L_{1}+L_{2})/2\) is

\[\mbox{Var}\left (\frac{L_{1}+L_{2}}{2}\right )\approx\frac{24}{2}=12\]

which is larger than that of the preceding estimator by a factor of approximately \(3\). \(\sharp\)
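This comparison can be sketched by sampling \(L_{g}\) and \(L_{o}\) directly as Poisson random variables; the replication count and seed below are assumptions:

```python
import math, random

# Stratified estimator (L_g + L_o)/2 versus (L_1 + L_2)/2 from two
# randomly typed days. (Assumed: seed 4, 20_000 replications per scheme.)
random.seed(4)

def poisson(mean):
    # count Exp(1) interarrivals falling in [0, mean]
    total, count = 0.0, 0
    while True:
        total += random.expovariate(1.0)
        if total > mean:
            return count
        count += 1

mg = 12 * (1 - math.exp(-10))
mo = 4 * (1 - math.exp(-10))
REPS = 20_000

strat = [(poisson(mg) + poisson(mo)) / 2 for _ in range(REPS)]
rand_days = []
for _ in range(REPS):
    l1 = poisson(mg) if random.random() < 0.5 else poisson(mo)
    l2 = poisson(mg) if random.random() < 0.5 else poisson(mo)
    rand_days.append((l1 + l2) / 2)
```

The sample variances should come out near the theoretical values \(4\) and \(12\), with both sample means near \(\theta\approx 8\).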

In the more general case, suppose that the possible arrival rates on each day are \(\lambda_{i}\), with respective probabilities \(p_{i}\), where \(\sum_{i=1}^{k} p_{i}=1\). In this case, we would have

\[\mathbb{E}[L]=\sum_{i=1}^{k}\mathbb{E}[L_{i}^{*}]p_{i},\]

where \(L_{i}^{*}\) is the number lost on a day in which customers arrive at rate \(\lambda_{i}\). Hence, rather than randomly choosing an arrival rate on each day, a similar argument to the one presented shows that it is better to perform \(100p_{i}\%\) of the runs with the arrival rate \(\lambda_{i}\), \(i=1,\cdots ,k\). If \(\bar{L}_{i}^{*}\) is the average number of lost customers in those runs that use the arrival rate \(\lambda_{i}\), then the estimator of \(\theta =E[L]\) based on \(n\) runs is

\[\widehat{\theta}=\sum_{i=1}^{k} p_{i}\bar{L}_{i}^{*}.\]

The variance of the estimator is given by

\begin{equation}{\label{sch4eq7}}\tag{13}
\mbox{Var}(\widehat{\theta})=\sum_{i=1}^{k}\frac{p_{i}^{2}\mbox{Var}(L_{i}^{*})}{n_{i}}
\end{equation}

where \(n_{i}=np_{i}\) is the number of runs that use the arrival rate \(\lambda_{i}\). If we let \(S_{i}^{2}\) denote the sample variance of the number of lost customers on the days that use the arrival rate \(\lambda_{i}\), then \(\mbox{Var}(\widehat{\theta})\) can be estimated by \(\sum_{i=1}^{k} p_{i}^{2}S_{i}^{2}/n_{i}\). Although performing \(np_{i}\) of the \(n\) runs with the arrival rate \(\lambda_{i}\), \(i=1,\cdots ,k\), is better than randomly choosing the rate at the beginning of each day, this is not necessarily the optimal number of runs to perform using each of these rates. Suppose that we plan to perform \(n\) simulation runs. If we let \(n_{i}\) denote the number of these runs that use the arrival rate \(\lambda_{i}\) for \(i=1,\cdots ,k\), then the estimator of \(\mathbb{E}[L]\) will be \(\sum_{i=1}^{k}p_{i}\bar{L}_{i}^{*}\), with its variance given by (\ref{sch4eq7}). Whereas the variances of \(L_{i}^{*}\) for \(i=1,\cdots ,k\), will be initially unknown, we could perform a small simulation study to estimate them, say obtaining the estimates \(s_{i}^{2}\). We could then choose the \(n_{i}\) by solving the following optimization problem

\[\begin{array}{ll}
\min & {\displaystyle \sum_{i=1}^{k} \frac{p_{i}^{2}s_{i}^{2}}{n_{i}}}\\
\mbox{subject to} & {\displaystyle \sum_{i=1}^{k} n_{i}=n}.
\end{array}\]

Once the \(n_{i}\) are chosen we would estimate \(\mathbb{E}[L]\) by \(\sum_{i=1}^{k} p_{i}\bar{L}_{i}^{*}\) and we would estimate its variance by \(\sum_{i=1}^{k} p_{i}^{2}S_{i}^{2}/n_{i}\), where, as before, \(S_{i}^{2}\) is the sample variance for the \(n_{i}\) runs that use \(\lambda_{i}\). For instance, suppose that \(k=2\) and that we make \(nx\) runs using \(\lambda_{1}\) and \(n(1-x)\) runs using \(\lambda_{2}\). Let \(p=p_{1}\). Then, we want to choose \(x\) satisfying \(0\leq x\leq 1\) to minimize

\[\frac{p^{2}s_{1}^{2}}{nx}+\frac{(1-p)^{2}s_{2}^{2}}{n(1-x)}\]

and this is easily accomplished by using calculus. If, as in Example \ref{sch4ex8}, \(p=1/2\), \(s_{1}^{2}=12\), and \(s_{2}^{2}=4\), then solving the preceding gives the minimizing value \(x\approx 0.634\), thus showing that we should do approximately \(63.4\) percent of the runs using \(\lambda =12\) and \(36.6\) percent using \(\lambda =4\). In general, if we want to estimate \(\mathbb{E}[X]\) in a situation where \(X\) depends on a random variable \(S\) that takes on one of the values \(1,\cdots ,k\) with known probabilities, then the technique of stratifying the simulation runs into \(k\) groups, with the \(i\)th group having \(S=i\), letting \(\bar{X}_{i}\) be the average value of \(X\) in those runs with \(S=i\), and then estimating \(\mathbb{E}[X]=\sum_{i=1}^{k}\mathbb{E}[X|S=i]\mathbb{P}\{S=i\}\) by \(\sum_{i=1}^{k} \bar{X}_{i}\mathbb{P}\{S=i\}\) is called stratified sampling.
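The calculus step can be made explicit: minimizing \(p^{2}s_{1}^{2}/(nx)+(1-p)^{2}s_{2}^{2}/(n(1-x))\), the two-stratum case of (\ref{sch4eq7}), over \(0<x<1\) gives \(x^{*}=ps_{1}/(ps_{1}+(1-p)s_{2})\). A deterministic sketch checks this against the numbers above:

```python
import math

# Optimal allocation fraction for two strata; derived by setting the
# derivative of p^2*s1^2/x + (1-p)^2*s2^2/(1-x) to zero.
def optimal_fraction(p, s1_sq, s2_sq):
    s1, s2 = math.sqrt(s1_sq), math.sqrt(s2_sq)
    return p * s1 / (p * s1 + (1 - p) * s2)

x_star = optimal_fraction(0.5, 12.0, 4.0)
print(round(x_star, 3))  # 0.634
```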

\begin{equation}{\label{sch4ex10}}\tag{14}\mbox{}\end{equation}

Example \ref{sch4ex10}. We want to use \(n\) simulation runs to estimate the integral

\[\theta =\mathbb{E}[h(U)]=\int_{0}^{1} h(x)dx.\]

Let

\[S=j\mbox{ if }\frac{j-1}{n}\leq U<\frac{j}{n}\]

for \(j=1,\cdots ,n\). Then, we have

\begin{equation}{\label{sch4eq9}}\tag{15}
\theta =\sum_{j=1}^{n}\mathbb{E}[h(U)|S=j]\mathbb{P}\{S=j\}=\frac{1}{n}\sum_{j=1}^{n}\mathbb{E}[h(U)|S=j]=\frac{1}{n}\sum_{j=1}^{n} \mathbb{E}[h(U_{(j)})]
\end{equation}

where \(U_{(j)}\) is uniformly distributed on \(((j-1)/n,j/n)\). That is, \((j-1)/n\leq U_{(j)}<j/n\). We see that if \(0\leq U_{j}<1\), then

\[\frac{j-1}{n}\leq\frac{U_{j}+j-1}{n}<\frac{j}{n}.\]

Hence, by the preceding, it follows that rather than generating \(u_{1},\cdots ,u_{n}\) and then using \(\sum_{j=1}^{n} h(u_{j})/n\) to estimate \(\theta\), a better estimator is obtained by using

\[\widehat{\theta}=\frac{1}{n}\sum_{j=1}^{n}h\left (\frac{u_{j}+j-1}{n}\right )\]

from (\ref{sch4eq9}). \(\sharp\)

Example. From Example \ref{sch4ex11}, we have

\[\pi =4\mathbb{E}\left [\sqrt{1-U^{2}}\right ]\]

Using Example \ref{sch4ex10}, we can estimate \(\pi\) by generating \(u_{1},\cdots ,u_{n}\) and using the estimator

\[\widehat{\pi}=\frac{4}{n}\sum_{j=1}^{n}\sqrt{1-\left (\frac{u_{j}+j-1}{n}\right )^{2}}.\]

In fact, we can improve the preceding by making use of antithetic variables to obtain the estimator

\[\widehat{\pi}=\frac{2}{n}\sum_{j=1}^{n}\left (\sqrt{1-\left (\frac{u_{j}+j-1}{n}\right )^{2}}+\sqrt{1-\left (\frac{j-u_{j}}{n}\right )^{2}}\right ).\]

A simulation using the above estimator \(\widehat{\pi}\) yielded the following results

\[\begin{array}{cc}
n & \widehat{\pi}\\
\hline 5 & 3.161211\\
10 & 3.148751\\
100 & 3.141734\\
500 & 3.141615\\
1000 & 3.141601\\
5000 & 3.141593\\
\hline\end{array}\]

When \(n=5000\), the estimator \(\widehat{\pi}\) is correct to six decimal places. \(\sharp\)
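The table above can be reproduced, up to the randomness of the \(u_{j}\), by a short sketch (the seed is an assumption):

```python
import math, random

# Stratified antithetic estimator of pi: for stratum j, (u_j + j - 1)/n
# lies in ((j-1)/n, j/n) and the antithetic partner is (j - u_j)/n.
# (Assumed: seed 6.)
random.seed(6)

def pi_hat(n):
    total = 0.0
    for j in range(1, n + 1):
        u = random.random()
        a = (u + j - 1) / n
        b = (j - u) / n
        total += math.sqrt(1 - a * a) + math.sqrt(1 - b * b)
    return 2 * total / n

print(pi_hat(5000))  # agrees with pi to roughly six decimal places
```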

Suppose again that we are interested in estimating \(\theta =\mathbb{E}[X]\), where \(X\) is dependent on the random variable \(S\), which takes on one of the values \(1,2,\cdots ,k\) with respective probabilities \(p_{i}\) for \(i=1,\cdots ,k\). Then, we have

\[\mathbb{E}[X]=\sum_{i=1}^{k} p_{i}\mathbb{E}[X|S=i].\]

If all of the quantities \(\mathbb{E}[X|S=i]\) are known (that is, if \(\mathbb{E}[X|S]\) is known), but the \(p_{i}\) are not, then we can estimate \(\theta\) by generating the value of \(S\) and then using the conditional expectation estimator \(\mathbb{E}[X|S]\). On the other hand, if it is the \(p_{i}\) that are known and we can generate from the conditional distribution of \(X\) given the value of \(S\), then we can use simulation to obtain estimators \(\widehat{\mathbb{E}}[X|S=i]\) of the quantities \(\mathbb{E}[X|S=i]\) and then use the stratified sampling estimator \(\sum_{i=1}^{k} p_{i}\widehat{\mathbb{E}}[X|S=i]\) to estimate \(\mathbb{E}[X]\). When some of the \(p_{i}\) and some of the \(\mathbb{E}[X|S=i]\) are known, we can use a combination of these approaches.

Importance Sampling.

Let \({\bf X}=(X_{1},\cdots ,X_{n})\) denote a vector of random variables having a joint density function (or joint mass function in the discrete case) \(f({\bf x})=f(x_{1},\cdots ,x_{n})\), and suppose that we are interested in estimating

\[\theta =\mathbb{E}[h({\bf X})]=\int h({\bf x})f({\bf x})d{\bf x}\]

where the preceding is an \(n\)-dimensional integral. (If the \(X_{i}\) are discrete, then interpret the integral as an \(n\)-fold summation.) Suppose that a direct simulation of the random vector \({\bf X}\), so as to compute values of \(h({\bf X})\), is inefficient, possibly because it is difficult to simulate a random vector having density function \(f({\bf x})\), or the variance of \(h({\bf X})\) is large, or a combination of the two.

Another way in which we can use simulation to estimate \(\theta\) is to note that if \(g({\bf x})\) is another probability density satisfying \(f({\bf x})=0\) whenever \(g({\bf x})=0\), then we can express \(\theta\) as

\begin{equation}{\label{sch4eq12}}\tag{16}
\theta =\int\frac{h({\bf x})f({\bf x})}{g({\bf x})}g({\bf x})d{\bf x}=\mathbb{E}_{g}\left [\frac{h({\bf X})f({\bf X})}{g({\bf X})}\right ]
\end{equation}

where we have written \(\mathbb{E}_{g}\) to emphasize that the random vector has joint density \(g({\bf x})\). It follows from (\ref{sch4eq12}) that \(\theta\) can be estimated by successively generating values of a random vector \({\bf X}\) having density function \(g({\bf x})\) and then using as the estimator the average of the values of \(h({\bf X})f({\bf X})/g({\bf X})\). If a density function \(g({\bf x})\) can be chosen such that the random variable \(h({\bf X})f({\bf X})/g({\bf X})\) has a smaller variance than \(h({\bf X})\), then this approach, referred to as importance sampling, can result in an efficient estimator of \(\theta\).

Let us now try to obtain a feel for why importance sampling can be useful. To begin, note that \(f({\bf X})\) and \(g({\bf X})\) represent the respective likelihoods of obtaining the vector \({\bf X}\) when \({\bf X}\) is a random vector with respective densities \(f\) and \(g\). Hence, if \({\bf X}\) is distributed according to \(g\), then it will usually be the case that \(f({\bf X})\) will be small in relation to \(g({\bf X})\), and thus when \({\bf X}\) is simulated according to \(g\) the likelihood ratio \(f({\bf X})/g({\bf X})\) will usually be small in comparison to \(1\). However, it is easy to check that its mean is \(1\):

\begin{align*} \mathbb{E}_{g}\left [\frac{f({\bf X})}{g({\bf X})}\right ] & =\int\frac{f({\bf x})}{g({\bf x})}g({\bf x})d{\bf x}\\ & =\int f({\bf x})d{\bf x}=1.\end{align*}

We see that even though \(f({\bf X})/g({\bf X})\) is usually smaller than \(1\), its mean is equal to \(1\), thus implying that it is occasionally large and so will tend to have a large variance. So how can \(h({\bf X})f({\bf X})/g({\bf X})\) have a small variance? The answer is that we can sometimes arrange to choose a density \(g\) such that those values of \({\bf x}\) for which \(f({\bf x})/g({\bf x})\) is large are precisely the values for which \(h({\bf x})\) is exceedingly small, and thus the ratio \(h({\bf X})f({\bf X})/g({\bf X})\) is always small. Since this will require that \(h({\bf x})\) is sometimes small, importance sampling seems to work best when estimating a small probability, for in this case the function \(h({\bf x})\) is equal to \(1\) when \({\bf x}\) lies in some set and is equal to \(0\) otherwise. We will now consider how to select an appropriate density \(g\). We will find that the so-called tilted densities are useful. Let

\[M(t)=\mathbb{E}_{f}[e^{tX}]=\int e^{tx}f(x)dx\]

be the moment generating function corresponding to a one-dimensional density \(f\).

A density function

\[f_{t}(x)=\frac{e^{tx}f(x)}{M(t)}\]

is called a tilted density of \(f\) for \(-\infty <t<\infty\). A random variable with density \(f_{t}(x)\) tends to be larger than one with density \(f\) when \(t>0\) and tends to be smaller when \(t<0\). In certain cases the tilted densities \(f_{t}\) have the same parametric form as does \(f\).

\begin{equation}{\label{sch4ex14}}\tag{17}\mbox{}\end{equation}

Example \ref{sch4ex14}. If \(f\) is the exponential density with rate \(\lambda\), then \(M(t)=\lambda /(\lambda -t)\) and

\[f_{t}(x)=(\lambda -t)(1/\lambda )e^{tx}\lambda e^{-\lambda x}=(\lambda -t)e^{-(\lambda -t)x}\]

Therefore, for \(t<\lambda\), \(f_{t}\) is an exponential density with rate \(\lambda -t\). If \(f\) is a Bernoulli probability mass function with parameter \(p\), then \(f(x)=p^{x}(1-p)^{1-x}\) for \(x=0,1\). Therefore, we have \(M(t)=\mathbb{E}_{f}[e^{tX}]=pe^{t}+1-p\) and

\[f_{t}(x)=\frac{1}{M(t)}(pe^{t})^{x}(1-p)^{1-x}=\left (\frac{pe^{t}}{pe^{t}+1-p}\right )^{x}\left (\frac{1-p}{pe^{t}+1-p}\right )^{1-x}.\]

That is, \(f_{t}\) is the probability mass function of a Bernoulli random variable with parameter

\[p_{t}=\frac{pe^{t}}{pe^{t}+1-p}.\]

We also can show that if \(f\) is a normal density with parameters \(\mu\) and \(\sigma^{2}\) then \(f_{t}\) is a normal density having mean \(\mu +\sigma^{2}t\) and variance \(\sigma^{2}\). \(\sharp\)
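A small numerical sketch of tilting (the parameter values, sample size, and seed are assumptions): weighting draws from \(f\) by \(e^{tx}\) and self-normalizing reproduces expectations under \(f_{t}\), and the tilted Bernoulli parameter can be computed directly:

```python
import math, random

# Weighting draws from f = Exp(rate 1) by e^{tX} and self-normalizing
# reproduces the mean of the tilted density f_t = Exp(rate 1 - t), which
# is 1/(1 - t) = 1.25. (Assumed: seed 7, 200_000 samples, t = 0.2.)
random.seed(7)
t, n = 0.2, 200_000
xs = [random.expovariate(1.0) for _ in range(n)]
ws = [math.exp(t * x) for x in xs]
tilted_mean = sum(x * w for x, w in zip(xs, ws)) / sum(ws)

# tilted Bernoulli parameter p_t = p e^t / (p e^t + 1 - p); with p = 0.4
# and e^t = 6 (the values used in the Bernoulli illustration below),
# this gives p_t = 0.8
p, et = 0.4, 6.0
p_t = p * et / (p * et + 1 - p)
print(tilted_mean, p_t)
```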

In certain situations, the quantity of interest is the sum of the independent random variables \(X_{1},\cdots ,X_{n}\). In this case the joint density \(f\) is the product of one-dimensional densities. That is,

\[f(x_{1},\cdots ,x_{n})=f_{1}(x_{1})\cdots f_{n}(x_{n})\]

where \(f_{i}(x_{i})\) is the density function of \(X_{i}\). In this situation it is often useful to generate the \(X_{i}\) according to their tilted densities, with a common choice of \(t\) employed.

Example. Let \(X_{1},\cdots ,X_{n}\) be independent random variables having respective probability density (or mass) functions \(f_{i}\) for \(i=1,\cdots ,n\). We are interested in approximating the probability that their sum is at least as large as \(a\), where \(a\) is much larger than the mean of the sum. That is, we are interested in \(\theta =\mathbb{P}\{S\geq a\}\), where \(S=\sum_{i=1}^{n} X_{i}\) and  \(a>\sum_{i=1}^{n}\mathbb{E}[X_{i}]\). Let \(I\{S\geq a\}\) equal \(1\) if \(S\geq a\), and let it be \(0\) otherwise. Then, we have

\[\theta =\mathbb{E}_{{\bf f}}[I\{S\geq a\}]\]

where \({\bf f}=(f_{1},\cdots ,f_{n})\). Suppose that we simulate \(X_{i}\) according to the tilted mass function \(g_{i}=f_{i,t}\) for  \(i=1,\cdots ,n\) with the value of \(t\), \(t>0\), left to be determined. The importance sampling estimator of \(\theta\) would be

\begin{align*} \widehat{\theta} & =I\{S\geq a\}\frac{f({\bf X})}{g({\bf X})}=I\{S\geq a\}\prod_{i=1}^{n}\frac{f_{i}(X_{i})}{g_{i}(X_{i})}\\ & =I\{S\geq a\}\prod_{i=1}^{n}\frac{f_{i}(X_{i})}{f_{i,t}(X_{i})}.\end{align*}

Now, we have

\[\frac{f_{i}(X_{i})}{f_{i,t}(X_{i})}=M_{i}(t)e^{-tX_{i}}\]

which says

\[\widehat{\theta}=I\{S\geq a\}M(t)e^{-tS}\]

where \(M(t)=\prod_{i=1}^{n} M_{i}(t)\) is the moment generating function of \(S\). Since \(t>0\) and \(I\{S\geq a\}\) is equal to \(0\) when \(S<a\), it follows that \(I\{S\geq a\}e^{-tS}\leq e^{-ta}\), which implies

\begin{equation}{\label{sch4eq14}}\tag{18}
\widehat{\theta}\leq M(t)e^{-ta}.
\end{equation}

To make the bound on the estimator as small as possible we thus choose \(t>0\) to minimize \(M(t)e^{-ta}\). In doing so, we will obtain an estimator whose value on each iteration is between \(0\) and \(\min_{t} M(t)e^{-ta}\). It can be shown that the minimizing \(t\), call it \(t^{*}\), is such that

\[\mathbb{E}_{t^{*}}[S]=\mathbb{E}_{t^{*}}\left [\sum_{i=1}^{n} X_{i}\right ]=a\]

where, in the preceding, we mean that the expected value is to be taken under the assumption that the distribution of \(X_{i}\) is \(f_{i,t^{*}}\) for \(i=1,\cdots ,n\). For instance, suppose that \(X_{1},\cdots ,X_{n}\) are independent Bernoulli random variables having respective parameters \(p_{i}\) for \(i=1,\cdots ,n\). Then if we generate the \(X_{i}\) according to their tilted mass function \(f_{i,t}\) for \(i=1,\cdots ,n\), then the importance sampling estimator of \(\theta =\mathbb{P}\{S\geq a\}\) is

\[\widehat{\theta}=I\{S\geq a\}e^{-tS}\prod_{i=1}^{n} (p_{i}e^{t}+1-p_{i})\]

Since \(f_{i,t}\) is the mass function of a Bernoulli random variable with parameter \((p_{i}e^{t})/(p_{i}e^{t}+1-p_{i})\), it follows that

\[\mathbb{E}_{t}\left [\sum_{i=1}^{n} X_{i}\right ]=\sum_{i=1}^{n}\frac{p_{i}e^{t}}{p_{i}e^{t}+1-p_{i}}.\]

The value of \(t\) that makes the preceding equal to \(a\) can be numerically approximated and the resulting \(t\) utilized in the simulation. As an illustration, suppose that \(n=20\), \(p_{i}=0.4\) for all \(i\), and \(a=16\). Then, we have

\[\mathbb{E}_{t}[S]=20\cdot\frac{0.4e^{t}}{0.4e^{t}+0.6}.\]

Setting this equal to \(a=16\) yields after a little algebra \(e^{t^{*}}=6\). If we generate the Bernoullis using the parameter

\[\frac{0.4e^{t^{*}}}{0.4e^{t^{*}}+0.6}=0.8,\]

then, as \(M(t^{*})=(0.4e^{t^{*}}+0.6)^{20}=3^{20}\) and  \(e^{-t^{*}S}=\left (\frac{1}{6}\right )^{S}\), we see that the importance sampling estimator is

\[\widehat{\theta}=I\{S\geq a\}\cdot\left (\frac{1}{6}\right )^{S}\cdot 3^{20}.\]

It follows, from (\ref{sch4eq14}), that we have

\[\widehat{\theta}\leq\left (\frac{1}{6}\right )^{16}3^{20}=0.001236.\]

That is, on each iteration the value of the estimator is between \(0\) and \(0.001236\). Since, in this case, \(\theta\) is the probability that a binomial random variable with parameters \(20\), \(0.4\) is at least \(16\), it can be explicitly computed with the result \(0.000317\). Hence the raw simulation estimator \(I\), which on each iteration takes the value \(0\) if the sum of the Bernoullis with parameter \(0.4\) is less than \(16\) and takes the value \(1\) otherwise, will have the variance

\[\mbox{Var}(I)=\theta (1-\theta )=3.169\times 10^{-4}.\]

On the other hand, it follows, from  \(0\leq\widehat{\theta}\leq 0.001236\), that we have

\[\mbox{Var}(\widehat{\theta})\leq 2.9131\times 10^{-7}.\]
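The whole calculation can be sketched as follows; the iteration count and seed are assumptions, while the parameters \(n=20\), \(p=0.4\), \(a=16\), and the tilted parameter \(0.8\) come from the example:

```python
import math, random

# Importance sampling for theta = P{Bin(20, 0.4) >= 16}: generate each
# X_i as Bernoulli(0.8) (the tilted parameter) and contribute
# I{S >= 16} * (1/6)^S * 3^20 per iteration. (Assumed: seed 8, 100_000 reps.)
random.seed(8)
n_vars, p, a, reps = 20, 0.4, 16, 100_000
m_factor = 3.0 ** 20   # M(t*), with e^{t*} = 6

total = 0.0
for _ in range(reps):
    s = sum(1 for _ in range(n_vars) if random.random() < 0.8)
    if s >= a:
        total += (1.0 / 6.0) ** s * m_factor
theta_hat = total / reps

# exact binomial tail for comparison
exact = sum(math.comb(20, k) * p**k * (1 - p)**(20 - k) for k in range(16, 21))
print(theta_hat, exact)  # both near 0.000317
```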

Example. Consider a single server queue in which the times between successive customer arrivals are random variables \(X_{i}\) having density function \(f\) and the service times are random variables \(Y_{i}\) having density function \(g\). Let \(D_{n}\) denote the amount of time that the \(n\)th arrival spends waiting in queue and suppose we are interested in estimating \(\theta =\mathbb{P}\{D_{n}\geq a\}\) when \(a\) is much larger than \(\mathbb{E}[D_{n}]\). Rather than generating the successive inter-arrival and service times according to \(f\) and \(g\), respectively, we should generate them according to the tilted densities \(f_{-t}\) and \(g_{t}\), where \(t\) is a positive number to be determined. Note that using these distributions as opposed to \(f\) and \(g\) will result in smaller inter-arrival times (since \(-t<0\)) and larger service times. Hence, there will be a greater chance that \(D_{n}\geq a\) than if we had simulated using the densities \(f\) and \(g\). The importance sampling estimator for \(\theta\) would then be

\[\widehat{\theta}=I\{D_{n}\geq a\}\prod_{i=1}^{n} \frac{f(X_{i})g(Y_{i})}{f_{-t}(X_{i})g_{t}(Y_{i})}\]

Now, we have
\[\frac{f(X_{i})}{f_{-t}(X_{i})}=M_{f}(-t)e^{tX_{i}}\mbox{ and }
\frac{g(Y_{i})}{g_{t}(Y_{i})}=M_{g}(t)e^{-tY_{i}},\]

which imply

\[\widehat{\theta}=I\{D_{n}\geq a\}e^{t(S_{n}-T_{n})}(M_{f}(-t)M_{g}(t))^{n}\]

where \(S_{n}=\sum_{i=1}^{n} X_{i}\) is the sum of the first \(n\) inter-arrival times, \(T_{n}=\sum_{i=1}^{n} Y_{i}\) is the sum of the first \(n\) service times, and \(M_{f}\) and \(M_{g}\) are the moment generating functions of the densities \(f\) and \(g\), respectively. The value of \(t\) used should be determined by experimenting with a variety of different choices. \(\sharp\)

Importance sampling is also quite useful in estimating a conditional expectation when one is conditioning on a rare event. That is, suppose \({\bf X}\) is a random vector with density function \(f\) and that we are interested in estimating

\[\theta =\mathbb{E}[h({\bf X})|{\bf X}\in {\cal A}]\]

where \(h({\bf x})\) is an arbitrary real-valued function and where \(\mathbb{P}\{{\bf X}\in {\cal A}\}\) is a small unknown probability. Since the conditional density of \({\bf X}\) given that it lies in \({\cal A}\) is

\[f({\bf x}|{\bf X}\in {\cal A})=\frac{f({\bf x})}{\mathbb{P}\{{\bf X}\in {\cal A}\}},\quad {\bf x}\in {\cal A}\]

we have

\[\theta =\frac{\int_{{\bf X}\in {\cal A}} h({\bf x})f({\bf x})d{\bf x}}{\mathbb{P}\{{\bf X}\in {\cal A}\}}=\frac{\mathbb{E}[h({\bf X})I({\bf X}\in {\cal A})]}{\mathbb{E}[I({\bf X}\in {\cal A})]}=\frac{\mathbb{E}(N)}{\mathbb{E}(D)}\]

where \(\mathbb{E}(N)\) and \(\mathbb{E}(D)\) are defined to equal the numerator and denominator in the preceding, and \(I({\bf X}\in {\cal A})\) is defined to be \(1\) if \({\bf X}\in {\cal A}\) and \(0\) otherwise. Hence, rather than simulating \({\bf X}\) according to the density \(f\), which would make it very unlikely to be in \({\cal A}\), we can simulate it according to some other density \(g\) which makes this event more likely. If we simulate \(k\) random vectors \({\bf X}_{1},\cdots ,{\bf X}_{k}\) according to \(g\), then we can estimate \(\mathbb{E}(N)\) by \((1/k)\sum_{i=1}^{k} N_{i}\) and \(\mathbb{E}(D)\) by \((1/k)\sum_{i=1}^{k}D_{i}\), where

\[N_{i}=\frac{h({\bf X}_{i})I({\bf X}_{i}\in {\cal A})f({\bf X}_{i})}{g({\bf X}_{i})}\mbox{ and }
D_{i}=\frac{I({\bf X}_{i}\in {\cal A})f({\bf X}_{i})}{g({\bf X}_{i})}\]

Thus, we obtain the following estimator of \(\theta\)

\[\widehat{\theta}=\frac{\sum_{i=1}^{k} h({\bf X}_{i})I({\bf X}_{i}\in {\cal A})f({\bf X}_{i})/g({\bf X}_{i})}
{\sum_{i=1}^{k} I({\bf X}_{i}\in {\cal A})f({\bf X}_{i})/g({\bf X}_{i})}.\]

Example. Let \(X_{i}\) be independent exponential random variables with respective rates \(1/(i+2)\), \(i=1,2,3,4\), so that \(\mathbb{E}[X_{i}]=i+2\). Let \(S=\sum_{i=1}^{4} X_{i}\); then \(\mathbb{E}[S]=\sum_{i=1}^{4}(i+2)=18\). We want to estimate \(\theta =\mathbb{E}[S|S>62]\), where \(62\) is much larger than \(\mathbb{E}[S]\). To accomplish this, we can use importance sampling with the tilted densities. That is, we can choose a value \(t\) and then simulate the \(X_{i}\) with rates \(1/(i+2)-t\) by referring to Example \ref{sch4ex14}. If we choose \(t=0.14\), then \(\mathbb{E}_{t}[S]=68.43\). So, let us generate \(k\) sets of exponential random variables \(X_{i}\) with rates \(1/(i+2)-0.14\), \(i=1,2,3,4\), and let \(S_{j}\) be the sum of the \(j\)th set for \(j=1,\cdots ,k\). Since

\[\theta =\mathbb{E}[S|S>62]=\frac{\mathbb{E}[SI(S>62)]}{\mathbb{E}[I(S>62)]}\]

and

\begin{align*} \frac{f({\bf X})}{g({\bf X})} & =\frac{\prod_{i=1}^{4} f_{i}(X_{i})}{\prod_{i=1}^{4} f_{i,t}(X_{i})}\\ & =\prod_{i=1}^{4} \frac{\lambda_{i}e^{-\lambda_{i}X_{i}}}{e^{tX_{i}}\lambda_{i}e^{-\lambda_{i}X_{i}}/M_{i}(t)}=\prod_{i=1}^{4} M_{i}(t)e^{-tS}\end{align*}

where \(\lambda_{i}=1/(i+2)\), we can estimate \(\mathbb{E}[SI(S>62)]\) by

\[\frac{1}{k}\sum_{j=1}^{k}\frac{S_{j}I(S_{j}>62)f({\bf X}_{j})}{g({\bf X}_{j})}=\frac{\prod_{i=1}^{4} M_{i}(t)}{k}\sum_{j=1}^{k} S_{j}I(S_{j}>62)e^{-0.14S_{j}}\]

and \(\mathbb{E}[I(S>62)]\) by

\[\frac{1}{k}\sum_{j=1}^{k}\frac{I(S_{j}>62)f({\bf X}_{j})}{g({\bf X}_{j})}=\frac{\prod_{i=1}^{4} M_{i}(t)}{k}\sum_{j=1}^{k} I(S_{j}>62)e^{-0.14S_{j}}.\]

The estimator of \(\theta\) is

\[\widehat{\theta}=\frac{\sum_{j=1}^{k} S_{j}I(S_{j}>62)e^{-0.14S_{j}}}{\sum_{j=1}^{k} I(S_{j}>62)e^{-0.14S_{j}}}\]
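A sketch of this estimator follows (the number of sets \(k\) and the seed are assumptions); note that the common factor \(\prod_{i}M_{i}(t)\) cancels in the ratio, leaving only the weights \(e^{-0.14S_{j}}\):

```python
import math, random

# Ratio estimator of E[S | S > 62]: draw each X_i from the tilted
# exponential with rate 1/(i+2) - 0.14 and weight by e^{-0.14 S_j}.
# (Assumed: seed 9, k = 100_000 simulated sets.)
random.seed(9)
t, k = 0.14, 100_000
rates = [1.0 / (i + 2) - t for i in range(1, 5)]  # tilted rates, all > 0

num = den = 0.0
for _ in range(k):
    s = sum(random.expovariate(r) for r in rates)
    if s > 62:
        w = math.exp(-t * s)
        num += s * w
        den += w
theta_hat = num / den
print(theta_hat)  # conditional mean, somewhat above 62
```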

The importance sampling approach is also useful in that it enables us to estimate two (or more) distinct quantities in a single simulation. For example, suppose that \(\theta_{1}=\mathbb{E}[h({\bf Y})]\) and \(\theta_{2}=\mathbb{E}[h({\bf W})]\), where \({\bf Y}\) and \({\bf W}\) are random vectors having joint density functions \(f\) and \(g\), respectively. If we now simulate \({\bf W}\), we can simultaneously use \(h({\bf W})\) and \(h({\bf W})f({\bf W})/g({\bf W})\) as estimators of \(\theta_{2}\) and \(\theta_{1}\), respectively. For example, suppose we simulate \(T\), the total time in the system of the first \(r\) customers in a queuing system in which the service distribution is exponential with mean \(2\). If we now decide that we really should have considered the same system but with a service distribution that is gamma distributed with parameters \((2,1)\), then it is not necessary to repeat the simulation, because we can just use the estimator

\[T\cdot\frac{\prod_{i=1}^{r} S_{i}e^{-S_{i}}}{\prod_{i=1}^{r}\left (\frac{1}{2}e^{-S_{i}/2}\right )}=2^{r}\cdot T\cdot\exp\left [
-\sum_{i=1}^{r}\frac{S_{i}}{2}\right ]\prod_{i=1}^{r} S_{i}\]

where \(S_{i}\) is the (exponentially) generated service time of customer \(i\). (The above follows since the exponential service time density is \(g(s)=(1/2)e^{-s/2}\), whereas the gamma \((2,1)\) density is \(f(s)=se^{-s}\).)

 

 

Hsien-Chung Wu