Sampling Distribution for Means and Proportions

Recall that a statistic is a number calculated from a random sample. Before the sample is taken, the value of the statistic is unknown, so the statistic is itself a random variable. The distribution of a statistic over random samples of a given size is called the sampling distribution of the statistic.

The population mean \(\mu\) is estimated by the sample mean \(\bar{x},\) and the population proportion \(p\) is estimated by the sample proportion \(\hat{p}.\) For this reason the distributions of these statistics are of interest. In this lesson, formulas are derived for the mean, variance, and standard deviation of these statistics. The distribution of each statistic can be approximated by a normal distribution, provided the sample size is sufficiently large. This fact follows from the Central Limit Theorem.

Central Limit Theorem

Why is it that many random variables are, at least approximately, normally distributed?

It turns out that the sum of many independent and identically distributed random variables has approximately the normal distribution.

Central Limit Theorem (Simplified Classical Version). Suppose that \(X_1, X_2, X_3, \ldots\) is a sequence of independent and identically distributed random variables. Let \[S_n = X_1 + X_2 + \ldots + X_n.\] Then the random variable \(S_n\) has approximately the normal distribution provided that \(n\) is sufficiently large.

A precise statement of this theorem requires knowledge of Calculus. There are some additional technical assumptions that need to be satisfied in order for the theorem to be true, and there are many variations of this theorem. For instance, it is not necessary that all the random variables be identically distributed and it is also not necessary that all variables be independent, provided that some other additional technical assumptions hold.

In practical terms, the Central Limit Theorem says that an effect, which is the sum of many other effects, has approximately a normal distribution.

The picture below shows the probability density function of the sum of \(n\) independent uniformly distributed random variables (in blue) together with the probability density function of the corresponding normal approximation (in red) for \(n = 1, 2, 3, 4.\)

[Figure: central limit illustration — densities of sums of uniform random variables (blue) with their normal approximations (red)]

Note that the sum of only 3 independent uniformly distributed random variables already resembles the normal distribution.
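This smoothing effect can be checked with a short simulation. The following Python sketch (with a hypothetical choice of 20,000 trials) draws sums of \(n\) independent Uniform(0, 1) variables and compares the simulated mean and standard deviation with the theoretical values \(n/2\) and \(\sqrt{n/12}\), which are the parameters of the corresponding normal approximation:

```python
import random
import statistics

def simulate_sums(n, trials=20000, seed=0):
    """Simulate `trials` sums of n independent Uniform(0, 1) variables."""
    rng = random.Random(seed)
    return [sum(rng.random() for _ in range(n)) for _ in range(trials)]

# A Uniform(0, 1) variable has mean 1/2 and variance 1/12, so the sum of
# n of them has mean n/2 and standard deviation sqrt(n/12).
for n in (1, 2, 3, 4):
    sums = simulate_sums(n)
    print(n, round(statistics.mean(sums), 2), round(statistics.stdev(sums), 2))
```

A histogram of the simulated sums for \(n = 3\) or \(n = 4\) would already look close to the bell curve in the figure above.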

Distribution of a Sample Proportion

You have a box with a large number of orange and blue beads and you want to know the proportion of orange beads in the box. Since there are too many to count them all, you decide to select a simple random sample of size \(n\) from the box. Let \(X\) be the number of orange beads in the sample and let \[\hat p = \frac{X}{n}. \] Then \(\hat p\) is the proportion of orange beads in the sample and could be used as an estimate of the proportion \(p\) of the orange beads in the box.

Suppose we select the beads of the sample one by one. Each time we take another bead the proportion of orange beads in the box changes. However, if the number of beads in the box is large when compared to the sample size, then the proportion of the orange beads stays approximately the same during the sampling, and the distribution of \(X\) is approximately Binomial with parameters \(n\) and \(p.\)
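As a quick illustration, the bead sampling can be simulated by treating each draw as orange with probability \(p\), which is exactly the binomial model described above. The sample size and proportion below are hypothetical:

```python
import random

def draw_sample(n, p, seed=0):
    """Simulate drawing n beads, each orange with probability p.

    This models the case where the box is much larger than the sample,
    so the proportion of orange beads is essentially constant.
    """
    rng = random.Random(seed)
    x = sum(1 for _ in range(n) if rng.random() < p)
    return x, x / n  # count of orange beads, and the sample proportion

x, p_hat = draw_sample(n=50, p=0.3)
print(x, p_hat)
```

Repeating this many times and recording \(\hat p\) each time would trace out the sampling distribution of the sample proportion.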

If \(X\sim B(n, p)\) then using the rules for the expected value and the variance we find that \[E[\hat p] =E\left[\frac{X}{n}\right] = \frac{1}{n}E[X] = \frac{1}{n}np = p \] and \[Var(\hat p) = Var\left(\frac{X}{n}\right) = \frac{1}{n^2}Var(X) = \frac{1}{n^2}np(1-p) = \frac{p(1-p)}{n}. \] The standard deviation of \(\hat p\) is obtained by taking the square root of the variance.

We can summarize the formulas as follows. \[\mu_{\hat p} = p, \quad \sigma^2_{\hat p} = \frac{p(1-p)}{n}, \quad \mbox{ and } \sigma_{\hat p} = \sqrt{\frac{p(1-p)}{n}}. \]
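These formulas are straightforward to compute directly. A minimal sketch, using the hypothetical values \(p = 0.4\) and \(n = 100\):

```python
import math

def proportion_mean_sd(p, n):
    """Mean and standard deviation of the sample proportion p-hat:
    mu = p and sigma = sqrt(p(1-p)/n)."""
    return p, math.sqrt(p * (1 - p) / n)

# Hypothetical example: 40% orange beads, sample of size 100.
mu, sd = proportion_mean_sd(0.4, 100)
print(mu, round(sd, 4))  # 0.4 and sqrt(0.4 * 0.6 / 100) ≈ 0.049
```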

The random variable \(X\) is the sum of \(n\) independent identically distributed random variables and therefore has approximately the normal distribution, provided that \(n\) is sufficiently large. In fact, the distribution of \(X\) is approximately \(N(\mu=np, \sigma=\sqrt{np(1-p)}).\)

If some random variable \(X\) has (approximately) a normal distribution then the random variable \(X/n\) also has (approximately) a normal distribution. Therefore the sample proportion \(\hat p = X/n\) has (approximately) the distribution \(N(\mu=p, \sigma=\sqrt{\frac{p(1-p)}{n}}).\)
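Putting this together, probabilities for \(\hat p\) can be approximated with a normal distribution. A sketch using Python's `statistics.NormalDist`, with the hypothetical values \(p = 0.3\) and \(n = 100\):

```python
import math
from statistics import NormalDist

# Hypothetical setting: 30% orange beads, sample of size 100.
p, n = 0.3, 100
phat_dist = NormalDist(mu=p, sigma=math.sqrt(p * (1 - p) / n))

# Approximate probability that the sample proportion falls within 0.05 of p.
prob = phat_dist.cdf(0.35) - phat_dist.cdf(0.25)
print(round(prob, 3))
```

So with these numbers, there is roughly a 72% chance that the sample proportion lands within 0.05 of the true proportion.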

Since the approximation is only good for sufficiently large sample sizes, one needs to determine whether the sample size is large enough. Some authors suggest that the approximation is valid if \(np\geq 10\) and \(n(1-p)\geq 10.\) Others suggest that \(np\geq 5\) and \(n(1-p)\geq 5\) suffices. There are also various techniques for improving the normal approximation.
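The rule of thumb is simple to encode. The sketch below makes the threshold a parameter, since authors differ on whether 10 or 5 should be used:

```python
def normal_approx_ok(n, p, threshold=10):
    """Rule of thumb for the normal approximation of the sample proportion:
    require both np >= threshold and n(1-p) >= threshold."""
    return n * p >= threshold and n * (1 - p) >= threshold

print(normal_approx_ok(100, 0.05))  # np = 5, fails the threshold of 10
print(normal_approx_ok(250, 0.05))  # np = 12.5 and n(1-p) = 237.5, passes
```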

With the availability of statistical software such as R there is usually no need to approximate the distribution of the sample proportion. We will, however, use the normal approximation for determining appropriate sample sizes in another lesson.

Distribution of a Sample Mean

Consider the population of Freshman students at Auburn University in 2010. Suppose we randomly select a Freshman and measure the student's weight before and after the Fall semester. Let \(X\) be the change in weight measured in pounds, where a positive number denotes a gain in weight and a negative number a loss in weight.

Denote the expected value of \(X\) by \(\mu_X.\) Then \(\mu_X\) is the average change in weight of all the Freshman students at Auburn University. Suppose we take a sample of size \(n\) and denote the sample mean by \(\bar x.\) Note that \[\bar x = \frac{X_1 + X_2 + \ldots + X_n}{n}. \]

Using the rules for expectations we find that \[\begin{align*} E[\bar x] & = E\left[\frac{1}{n}(X_1 + X_2 + \ldots + X_n)\right] \\ & = \frac{1}{n}E[X_1 + X_2 + \ldots + X_n]\\ & = \frac{1}{n}(E[X_1] + E[X_2] + \ldots + E[X_n])\\ & = \frac{1}{n}(\mu_X + \mu_X + \ldots + \mu_X)\\ & = \frac{1}{n}n\mu_X\\ & = \mu_X. \end{align*} \] In words, the expected value of the sample mean is the population mean.

Using the rules for variances we find that \[\begin{align*} Var(\bar x) & = Var \left(\frac{1}{n}(X_1 + X_2 + \ldots + X_n)\right) \\ & = \frac{1}{n^2}Var(X_1 + X_2 + \ldots + X_n)\\ & = \frac{1}{n^2}(Var(X_1) + Var(X_2) + \ldots + Var(X_n))\\ & = \frac{1}{n^2}(\sigma^2_X + \sigma^2_X + \ldots + \sigma^2_X)\\ & = \frac{1}{n^2}n\sigma^2_X\\ & = \frac{\sigma^{2}_{X}}{n}. \end{align*} \] The standard deviation is the square root of the variance and therefore \[\sigma_{\bar x} = \frac{\sigma_X}{\sqrt{n}}.\] In words, the standard deviation of the sample mean is the standard deviation of the population divided by the square root of the sample size.

In summary, if a population has mean \(\mu\) and standard deviation \(\sigma\) then \[\mu_{\bar x} = \mu \mbox{ and } \sigma_{\bar x} = \frac{\sigma}{\sqrt{n}}.\]
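A short simulation can confirm these formulas. The sketch below (the population parameters \(\mu = 3\) and \(\sigma = 10\) are hypothetical) draws many samples of size \(n = 25\) from a normal population and checks the mean and standard deviation of the resulting sample means against \(\mu\) and \(\sigma/\sqrt{n}\):

```python
import random
import statistics

def sample_means(pop_mu, pop_sigma, n, trials=20000, seed=1):
    """Simulate `trials` sample means of size-n samples from a
    normal population with the given mean and standard deviation."""
    rng = random.Random(seed)
    return [statistics.mean(rng.gauss(pop_mu, pop_sigma) for _ in range(n))
            for _ in range(trials)]

means = sample_means(pop_mu=3.0, pop_sigma=10.0, n=25)
print(round(statistics.mean(means), 2))   # should be close to mu = 3
print(round(statistics.stdev(means), 2))  # should be close to 10 / sqrt(25) = 2
```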

In the derivation of the variance of the sample mean we used the rule that the variance of a sum of random variables equals the sum of their variances. This rule is only true if the random variables are independent. The random variables are independent if we pick the students one by one such that in every selection each student has the same chance of being selected, that is, if we sample with replacement. Proceeding in this way could result in selecting the same student more than once. If we select a simple random sample of size \(n\) (without replacement), then the random variables are, strictly speaking, not independent. However, if the sample size is small when compared to the population, say less than 5%, then the final result is still approximately true. If the sample size is not small compared with the population size, then a more careful analysis is needed to derive the standard deviation of the sample mean. We do not consider this case in this course.

Since the sample mean is the sum of \(n\) independent (at least approximately independent) and identically distributed random variables divided by \(n\), the sample mean has approximately a normal distribution, provided that the sample size is sufficiently large. As a rule of thumb, if nothing else is known about the distribution of the population, \(n\geq 30\) is considered to be sufficiently large.

For \(n\) sufficiently large (\(n\geq 30\)), the sample mean is approximately normally distributed with a mean that is equal to the population mean and a standard deviation that is equal to the population standard deviation divided by the square root of \(n.\) In formulas, \[\bar x \sim N(\mu = \mu_X, \sigma = \frac{\sigma_X}{\sqrt{n}}).\]
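For example (all numbers hypothetical: \(\mu_X = 3\) pounds, \(\sigma_X = 10\) pounds, \(n = 36\)), the normal approximation gives the probability that the average weight change in the sample exceeds 5 pounds:

```python
import math
from statistics import NormalDist

# Hypothetical population: mean weight change 3 lb, standard deviation 10 lb.
mu_x, sigma_x, n = 3.0, 10.0, 36

# Sampling distribution of the sample mean: N(mu_x, sigma_x / sqrt(n)).
xbar_dist = NormalDist(mu=mu_x, sigma=sigma_x / math.sqrt(n))

# Probability that the sample mean exceeds 5 lb (z = (5 - 3) / (10/6) = 1.2).
prob_gain = 1 - xbar_dist.cdf(5.0)
print(round(prob_gain, 4))  # ≈ 0.1151
```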

We can use this result to gain insight into how accurate the estimate \(\bar x\) of \(\mu\) is. For a sufficiently large sample size, the distribution of \(\bar x\) is approximately normal, and by the Empirical Rule we can be 95% confident that \(\bar x\) is within 2 standard deviations of \(\mu,\) that is, we can be 95% confident that the absolute error is at most \(\frac{2\sigma_X}{\sqrt{n}}.\) Furthermore, the formula shows that quadrupling the sample size cuts this error bound in half. To be 95% confident about a statement means that if we repeated the same method independently many times, then the statement would be true in about 95% of those repetitions.
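The error bound \(2\sigma_X/\sqrt{n}\) and the effect of quadrupling the sample size can be verified directly (the value of \(\sigma_X\) below is hypothetical):

```python
import math

def margin_95(sigma, n):
    """Approximate 95% error bound 2 * sigma / sqrt(n) from the Empirical Rule."""
    return 2 * sigma / math.sqrt(n)

# With a hypothetical sigma = 10, quadrupling n from 25 to 100 halves the bound.
m1 = margin_95(10, 25)   # 2 * 10 / 5 = 4.0
m2 = margin_95(10, 100)  # 2 * 10 / 10 = 2.0
print(m1, m2)
```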