Sampling

How Well Does a Sample Describe the Population?

All files, software, and tutorials that make up SABLE Copyright (c) 1997, 1998, 1999 Virginia Tech. You may use these programs under the conditions of the SABLE General License, which incorporates the GNU GENERAL PUBLIC LICENSE.

1. Approximating the Population Mean

You will be given a "population" of circles whose mean (average) radius will be shown in the panel below the circles. Click on circles to collect a small sample whose mean radius is close to the population mean. The mean of your sample will also be displayed below the circles. Compare your sample mean to the population mean, and select larger or smaller circles to bring your sample mean within the target range. Once you reach your target, the results are displayed, and then the exercise will repeat with a new target that is closer to the population mean.

Try to keep your samples as small as possible. Notice what happens to your sample size as the target approaches the true mean. You will probably have to choose circles carefully to avoid including the entire population in your sample.

You can see from this activity that it is possible for a sample of circles to provide information about all the circles in the population, even though only a few are included in the sample. With practice, you can select a small sample that provides an estimate that is very close to the true value. In general, however, creating a sample that is a better representation of the entire population requires more individuals than creating a less accurate representation.

We can refer to samples compiled at random as Random Samples, meaning that individuals are selected from a population, so that each member of the population has a known chance of selection for a sample. There is an important distinction between an arbitrarily selected individual, and one that is truly random. If we select an individual to become a member of a sample haphazardly, without careful thought, we may meet an everyday definition of randomness. But we may in fact be selecting only individuals that are convenient to collect, or that meet an investigator's inaccurate notion of what constitutes a "good" representative of the population.

In contrast, true randomness is assured when the sampling procedure is carefully designed to remove any element of choice. Random samples are collected according to a procedure (sometimes known as a protocol) specifically tailored for the individual study, that gives precise steps to assure that each individual in the population has a known chance of selection. Usually such a procedure is based upon use of a table of random numbers to select the sample.
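As a minimal sketch of such a procedure, the Python snippet below lets a pseudo-random number generator stand in for the table of random numbers. It covers only the simplest case, in which every individual has the same chance of selection. The function name `draw_random_sample`, the 40 numbered circles, and the seed value are assumptions made for illustration.

```python
import random

def draw_random_sample(population, sample_size, seed=None):
    """Return sample_size distinct members, each equally likely to be chosen."""
    rng = random.Random(seed)  # a fixed seed only makes the draw repeatable
    return rng.sample(population, sample_size)

# Hypothetical population: 40 circles identified by the numbers 1 to 40.
circle_ids = list(range(1, 41))
sample = draw_random_sample(circle_ids, 5, seed=1997)
print(sample)
```

Because the generator picks the individuals, the investigator has no opportunity to favor circles that are convenient or that look "typical".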

2. The Sampler

Here is another population of circles, whose mean diameter is given. The distribution of circle diameters is presented in a histogram (chart) as well. Type a sample size in the text area and click the "Take a Sample" button. A sample of that size will be randomly collected from the population. The sample mean diameter will be displayed, and the distribution will be drawn on top of the population histogram.

Experiment with sample sizes from very small to nearly all the population. Notice how results vary when you use separate trials with a constant sample size (click on "Take a Sample" several times without changing the sample size). Observe that results vary most noticeably when sample sizes are small.

Selecting a small sample size does not necessarily mean that the estimate will be inaccurate, and similarly, choosing a larger sample size will not guarantee accurate estimates. But, if (for example) we pick ten samples of 5 individuals each, we will tend, on the average, to obtain less accurate estimates than if we pick ten samples of 50 individuals each. Notice also that estimates of the mean from a series of larger samples tend to vary less than do those from a series of smaller samples.

You can prove these facts to yourself using the Sampler activity! Take ten small samples and count how many times the sample mean was within 0.5 of the population mean. Now do the same thing for a larger sample. How small a sample can you use and still have your sample mean in this range 80% of the time? (Results will vary.)
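If the Sampler activity is not at hand, the same experiment can be approximated in a few lines of Python. This is a sketch under stated assumptions: the population of 200 diameters between 1 and 20 units is made up, and `count_hits` is a hypothetical helper, not part of the tutorial software.

```python
import random
import statistics

def count_hits(population, sample_size, trials=10, tolerance=0.5, rng=None):
    """Count the trials whose sample mean lands within tolerance of the population mean."""
    rng = rng or random.Random()
    population_mean = statistics.mean(population)
    hits = 0
    for _ in range(trials):
        sample = rng.sample(population, sample_size)
        if abs(statistics.mean(sample) - population_mean) <= tolerance:
            hits += 1
    return hits

rng = random.Random(42)
# Hypothetical population of 200 circle diameters between 1 and 20 units.
diameters = [rng.uniform(1, 20) for _ in range(200)]
for size in (5, 50):
    print("sample size", size, "hits out of 10:", count_hits(diameters, size, rng=rng))
```

Raising `trials` gives a steadier picture of how often each sample size reaches the target range.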

3. Means of Repeated Samples

In the following activity, type in a value for "sample size" and then click the "Get Samples" button. You will see a frequency histogram of the means of 200 different samples of that size. Start with a small sample size, like 10, and work up to larger samples. Notice how the histogram shows fewer different values for means of larger sample sizes, indicating that these means tend to converge to the population mean.

This activity illustrates the effect of choosing specific sample sizes. Small sample sizes, such as 10 or 15 for example, have a mean that tends to vary widely, as revealed by the wide range in the histogram of estimates. Increasing the sample size to 50 or 60, for example, reduces the wide swings in estimates of the overall mean.
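The narrowing of the histogram can also be checked numerically. The sketch below draws 200 samples at each of two sample sizes and reports the range and standard deviation of the resulting means. The population values and the helper name `sample_means` are assumptions chosen for illustration.

```python
import random
import statistics

def sample_means(population, sample_size, n_samples=200, rng=None):
    """Return the means of n_samples random samples of the given size."""
    rng = rng or random.Random()
    return [statistics.mean(rng.sample(population, sample_size))
            for _ in range(n_samples)]

rng = random.Random(7)
# Hypothetical population of 500 measurements between 1 and 20 units.
population = [rng.uniform(1, 20) for _ in range(500)]
for size in (10, 50):
    means = sample_means(population, size, rng=rng)
    print("size", size,
          "lowest mean", round(min(means), 2),
          "highest mean", round(max(means), 2),
          "spread", round(statistics.stdev(means), 2))
```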

Furthermore, as sample size increases, the shape of the histogram tends to assume a more symmetric shape, with a single large value near the center of the distribution and declining numbers on either side, taking a bell-shape form as they decline to smaller frequencies.

This general shape, known as the normal frequency distribution, is characteristic of the means of samples acquired by random sampling. If we randomly select samples from populations, even populations with non-normal distributions, the frequency distribution of the sample means will begin to approximate the normal frequency distribution as sample size increases. This important concept is known as the **Central Limit Theorem**.
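One way to watch this happen is to start from a population that is clearly not bell-shaped and measure how lopsided the sample means are. The sketch below uses a skewness statistic (near 0 for a symmetric distribution) on a made-up, strongly right-skewed population. The population, the sample sizes, and the helper names are assumptions for illustration only.

```python
import random
import statistics

def skewness(values):
    """Third standardized moment: close to 0 for a symmetric set of values."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return statistics.mean([((v - mean) / sd) ** 3 for v in values])

def sample_means(population, sample_size, n_samples=200, rng=None):
    """Return the means of n_samples random samples of the given size."""
    rng = rng or random.Random()
    return [statistics.mean(rng.sample(population, sample_size))
            for _ in range(n_samples)]

rng = random.Random(3)
# Hypothetical right-skewed population: many small values, a few large ones.
population = [rng.expovariate(1.0) for _ in range(5000)]
print("population skewness:", round(skewness(population), 2))
for size in (2, 10, 50):
    means = sample_means(population, size, rng=rng)
    print("sample size", size, "skewness of means:", round(skewness(means), 2))
```

With only 200 samples per size the printed values will fluctuate, but the skewness of the means should generally shrink toward 0 as the sample size grows.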

These results illustrate why random sampling and larger sample sizes are so valuable. Randomness avoids introducing investigator bias. Random sampling causes the means for a series of samples to approximate the normal frequency distribution, which allows us to use other statistical tests to estimate the reliability of samples. Larger sample sizes reduce sampling error, making it more likely that information from the sample is accurate.

Another useful application of stratified data is to make corrections to account for groups known to be underrepresented in a set of data. For example, about half the population are women. If your survey results include only 25% women, you may want to use a higher fraction of the women's responses to obtain a representation of men and women equal to that in the population.
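A closely related correction, sketched below, keeps every response but gives each one a weight so that the groups count in proportion to their share of the population. This is weighting rather than the subsampling described above, and it is shown only as one possible approach. The survey counts, the "yes" rates, and the function names are hypothetical.

```python
def poststratification_weights(sample_counts, population_shares):
    """Weight per group = population share divided by share of the sample."""
    total = sum(sample_counts.values())
    return {group: population_shares[group] / (sample_counts[group] / total)
            for group in sample_counts}

def weighted_proportion(responses, weights):
    """responses is a list of (group, answered_yes) pairs."""
    weighted_yes = sum(weights[group] for group, yes in responses if yes)
    weighted_all = sum(weights[group] for group, _ in responses)
    return weighted_yes / weighted_all

# Hypothetical survey: 25 women (15 say yes) and 75 men (30 say yes).
responses = ([("women", True)] * 15 + [("women", False)] * 10 +
             [("men", True)] * 30 + [("men", False)] * 45)
counts = {"women": 25, "men": 75}
shares = {"women": 0.5, "men": 0.5}  # about half the population are women

weights = poststratification_weights(counts, shares)
unweighted = sum(yes for _, yes in responses) / len(responses)
weighted = weighted_proportion(responses, weights)
print("unweighted:", round(unweighted, 3), "weighted:", round(weighted, 3))
```

In this made-up survey the women's responses receive a larger weight than the men's, which moves the overall estimate toward the women's rate.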

Within strata, individuals are collected randomly. In the following picture, the population has been divided into three strata, each of a different size. To obtain samples of the same size from each stratum, the researcher must sample a higher percentage of the population from the third stratum.
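The arithmetic behind equal-sized samples from unequal strata can be sketched as follows. The stratum sizes (600, 300, and 100), the sample size of 30, and the function name `stratified_sample` are assumptions for illustration and are not taken from the picture.

```python
import random

def stratified_sample(strata, per_stratum, rng=None):
    """Draw the same number of individuals at random from every stratum."""
    rng = rng or random.Random()
    return {name: rng.sample(members, per_stratum)
            for name, members in strata.items()}

# Hypothetical strata of different sizes; individuals are identified by number.
strata = {
    "first": list(range(0, 600)),
    "second": list(range(600, 900)),
    "third": list(range(900, 1000)),
}
per_stratum = 30
sample = stratified_sample(strata, per_stratum, rng=random.Random(5))
for name, members in strata.items():
    fraction = per_stratum / len(members)
    print(name, "sampled", len(sample[name]), "of", len(members), f"({fraction:.0%})")
```

The smallest stratum gives up the largest share of its members to reach the same sample size as the others.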

We can also stratify data to detect similarities or differences across different groups. We might want to answer questions such as:

- Do older women tend to vote Republican more than younger women?
- Is there a difference between attitudes of younger and older women on the topic of abortion?
- Is there a difference between rural and urban people in their support for capital punishment?

4. Stratification

Choose a variable and stratification scheme and compare samples from different strata. Answer questions in the panel on the right.

Stratifying allows us to compare different segments of a population, but in these examples, we have not really discussed how to *decide* if two samples are different. If 32% of the women under 40 years of age vote Republican, and 33% of the women over 40 vote Republican, then it is reasonable to conclude that the 1% difference does not signify a genuine difference between the two groups. It is more likely to result from the variation caused by a small sample size. But what conclusion do we draw when the difference is 2%? 3%? How large does the difference have to become before we can be confident that we are observing genuine differences between the two groups? A large part of the field of statistics is devoted to providing clear, systematic procedures for answering such questions. In our system of tutorials, we introduce these topics in Measures of Dispersion, which is devoted to describing variation within samples, and Hypothesis Testing, which examines how to compare samples.
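To get a feel for how large a difference chance alone can produce, the simulation below gives two groups exactly the same true rate of 32% and records how far apart their sample percentages end up. It is a rough sketch, not a formal test. The group sizes, the number of trials, and the helper names are assumptions for illustration.

```python
import random

def chance_gaps(true_rate, group_size, trials=200, rng=None):
    """Gaps, in percentage points, between two samples that share one true rate."""
    rng = rng or random.Random()
    gaps = []
    for _ in range(trials):
        yes_a = sum(rng.random() < true_rate for _ in range(group_size))
        yes_b = sum(rng.random() < true_rate for _ in range(group_size))
        gaps.append(100 * abs(yes_a - yes_b) / group_size)
    return gaps

def share_at_least(gaps, points):
    """Fraction of trials whose gap was at least the given number of points."""
    return sum(gap >= points for gap in gaps) / len(gaps)

rng = random.Random(2024)
for size in (100, 5000):
    gaps = chance_gaps(0.32, size, rng=rng)
    print("group size", size,
          "share of trials with a gap of 1 point or more:",
          round(share_at_least(gaps, 1), 2))
```

Comparing the two printed shares shows how strongly the answer to "is this difference genuine?" depends on sample size.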

Effort is required to collect every sample. The activities in this tutorial do not show that costs in time and effort increase as a researcher increases sample size. Thus, researchers have powerful economic incentives to minimize the costs of sampling. The challenge in designing a sampling plan is to strike an effective balance between the statistical advantages of large sample sizes and the cost advantages of small sample sizes.
