Statistical FAQ

Questions:

  1. Sample vs. Population:  Why does Statistical only use population statistics?

Q1: Sample vs. Population  

i.e. "Why does Statistical only use population statistics?"

First, it should be pointed out that whether a formula uses μ or X¯ (X with a bar over it), both symbols stand for the mean of the data and are calculated the same way. The same goes for n and N, which both stand for the number of points in a data set.

The variance of a data set, be it sample or population, is defined to be:

∑(X-μ)² / N

or

∑(X-X¯)² / n
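
As a quick illustration of the definition above, here is a minimal Python sketch. The function name `variance` and the example scores are made up for this FAQ and are not taken from Statistical itself; the standard library's `statistics.pvariance` is used only as a cross-check.

```python
from statistics import pvariance


def variance(data):
    """Variance of a data set: sum of squared deviations from the mean, divided by n."""
    n = len(data)
    mean = sum(data) / n
    # Sum of squares (SS): squared distance of each point from the mean.
    sum_of_squares = sum((x - mean) ** 2 for x in data)
    return sum_of_squares / n


# Hypothetical example data, chosen so the arithmetic is easy to follow by hand.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

print(variance(scores))   # mean is 5, SS is 32, so 32 / 8
print(pvariance(scores))  # standard-library equivalent, also divides by n
```

Both calls divide the same sum of squares by the number of points, so they agree.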

The confusion comes from some textbooks stating that the variance of a sample should be divided by (n-1) instead of n.

From p. 87 of Statistical Reasoning by King & Minium:

Point of Controversy (Calculating the Sample Variance: Should we divide by n or (n-1)?)

...

Rarely are we interested in the sample itself. Instead, we obtain samples and calculate statistics to make inferences about population parameters. Due to chance factors in drawing a sample, we do not expect our sample statistic (e.g. sample mean, sample variance) to exactly equal the population parameter (e.g., population mean, population variance), but we hope it is close.

...

To correct for this underestimation in the numerator of the formula for the sample variance, statisticians divide SS by (n-1) instead of n. The result is no longer the variance of the sample, but what is called the unbiased estimate of the population variance. The value of ∑(X-X¯)² / (n-1) will still not be equal to the population variance in most cases, but instead of a biased estimate (tending to be an underestimate), there is an equal likelihood of the value falling below or above the value of the population variance.

So, which formula is correct? Actually, they both are. To obtain the variance for any given set of scores, we always divide by n. Thus, the variance of a sample is calculated by dividing by n. Ultimately, however, we are interested in estimating the population variance, and to get an unbiased estimate of that we divide by (n-1).

I have considered adding a row for "Unbiased estimate of the population variance", and early versions had a whole section for Population vs. Sample. However, I feel the added complexity would only serve to confuse people, as it is a confusing subject anyway.

Ultimately, I decided it was not needed. Statistical already provides the sum of squares, which is the same for a population and a sample, as well as the raw number of inputs, so it is trivial to calculate the unbiased estimate on your own should you need it: divide the sum of squares by (n-1).
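
For anyone who wants to do that calculation, here is a minimal Python sketch. It assumes you have copied the sum of squares and the number of inputs out of Statistical by hand; the function name `unbiased_variance_estimate` and the example numbers are hypothetical, and `statistics.variance` appears only as a cross-check.

```python
from statistics import variance as sample_variance


def unbiased_variance_estimate(sum_of_squares, n):
    """Unbiased estimate of the population variance: SS / (n - 1)."""
    if n < 2:
        raise ValueError("need at least two data points to divide by (n - 1)")
    return sum_of_squares / (n - 1)


# Hypothetical values read off for the data set [2, 4, 4, 4, 5, 5, 7, 9].
ss = 32.0  # sum of squares
n = 8      # number of inputs

print(ss / n)                             # variance of the data set itself (divide by n)
print(unbiased_variance_estimate(ss, n))  # unbiased estimate (divide by n - 1)

# Cross-check against the standard library, which divides by (n - 1).
print(sample_variance([2, 4, 4, 4, 5, 5, 7, 9]))
```

The only difference between the two results is the divisor, so the gap between them shrinks as n grows.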

Adding a mode to switch the symbols and calculations between sample and population is a good idea. However, I believe the usability of Statistical would suffer from it. (I still receive emails from users asking why they cannot see scatter plots, i.e. they never switched from single-variable to multivariable input.)

If you have any further ideas on the matter, I would love to hear from you. Post in the comments or drop me a line using my Contact form.