Subject 6. Confidence intervals for the population mean.

CFA Program Curriculum (2014), Volume 1, page 554.

Confidence intervals are typically constructed by using the following structure:

Confidence Interval = Point Estimate ± Reliability Factor x Standard Error

  • Point estimate is the value of a sample statistic used to estimate the population parameter.
  • Reliability factor is a number based on the sampling distribution of the point estimate and the degree of confidence (1 - α).
  • Standard error refers to the standard error of the sample statistic that is used to produce the point estimate.

Whatever the distribution of the population, the sample mean is always the point estimate used to construct the confidence intervals for the population mean. The reliability factor and the standard error, however, may vary depending on three factors:
  1. Distribution of population: normal or non-normal.
  2. Population variance: known or unknown.
  3. Sample size: large or small.
z-Statistic: a standard normal random variable

If a population is normally distributed with a known variance, z-statistic is used as the reliability factor to construct confidence intervals for the population mean.

In practice, the population standard deviation is rarely known. However, learning how to compute a confidence interval when the standard deviation is known is an excellent introduction to how to compute a confidence interval when the standard deviation has to be estimated.

Three values are used to construct a confidence interval for μ:
  1. The sample mean (m);
  2. The value of z (which depends on the level of confidence), and
  3. The standard error of the mean (σm).
The confidence interval has m for its center and extends a distance equal to the product of z and σm in both directions. Therefore, the formula for a confidence interval is:

m - z σm <= μ <= m + z σm

For a 100(1 - α)% confidence interval for the population mean, the z-statistic to be used is zα/2, the point of the standard normal distribution such that α/2 of the probability falls in the right-hand tail.

Effectively, the 100(1 - α)% of the area that makes up the confidence interval falls in the center of the graph, symmetrically around the mean. This leaves a total of 100α% of the area in the two tails combined, or 100(α/2)% in each tail.

Commonly used reliability factors are as follows:
  • 90% confidence intervals: z0.05 = 1.645. α is 10%, with 5% in each tail.
  • 95% confidence intervals: z0.025 = 1.96. α is 5%, with 2.5% in each tail.
  • 99% confidence intervals: z0.005 = 2.575. α is 1%, with 0.5% in each tail.
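
The reliability factors listed above can be reproduced without a z table. The following is a minimal sketch using Python's standard library; the function name `z_reliability_factor` is illustrative and not part of the curriculum.

```python
from statistics import NormalDist

def z_reliability_factor(confidence):
    """Return z_{alpha/2} for a two-sided interval at the given confidence level."""
    alpha = 1.0 - confidence
    # Leave alpha/2 in the right-hand tail, i.e. 1 - alpha/2 of the area to the left.
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z = {z_reliability_factor(level):.3f}")
```

To three decimals the 99% factor comes out as 2.576; printed tables commonly show it rounded to 2.575 or 2.58.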

Example

Assume that the standard deviation of SAT verbal scores in a school system is known to be 100. A researcher wishes to estimate the mean SAT score and compute a 95% confidence interval from a random sample of 10 scores.

The 10 scores are: 320, 380, 400, 420, 500, 520, 600, 660, 720, and 780. Therefore, m = 530, N = 10, and σm = 100 / √10 = 31.62. The value of z for the 95% confidence interval is the number of standard deviations one must go from the mean (in both directions) to contain .95 of the scores.

It turns out that one must go 1.96 standard deviations from the mean in both directions to contain .95 of the scores. The value of 1.96 was found using a z table. Since each tail is to contain .025 of the scores, you find the value of z for which 1 - 0.025 = 0.975 of the scores are below. This value is 1.96.

All the components of the confidence interval are now known: m = 530, σm = 31.62, z = 1.96.
Lower limit = 530 - (1.96)(31.62) = 468.02
Upper limit = 530 + (1.96)(31.62) = 591.98

Therefore, 468.02 ≤ μ ≤ 591.98. Loosely speaking, the researcher can be 95% confident that the mean SAT score in the school system is between 468 and 592. More precisely, if the researcher repeatedly took samples from the population and calculated a 95% confidence interval from each sample, on average 95% of those intervals would contain μ. Notice that this is a rather wide range of scores. If a larger sample had been used, the range would have been narrower.

The computation of the 99% confidence interval is exactly the same except that 2.576 (often rounded to 2.58) rather than 1.96 is used for z. The 99% confidence interval is: 448.54 <= μ <= 611.46. As it must be, the 99% confidence interval is wider than the 95% confidence interval.
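
The SAT example can be checked with a short script. This is a sketch only, assuming σ = 100 is known as stated in the example; `z_confidence_interval` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist, mean

def z_confidence_interval(sample, sigma, confidence):
    """Two-sided interval for the mean when the population sigma is known."""
    m = mean(sample)                        # point estimate
    se = sigma / sqrt(len(sample))          # standard error of the mean
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # reliability factor
    return m - z * se, m + z * se

scores = [320, 380, 400, 420, 500, 520, 600, 660, 720, 780]
low95, high95 = z_confidence_interval(scores, sigma=100, confidence=0.95)
low99, high99 = z_confidence_interval(scores, sigma=100, confidence=0.99)
print(f"95%: {low95:.2f} to {high95:.2f}")
print(f"99%: {low99:.2f} to {high99:.2f}")
```

Because the script uses unrounded z values, its limits can differ from hand calculations with table values in the second decimal place.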

Summary of Computations
  1. Compute m = ∑X/N.
  2. Compute σm = σ/√N
  3. Find z (1.96 for 95% interval; 2.58 for 99% interval)
  4. Lower limit = m - z σm
  5. Upper limit = m + z σm
  6. Lower limit <= μ <= Upper limit
Assumptions:
  1. Normal distribution
  2. σ is known
  3. Scores are sampled randomly and are independent
There are three other points worth mentioning here:
  1. The point estimate will always lie exactly at the midway mark of the confidence interval. This is because it is the "best" estimate for μ, and so the confidence interval expands out from it in both directions.
  2. The higher the percentage of confidence, the wider the interval will be. This is because as the percentage is increased, a wider interval is needed to give us a greater chance of capturing the unknown population value within that interval.
  3. The width of the confidence interval is always twice the part after the positive or negative sign, that is, twice the reliability factor x standard error. The width is simply the upper limit minus the lower limit.
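
The three points above can be verified numerically. The sketch below reuses the SAT figures (m = 530, σ = 100, N = 10) purely for illustration; the `interval` helper is an assumed name.

```python
from math import sqrt
from statistics import NormalDist

def interval(m, sigma, n, confidence):
    """Return (lower, upper, half_width) for a known-sigma interval."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z * sigma / sqrt(n)   # reliability factor x standard error
    return m - half_width, m + half_width, half_width

m, sigma, n = 530, 100, 10
widths = {}
for level in (0.90, 0.95, 0.99):
    low, high, half = interval(m, sigma, n, level)
    # Point 1: the point estimate sits at the midpoint of the interval.
    assert abs((low + high) / 2 - m) < 1e-9
    # Point 3: width = upper - lower = 2 x reliability factor x standard error.
    assert abs((high - low) - 2 * half) < 1e-9
    widths[level] = high - low

# Point 2: higher confidence gives a wider interval.
assert widths[0.90] < widths[0.95] < widths[0.99]
```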
It is very rare for a researcher wishing to estimate the mean of a population to already know its standard deviation. Therefore, the construction of a confidence interval almost always involves the estimation of both μ and σ.

Students' t-Distribution

When σ is known, the formula m - z σm <= μ <= m + z σm is used for a confidence interval. When σ is not known, sm = s/√N (where s is the sample standard deviation and N is the sample size) is used as an estimate of σm. Whenever the standard deviation is estimated, the t rather than the normal (z) distribution should be used. For the same level of confidence, the values of t are larger than the values of z, so confidence intervals when σ is estimated are wider than confidence intervals when σ is known. The formula for a confidence interval for μ when σ is estimated is:

m - t sm <= μ <= m + t sm

where m is the sample mean, sm is an estimate of σm, and t depends on the degrees of freedom and the level of confidence.

The t-distribution is a symmetrical probability distribution defined by a single parameter known as degrees of freedom (df). Each value for the number of degrees of freedom defines one distribution in this family of distributions. Like a standard normal distribution (i.e. a z-distribution), the t-distribution is symmetrical around its mean. The t-distribution has the following characteristics, some of which distinguish it from the standard normal distribution.
  • It is an estimated standardized normal distribution. When n gets larger, t approximates z (s approaches σ).
  • The mean is 0, and the distribution is bell-shaped.
  • There is not one t-distribution, but a family of t-distributions. All t-distributions have the same mean of 0. Standard deviations of these t-distributions differ according to the degrees of freedom, and hence the sample size, n.
  • The shape depends on degrees of freedom (n - 1). The t-distribution is less peaked than a standard normal distribution, and has fatter tails (i.e. more probability in the tails).
  • tα/2 is greater than zα/2 for a given level of significance, α, and the gap narrows as the degrees of freedom increase.
  • Its variance is v/(v-2) (for v > 2), where v = n-1. It is always bigger than 1. As v increases, the variance approaches 1.
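
As a quick illustration of the last bullet, the sketch below evaluates v/(v-2) for a few degrees of freedom to show the variance falling toward 1. The function name `t_variance` is an assumption made for this example.

```python
def t_variance(df):
    """Variance of a t-distribution; the formula v/(v-2) applies only for v > 2."""
    if df <= 2:
        raise ValueError("the formula v/(v-2) requires df > 2")
    return df / (df - 2)

for df in (3, 5, 10, 29, 100, 1000):
    # Always above 1, and closer to 1 as df grows.
    print(f"df = {df:>4}: variance = {t_variance(df):.4f}")
```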

The value of t can be determined from a t table. The degrees of freedom for t equal the degrees of freedom for the estimate of σm, namely N-1.

[Table: a portion of a t-table; not reproduced here.]

Suppose the sample size (n) is 30, and the level of significance (α) is 5%. df = n - 1 = 29. tα/2 = t0.025 = 2.045 (find the 29 df row, and then move to the column for 0.025 in one tail, which is labelled 0.05 in a two-tailed table).
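
If no t table is at hand, the critical value can be approximated numerically. The sketch below is one possible approach using only the standard library: it integrates the t density with Simpson's rule and finds the critical value by bisection. The names `t_pdf`, `t_right_tail` and `t_critical`, the step count, and the search range of 0 to 100 are all assumptions of this example (the range is too narrow for very small df combined with very small tail probabilities).

```python
from math import exp, lgamma, pi, sqrt

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coef = exp(lgamma((df + 1) / 2) - lgamma(df / 2)) / sqrt(df * pi)
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_right_tail(x, df, steps=2000):
    """P(T > x) for x >= 0, via composite Simpson's rule on [0, x]."""
    if x == 0:
        return 0.5
    h = x / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

def t_critical(tail_prob, df):
    """Find t such that P(T > t) = tail_prob, by bisection on [0, 100]."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_right_tail(mid, df) > tail_prob:
            lo = mid   # too much area in the tail: move right
        else:
            hi = mid
    return (lo + hi) / 2

print(round(t_critical(0.025, 29), 3))  # n = 30, alpha = 5%
```

In practice a statistics package would normally be used instead; for example, SciPy provides `scipy.stats.t.ppf` for the same purpose.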

Example

Assume a researcher is interested in estimating the mean reading speed (number of words per minute) of high-school graduates and computing the 95% confidence interval. A sample of 6 graduates was taken and the reading speeds were: 200, 240, 300, 410, 450, and 600. For these data,
  • m = 366.6667
  • sm = 60.9736
  • df = 6-1 = 5
  • t = 2.571

Therefore, the lower limit is: m - (t) (sm) = 209.904 and the upper limit is: m + (t) (sm) = 523.430. Therefore, the 95% confidence interval is:
209.904 <= μ <= 523.430

Thus, the researcher can be 95% confident that the mean reading speed of high-school graduates is between 209.904 and 523.430 words per minute.
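
The figures in this example, including sm, can be reproduced step by step. This is a sketch under the example's own assumptions; the t value of 2.571 is taken from a t table (df = 5, 0.025 in each tail) rather than computed.

```python
from math import sqrt
from statistics import mean, stdev

speeds = [200, 240, 300, 410, 450, 600]
t_value = 2.571            # from a t table: df = 5, 0.025 in each tail

n = len(speeds)
m = mean(speeds)           # sample mean
s = stdev(speeds)          # sample standard deviation (divides by n - 1)
s_m = s / sqrt(n)          # estimated standard error of the mean
lower = m - t_value * s_m
upper = m + t_value * s_m

print(f"m = {m:.4f}, s = {s:.4f}, sm = {s_m:.4f}")
print(f"{lower:.3f} <= mu <= {upper:.3f}")
```

Note that s (about 149.35) is the sample standard deviation, while sm (about 60.97) is s divided by the square root of the sample size.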

Summary of Computations
  1. Compute m = ∑X/N.
  2. Compute s
  3. Compute sm = s/√N
  4. Compute df = N-1
  5. Find t for these df using a t table
  6. Lower limit = m - t sm
  7. Upper limit = m + t sm
  8. Lower limit <= μ <= Upper limit
Assumptions:
  1. Normal distribution
  2. Scores are sampled randomly and are independent
Discuss the issues surrounding selection of the appropriate sample size

It's all starting to become a little confusing. Which distribution do you use?

When a large sample (generally more than 30 observations) is used, a z table can be used to construct the confidence interval. It does not matter whether the population distribution is normal, or whether the population variance is known. This is because the central limit theorem assures that when the sample is large, the distribution of the sample mean is approximately normal. However, when the variance is unknown the t-statistic is more conservative, because the t-statistic is greater than the z-statistic and therefore results in a wider confidence interval.

However, if there is only a small sample size, a t table has to be used to construct the confidence interval when the population distribution is normal and the population variance is not known.

If the population distribution is not normal, a confidence interval cannot be constructed from a small sample using the z or t approach described here (even if the population variance is known).

Therefore, all else equal, you should try to select a sample larger than 30. The larger the sample size, the more precise the confidence interval.
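
To see how precision improves with sample size, the sketch below computes the half-width z × σ/√n for several values of n. The values z = 1.96 and σ = 100 are borrowed from the SAT example for illustration, and `half_width` is an assumed helper name.

```python
from math import sqrt

def half_width(z, sigma, n):
    """Distance from the point estimate to either limit of the interval."""
    return z * sigma / sqrt(n)

for n in (10, 40, 160):
    print(f"n = {n:>3}: half-width = {half_width(1.96, 100, n):.2f}")
```

Because the standard error shrinks with the square root of n, quadrupling the sample size halves the width of the interval.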

In general, at least one of the following is needed:
  • A normal distribution for the population.
  • A sample size that is greater than or equal to 30.

If one or both of the above occur, then a z-table or t-table is used, dependent upon whether σ is known or unknown. If neither of the above occurs, then the question cannot be answered.

A summary of the situation is as follows:
  • If the population is normally distributed, and the population variance is known, use a z-score irrespective of sample size.
  • If the population is normally distributed, and the population variance is unknown, use a t-score irrespective of sample size.
  • If the population is not normally distributed, and the population variance is known, use a z-score only if n >= 30, otherwise it cannot be done.
  • If the population is not normally distributed, and the population variance is unknown, use a t-score only if n >= 30, otherwise it cannot be done.
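
The four cases in the summary can be expressed as a small decision function. This is a sketch of the rule as stated in these notes, not an official procedure; the function name and the use of `None` for "cannot be done" are choices made for this example.

```python
def reliability_factor_type(population_normal, variance_known, n):
    """Return 'z', 't', or None (no interval available) per the summary above."""
    large_sample = n >= 30
    if not population_normal and not large_sample:
        return None  # non-normal population with a small sample: cannot be done
    # Otherwise the choice depends only on whether the variance is known.
    return "z" if variance_known else "t"

print(reliability_factor_type(True, True, 10))     # normal, known variance
print(reliability_factor_type(True, False, 10))    # normal, unknown variance
print(reliability_factor_type(False, True, 50))    # non-normal, known, large n
print(reliability_factor_type(False, False, 10))   # non-normal, small n
```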

User Comments

  1. danlan: Good summary.
  2. akanimo: summary

    unknown variance - always use t score
    known variance - always use z score

    (note: if population is non-normal then n >= 30 for above to remain true .. else unsolvable)
  3. DAS11: Great notes.
  4. surob: Yeah, agreed. Good notes.
    Good comments too. Thanks
  5. olukayode: where do we get these t tables from please
  6. olukayode: sm is the standard deviation for the sample, so u calculate the s.d for the sample given
  7. StanleyMo: Wow, good notes. :)

    the table can be get from the appendix of CFA curriculum books.
  8. StanleyMo: Another point to note:

    However, the t-statistic is more conservative because the t-statistic tends to be greater than the z-statistic, and therefore using t-statistic will result in a wider confidence interval.
  9. JKiro: any knows how the sample std deviation (s(m)= 60.9736)was computed?
  10. ambar: Using the mean calculated, calculate the standard deviation s = [∑(mean - observation)/n]^1/2
    Using s, calculate s(m) = s/n^1/2 = 60.9736
  11. verhuizing: (1-a) % = confidence interval.
    suppose (1-a) = .95, then it is |0.025|0.95|0.025|
    Za/2 = 0.05 = 1.65 = 90%
    Za/2 = 0.025 = 1.96 = 95%
    Za 0.005 = 2.58 = 99%
  12. thekid: Jkiro...

    S= { Sum {(observation - mean)^2} / n -1 }^0.5 ....B/c SAMPLE standard deviation.
  13. thekid: Continuation from previous comment...

    Jkiro...

    "n-1" b.c sample standard deviation

    Then use 'S' to solve for S(m)=s/(n)^0.5
  14. jpducros: please note that the higher the df, the more peaked the curve, but the THINNER the tails. This is different from a leptokurtic distribution, where the higher the peak, the FATTER the tails. I think a little comment in the curriculum would be useful.
  15. EminYus: t-distributions given on exam?
  16. tankdan: Can't figure out how to get s(m)= 60.97 using the formula above. I'm doing each (obs-mean)^2 and summing all the results for 111,533.34. Then dividing by 5 (N-1), which = 22306.67. Then taking the square root which equals 149.35.

    What am I missing?!
  17. jjsiow: I cannot get 60.97 either. Can someone please help explain?
  18. johntan1979: The sd from calculations is 149.354165.

    Since sample sd is not known or given, we have to estimate sample sd by dividing 149.354165 by the square root of sample size i.e. 6^1/2

    giving you 60.973583
  19. Yrazzaq88: Perfect summary notes!
  20. Yrazzaq88: I really suggest that everyone who is struggling with this chapter to go back to the basics, and take things very slowly. I know time is limited, but you cannot rush with Stats. Everything starts from the basics and develops from there.
