##### Subject 5. Confidence Intervals for the Population Mean and Selection of Sample Size
Confidence intervals are typically constructed using the following structure:

Confidence Interval = Point Estimate ± Reliability Factor x Standard Error

• The point estimate is the value of a sample statistic of the population parameter.
• The reliability factor is a number based on the sampling distribution of the point estimate and the degree of confidence (1 - α).
• Standard error refers to the standard error of the sample statistic that is used to produce the point estimate.

Whatever the distribution of the population, the sample mean is always the point estimate used to construct the confidence intervals for the population mean. The reliability factor and the standard error, however, may vary depending on three factors:

• Distribution of population: normal or non-normal
• Population variance: known or unknown
• Sample size: large or small

z-Statistic: a standard normal random variable

If a population is normally distributed with a known variance, a z-statistic is used as the reliability factor to construct confidence intervals for the population mean.

In practice, the population standard deviation is rarely known. However, learning how to compute a confidence interval when the standard deviation is known is an excellent introduction to how to compute a confidence interval when the standard deviation has to be estimated.

Three values are used to construct a confidence interval for μ:

• the sample mean (m)
• the value of z (which depends on the level of confidence)
• the standard error of the mean (σm)

The confidence interval has m for its center and extends a distance equal to the product of z and σm in both directions. Therefore, the formula for a confidence interval is:

m - z σm <= μ <= m + z σm

For a 100(1 - α)% confidence interval for the population mean, the z-statistic to be used is zα/2. zα/2 denotes the point of the standard normal distribution such that α/2 of the probability falls in the right-hand tail.

Effectively, what is happening is that 100(1 - α)% of the area, the part that makes up the confidence interval, falls in the center of the graph, that is, symmetrically around the mean. This leaves a total of 100α% of the area in the two tails, or 100(α/2)% of the area in each tail.

Commonly used reliability factors are as follows:

• 90% confidence intervals: z0.05 = 1.645. α is 10%, with 5% in each tail
• 95% confidence intervals: z0.025 = 1.96. α is 5%, with 2.5% in each tail
• 99% confidence intervals: z0.005 = 2.575. α is 1%, with 0.5% in each tail
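The reliability factors listed above can be reproduced without a printed z-table. The following is a minimal sketch using only Python's standard library; the function name `z_reliability_factor` is an illustrative choice, not something from the curriculum.

```python
from statistics import NormalDist

def z_reliability_factor(confidence):
    """Return z(alpha/2) for a two-sided interval at the given confidence level."""
    alpha = 1 - confidence
    # inv_cdf gives the point with the requested probability to its left,
    # so 1 - alpha/2 leaves alpha/2 in the right-hand tail.
    return NormalDist().inv_cdf(1 - alpha / 2)

for confidence in (0.90, 0.95, 0.99):
    z = z_reliability_factor(confidence)
    print(f"{confidence:.0%} confidence: z = {z:.3f}")
```

The 99% factor comes out at about 2.576; the table values 2.575 and 2.58 used in these notes are roundings of that number.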

Example

Assume that the standard deviation of SAT verbal scores in a school system is known to be 100. A researcher wishes to estimate the mean SAT score and compute a 95% confidence interval from a random sample of 10 scores.

The 10 scores are: 320, 380, 400, 420, 500, 520, 600, 660, 720, and 780. Therefore, m = 530, N = 10, and σm = 100 / 10^(1/2) = 31.62. The value of z for the 95% confidence interval is the number of standard deviations one must go from the mean (in both directions) to contain .95 of the scores.

It turns out that one must go 1.96 standard deviations from the mean in both directions to contain .95 of the scores. The value of 1.96 was found using a z table. Since each tail is to contain .025 of the scores, you find the value of z for which 1 - 0.025 = 0.975 of the scores are below. This value is 1.96.

All the components of the confidence interval are now known: m = 530, σm = 31.62, z = 1.96.
Lower limit = 530 - (1.96)(31.62) = 468.02
Upper limit = 530 + (1.96)(31.62) = 591.98

Therefore, 468.02 ≤ μ ≤ 591.98. This means that the experimenter can be 95% certain that the mean SAT in the school system is between 468 and 592. This also means if the experimenter repeatedly took samples from the population and calculated a number of different 95% confidence intervals using the sample information, on average 95% of those intervals would contain μ. Notice that this is a rather large range of scores. Naturally, if a larger sample size had been used, the range of scores would have been smaller.

The computation of the 99% confidence interval is exactly the same except that 2.576 (often rounded to 2.58) rather than 1.96 is used for z. The 99% confidence interval is: 448.54 <= μ <= 611.46. As it must be, the 99% confidence interval is even wider than the 95% confidence interval.
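The repeated-sampling interpretation of the 95% interval can be checked with a small simulation. This is only a sketch under stated assumptions: the "true" mean of 500 is a made-up value chosen for illustration (the example never reveals μ), the population is assumed normal with σ = 100, and the function name `coverage_rate` is hypothetical.

```python
import math
import random
from statistics import NormalDist, mean

def coverage_rate(mu, sigma, n, confidence=0.95, trials=2000, seed=0):
    """Share of simulated z-intervals that contain the true mean mu."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    half_width = z * sigma / math.sqrt(n)  # z times the standard error
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        m = mean(sample)
        if m - half_width <= mu <= m + half_width:
            hits += 1
    return hits / trials

# mu = 500 is an assumed value for the simulation only.
rate = coverage_rate(mu=500, sigma=100, n=10)
print(f"Share of intervals containing mu: {rate:.3f}")
```

The printed share should land close to 0.95, although any single run will differ slightly from that figure.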

Summary of Computations

• Compute m = ∑X/N
• Compute σm = σ/N^(1/2)
• Find z (1.96 for 95% interval; 2.58 for 99% interval)
• Lower limit = m - z σm
• Upper limit = m + z σm
• Lower limit <= μ <= Upper limit
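The steps in this summary can be sketched in a few lines of Python, using the SAT scores from the example. The function name `z_confidence_interval` is an illustrative choice, and the sketch assumes σ is known and the scores are independent draws from a normal population.

```python
import math
from statistics import NormalDist, mean

def z_confidence_interval(scores, sigma, confidence=0.95):
    """Return (lower, upper) limits for the population mean when sigma is known."""
    n = len(scores)
    m = mean(scores)                         # point estimate
    sigma_m = sigma / math.sqrt(n)           # standard error of the mean
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # reliability factor
    return m - z * sigma_m, m + z * sigma_m

scores = [320, 380, 400, 420, 500, 520, 600, 660, 720, 780]
lower, upper = z_confidence_interval(scores, sigma=100, confidence=0.95)
print(f"{lower:.2f} <= mu <= {upper:.2f}")
```

With `confidence=0.95` the limits agree with the 468.02 and 591.98 computed by hand. Passing `confidence=0.99` gives limits of roughly 448.5 and 611.5; small differences from a hand calculation come from how far z is rounded.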

Assumptions:

• Normal distribution
• σ is known
• Scores are sampled randomly and are independent

There are three other points worth mentioning here:

• The point estimate will always lie exactly at the midway mark of the confidence interval. This is because it is the "best" estimate for μ, and so the confidence interval expands out from it in both directions.
• The higher the percentage of confidence, the wider the interval will be. As the percentage is increased, a wider interval is needed to give us a greater chance of capturing the unknown population value within that interval.
• The width of the confidence interval is always twice the part after the positive or negative sign, that is, twice the reliability factor x standard error. The width is simply the upper limit minus the lower limit.

It is very rare for a researcher wishing to estimate the mean of a population to already know its standard deviation. Therefore, the construction of a confidence interval almost always involves the estimation of both μ and σ.

Students' t-Distribution

When σ is known, the formula m - z σm <= μ <= m + z σm is used for a confidence interval. When σ is not known, sm = s/N^(1/2) (where s is the sample standard deviation and N is the sample size) is used as an estimate of σm. Whenever the standard deviation is estimated, the t rather than the normal (z) distribution should be used. The values of t are larger than the values of z, so confidence intervals when σ is estimated are wider than confidence intervals when σ is known. The formula for a confidence interval for μ when σ is estimated is:

m - t sm <= μ <= m + t sm

where m is the sample mean, sm is an estimate of σm, and t depends on the degrees of freedom and the level of confidence.

The t-distribution is a symmetrical probability distribution defined by a single parameter known as degrees of freedom (df). Each value for the number of degrees of freedom defines one distribution in this family of distributions. Like the standard normal distribution (i.e., the z-distribution), the t-distribution is symmetrical around its mean. The t-distribution has the following characteristics, several of which distinguish it from the standard normal distribution.

• It is an estimated standardized normal distribution. When n gets larger, t approximates z (s approaches σ).
• The mean is 0 and the distribution is bell-shaped.
• There is not one t-distribution, but a family of t-distributions. All t-distributions have the same mean of 0. Standard deviations of these t-distributions differ according to the sample size, n.
• The shape of the distribution depends on degrees of freedom (n - 1). The t-distribution is less peaked than a standard normal distribution and has fatter tails (i.e., more probability in the tails).
• tα/2 tends to be greater than zα/2 for a given level of significance, α.
• Its variance is v/(v-2) (for v > 2), where v = n-1. It is always larger than 1. As v increases, the variance approaches 1.
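To see how quickly the variance v/(v - 2) from the last bullet approaches 1, here is a short sketch; the function name `t_variance` is an illustrative choice.

```python
def t_variance(df):
    """Variance of a t-distribution with df degrees of freedom (needs df > 2)."""
    if df <= 2:
        raise ValueError("the formula v/(v - 2) applies only for v > 2")
    return df / (df - 2)

for df in (3, 5, 10, 30, 100, 1000):
    print(f"df = {df:>4}: variance = {t_variance(df):.4f}")
```

The variance is 3 at df = 3 and falls steadily toward 1 as df grows, which is the sense in which t approximates z for large samples.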

The value of t can be determined from a t-table. The degrees of freedom for t are equal to the degrees of freedom for the estimate of σm, which is equal to N-1.

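A t-table lists critical values of t, with one row per number of degrees of freedom and one column per significance level.

Suppose the sample size (n) is 30 and the level of significance (α) is 5%. df = n - 1 = 29. tα/2 = t0.025 = 2.045 (Find the 29 df row, and then move to the column for a two-tailed significance level of 0.05, that is, 0.025 in each tail.)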
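The table lookup can be approximated numerically. The sketch below uses only Python's standard library: it integrates the t density with Simpson's rule and then searches for the critical value by bisection. The helper names (`t_pdf`, `t_cdf`, `t_critical`) are illustrative, and this is an approximation for checking table values, not a replacement for a statistics library routine such as `scipy.stats.t.ppf`.

```python
import math
from statistics import NormalDist

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """P(T <= x) for x >= 0: 0.5 (by symmetry) plus Simpson's rule on [0, x]."""
    h = x / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_critical(df, alpha):
    """Two-tailed critical value: alpha/2 of the probability lies to its right."""
    target = 1 - alpha / 2
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < target:  # widen the bracket until it contains the answer
        lo, hi = hi, hi * 2
    for _ in range(50):            # bisection
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

z = NormalDist().inv_cdf(0.975)
for df in (5, 29, 120):
    print(f"df = {df:>3}: t = {t_critical(df, 0.05):.3f}   (z = {z:.3f})")
```

For df = 29 and α = 5% the search returns a value that rounds to 2.045, matching the table lookup described above, and every t value printed is larger than the corresponding z.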

Example

Assume a researcher is interested in estimating the mean reading speed (number of words per minute) of high school graduates and computing the 95% confidence interval. A sample of 6 graduates was taken; reading speeds were: 200, 240, 300, 410, 450, and 600. For these data,

• m = 366.6667
• sm = 60.9736
• df = 6-1 = 5
• t = 2.571

Therefore, the lower limit is: m - (t) (sm) = 209.904 and the upper limit is: m + (t) (sm) = 523.430. Therefore, the 95% confidence interval is: 209.904 <= μ <= 523.430.

Thus, the researcher can be 95% sure that the mean reading speed of high school graduates is between 209.904 and 523.430.

Summary of Computations

• Compute m = ∑X/N
• Compute s
• Compute sm = s/N^(1/2)
• Compute df = N-1
• Find t for these df using a t table
• Lower limit = m - t sm
• Upper limit = m + t sm
• Lower limit <= μ <= Upper limit
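The reading-speed example can be reproduced step by step. This is a sketch using the standard library; the function name `t_confidence_interval` is illustrative, and the t value of 2.571 (df = 5, 95% confidence) is taken from the example rather than computed.

```python
import math
from statistics import mean, stdev

def t_confidence_interval(data, t_value):
    """Return (m, s, s_m, lower, upper) for a t-based interval around the mean."""
    n = len(data)
    m = mean(data)           # sample mean
    s = stdev(data)          # sample standard deviation (divides by n - 1)
    s_m = s / math.sqrt(n)   # estimated standard error of the mean
    return m, s, s_m, m - t_value * s_m, m + t_value * s_m

speeds = [200, 240, 300, 410, 450, 600]
# t = 2.571 is the table value for df = 5 at 95% confidence, as in the example.
m, s, s_m, lower, upper = t_confidence_interval(speeds, t_value=2.571)
print(f"m = {m:.4f}, s = {s:.4f}, s_m = {s_m:.4f}")
print(f"{lower:.3f} <= mu <= {upper:.3f}")
```

Note the two separate quantities: s (about 149.35) is the sample standard deviation, while sm (about 60.97) is s divided by the square root of N. Only sm enters the interval.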

Assumptions:

• Normal distribution
• Scores are sampled randomly and are independent.

Discuss the issues surrounding selection of the appropriate sample size

It's all starting to become a little confusing. Which distribution do you use?

When a large sample (generally more than 30 observations) is used, a z-table can always be used to construct the confidence interval. It does not matter if the population distribution is normal or if the population variance is known. This is because the central limit theorem assures us that when the sample is large, the distribution of the sample mean is approximately normal. However, the t-statistic is more conservative because it tends to be greater than the z-statistic; therefore, using a t-statistic will result in a wider confidence interval.

If there is only a small sample size, a t-table has to be used to construct the confidence interval when the population distribution is normal and the population variance is not known.

If the population distribution is not normal, there is no way to construct a confidence interval from a small sample (even if the population variance is known).

Therefore, if all other factors are equal, you should try to select a sample larger than 30. The larger the sample size, the more precise the confidence interval.
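The effect of sample size on precision can be made concrete with the known-σ case from the SAT example (σ = 100). This is a minimal sketch; the function name `half_width` and the sample sizes tried are illustrative choices.

```python
import math
from statistics import NormalDist

def half_width(sigma, n, confidence=0.95):
    """Reliability factor times standard error: the distance from m to each limit."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sigma / math.sqrt(n)

for n in (10, 40, 160):
    hw = half_width(sigma=100, n=n)
    print(f"n = {n:>3}: m +/- {hw:.2f}  (total width {2 * hw:.2f})")
```

Because the standard error is σ divided by the square root of n, the sample size has to be quadrupled to cut the width of the interval in half.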

In general, at least one of the following is needed:

• a normal distribution for the population
• a sample size that is greater than or equal to 30

If one or both of the above occur, a z-table or t-table is used, dependent upon whether σ is known or unknown. If neither of the above occurs, then the question cannot be answered.

A summary of the situation is as follows:

• If the population is normally distributed and the population variance is known, use a z-score (irrespective of sample size).
• If the population is normally distributed and the population variance is unknown, use a t-score (irrespective of sample size).
• If the population is not normally distributed, and the population variance is known, use a z-score only if n >= 30; otherwise, it cannot be calculated.
• If the population is not normally distributed and the population variance is unknown, use a t-score only if n >= 30; otherwise, it cannot be calculated.
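The four rules above amount to a small decision function. The sketch below encodes them exactly as listed; the function name `reliability_factor_type` and its return values are hypothetical, chosen only for illustration.

```python
def reliability_factor_type(population_normal, variance_known, n):
    """Return "z", "t", or None (interval cannot be calculated) per the summary."""
    if not population_normal and n < 30:
        # Non-normal population with a small sample: no interval available.
        return None
    # Otherwise the choice depends only on whether the variance is known.
    return "z" if variance_known else "t"

cases = [
    (True, True, 10),    # normal, known variance, small sample
    (True, False, 10),   # normal, unknown variance, small sample
    (False, True, 50),   # non-normal, known variance, large sample
    (False, False, 50),  # non-normal, unknown variance, large sample
    (False, False, 10),  # non-normal, unknown variance, small sample
]
for normal, known, n in cases:
    print(normal, known, n, "->", reliability_factor_type(normal, known, n))
```

This follows the summary list rather than the looser statement that a z-table can always be used for large samples; with an unknown variance it returns "t", the more conservative choice.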

User Comment
danlan Good summary.
akanimo summary

unknown variance - always use t score
known variance - always use z score

(note: if population is non-normal then n >= 30 for above to remain true .. else unsolvable)
DAS11 Great notes.
surob Yeah, agreed. Good notes.
olukayode where do we get these t tables from please
olukayode sm is the standard deviation for the sample, so u calculate the s.d for the sample given
StanleyMo Wow, good notes. :)

The table can be found in the appendix of the CFA curriculum books.
StanleyMo Another point to note:

However, the t-statistic is more conservative because the t-statistic tends to be greater than the z-statistic, and therefore using the t-statistic will result in a wider confidence interval.
JKiro Does anyone know how the sample std deviation (s(m) = 60.9736) was computed?
ambar Using the mean calculated, calculate the standard deviation s = [∑(observation - mean)^2/(n - 1)]^1/2
Using s, calculate s(m) = s/n^1/2 = 60.9736
verhuizing (1-a) % = confidence interval.
Suppose (1-a) = .95, then the split is |0.025|0.95|0.025|
a/2 = 0.05: z = 1.65 (90%)
a/2 = 0.025: z = 1.96 (95%)
a/2 = 0.005: z = 2.58 (99%)
thekid Jkiro...

S = { Sum {(observation - mean)^2} / (n - 1) }^0.5 ....B/c SAMPLE standard deviation.
thekid Continuation from previous comment...

Jkiro...

"n-1" b.c sample standard deviation

Then use 'S' to solve for S(m)=s/(n)^0.5
jpducros Please note that the higher the df, the more peaked the curve, but the THINNER the tails. This is different from a leptokurtic distribution, where the higher the peak, the FATTER the tails. I think a little comment in the curriculum would be useful.
EminYus t-distributions given on exam?
tankdan Can't figure out how to get s(m) = 60.97 using the formula above. I'm doing each (obs-mean)^2 and summing all the results for 111,533.34. Then dividing by 5 (N-1), which = 22306.67. Then taking the square root, which equals 149.35.

What am I missing?!
johntan1979 The sd from calculations is 149.354165.

That is the sample sd (s). To get the standard error of the mean, s(m), divide 149.354165 by the square root of the sample size, i.e. 6^1/2

giving you 60.973583
Yrazzaq88 Perfect summary notes!
Yrazzaq88 I really suggest that everyone who is struggling with this chapter go back to the basics, and take things very slowly. I know time is limited, but you cannot rush with Stats. Everything starts from the basics and develops from there.
michaeloa3 Great Summary!
guest it makes me feel a lot better when analystNotes admits it; "it's all starting to become a little confusing.."
at least, I'm being understood :)
maryprz14 wider confidence interval=>> more conservative?
maryprz14 To answer my own question;
tValues>zValues =>> wider confidence interval using t-Table =>> more conservative
got it :)
akhlo When population standard deviation is known, use z-distribution. First find the sample mean (average of all observations), then find the standard error of the mean, which is the population standard deviation divided by the square root of N. Find the z-score in the z-table in accordance with the CI required. The interval limits (lower and upper) are sample mean +/- standard error x z-score.