Confidence Intervals for a Population Proportion: A Basic Understanding in Python

A AKSHAY
5 min read · Feb 7, 2020

Today I am going to explain confidence intervals for population proportions. Before going through this article, please review the theory of confidence intervals.

The first major topic is learning about proportions.

In a sample of size 𝑁 there are 𝑀 “successes” (say, people who clicked on an advertisement) and 𝑁−𝑀 “failures” (everyone else, who did not click on the advertisement). The sample proportion is then:

$$\hat{p} = \frac{M}{N}$$

In fact, if your data 𝑥𝑖 is 1 for every “success” and 0 for every “failure”, then we can say:

$$\hat{p} = \frac{1}{N}\sum_{i=1}^{N} x_i = \bar{x}$$

That is, the sample proportion is the sample mean of the dataset.
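To make this concrete, here is a minimal sketch using a small made-up dataset (the ten values in `clicks` are hypothetical and only for illustration). It computes the proportion as 𝑀/𝑁 and as the mean of the 0/1 data, and the two agree.

```python
from statistics import mean

# Hypothetical click data: 1 = clicked ("success"), 0 = did not click ("failure")
clicks = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]

N = len(clicks)   # sample size
M = sum(clicks)   # number of successes

p_hat = M / N               # sample proportion, M / N
sample_mean = mean(clicks)  # mean of the 0/1 data

print(p_hat, sample_mean)
```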

Let’s say we want to know what proportion of visitors (including future visitors, not yet seen) will click on our ad based on previous data. How can we go from a sample proportion to a statement about the population proportion? This is where confidence intervals come into play.

Confidence Interval for Population Proportion

We can construct a confidence interval, an interval we believe will contain the true population proportion of visitors who click our ad. We have an interval with a lower and upper bound and we believe that the true population proportion is within this interval with some level of confidence. For a 95% confidence interval, we are “95% confident” the true proportion is in the interval (in the sense that such intervals contain the population proportion 95% of the time).

The classical way to construct this interval is:

$$\hat{p} \pm z_{1-\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{N}}$$

where 𝑧𝑝 is the 100×𝑝th percentile of the standard Normal distribution and α is the significance level (one minus the confidence level).

In Python, the statsmodels package can be used for statistical computations such as computing a confidence interval.

Let’s suppose that on a certain website, out of 1126 visitors on a given day, 310 clicked on an ad purchased by a sponsor. Let’s construct a confidence interval for the population proportion of visitors who click the ad.

import statsmodels.api as sm

310 / 1126    # Sample proportion

# Function for computing confidence intervals
from statsmodels.stats.proportion import proportion_confint

proportion_confint(count=310,          # Number of "successes"
                   nobs=1126,          # Number of trials
                   alpha=(1 - 0.95))   # Alpha, which is 1 minus the confidence level

OUTPUT:
(0.24922129423231776, 0.30140037539468045)
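As a sanity check, the same interval can be computed by hand. The sketch below uses only the Python standard library and the normal-approximation interval (sample proportion plus or minus a Normal percentile times the standard error), which I am assuming is what proportion_confint computes with its default method. The variable names are my own.

```python
from math import sqrt
from statistics import NormalDist

count, nobs = 310, 1126   # successes and trials from the example above
confidence = 0.95
alpha = 1 - confidence    # significance level

p_hat = count / nobs                           # sample proportion
z = NormalDist().inv_cdf(1 - alpha / 2)        # Normal percentile, about 1.96
margin = z * sqrt(p_hat * (1 - p_hat) / nobs)  # half-width of the interval

lower, upper = p_hat - margin, p_hat + margin
print(lower, upper)
```

The two bounds agree with the statsmodels output above to several decimal places.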

If we wanted a 99% confidence interval, we would have a wider interval, but more confidence that the true proportion lies in this interval.

proportion_confint(310, 1126, alpha=(1 - 0.99))

OUTPUT:
(0.24102336643386685, 0.30959830319313136)
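The claim that such intervals contain the population proportion about 95% of the time can be checked with a small simulation. This is only a sketch under assumptions: I pick a hypothetical true proportion of 0.3, reuse the sample size of 1126, repeat the experiment 1000 times, and count how often the normal-approximation interval covers the true value.

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(0)  # fixed seed so the run is repeatable

true_p = 0.3          # hypothetical "true" click proportion (an assumption)
nobs = 1126           # visitors per simulated day
n_experiments = 1000  # number of simulated days
z = NormalDist().inv_cdf(0.975)  # percentile for a 95% interval

covered = 0
for _ in range(n_experiments):
    # Simulate one day: each visitor clicks with probability true_p
    count = sum(random.random() < true_p for _ in range(nobs))
    p_hat = count / nobs
    margin = z * sqrt(p_hat * (1 - p_hat) / nobs)
    if p_hat - margin <= true_p <= p_hat + margin:
        covered += 1

coverage = covered / n_experiments
print(coverage)  # fraction of intervals that contain true_p
```

The printed fraction should land close to 0.95, though it will vary a little with the seed and the number of repetitions.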

Now we will use hypothesis testing on a business use case, starting with the problem statement.

The website administrator claims that 30% of visitors to the website click the advertisement. Is this true? The sample proportion does not match the administrator’s claim, but this does not discredit the claim.

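We will do a statistical test to test the administrator’s claim. We test the null hypothesis:

$$H_0: p = 0.3$$

(where 𝑝 denotes the true proportion of visitors who click the ad on the site) against the alternative hypothesis:

$$H_A: p \neq 0.3$$

How do we do this? We first compute a test statistic:

$$z = \frac{\hat{p} - p_0}{\sqrt{\hat{p}(1-\hat{p})/N}}$$

where 𝑝0 = 0.3 is the hypothesized proportion. (Many textbooks put 𝑝0 in place of 𝑝̂ under the square root; the form shown here is the one that reproduces the statsmodels result in this article.)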

We then compute a 𝑝-value, which can be interpreted as the probability, assuming 𝐻0 is true, of observing a test statistic at least as “extreme” as the test statistic actually observed. If the 𝑝-value is small, we will reject 𝐻0 and conclude that the administrator’s claim is false; the proportion of visitors who click the ad is not 0.3. If the 𝑝-value is not small, then we do not reject 𝐻0; the evidence from our data does not contradict the administrator’s claim.

What counts as a “small” 𝑝-value? Here, we will decide that if a 𝑝-value is less than 0.05, then the 𝑝-value is “small” and we reject the null hypothesis. If we see a 𝑝-value greater than 0.05, we will not reject the null hypothesis. (We could have chosen a number other than 0.05; maybe 0.01 if we wanted to err on the side of not contradicting the administrator.)

I now conduct the test and compute the 𝑝-value.

# Performs the test just described
from statsmodels.stats.proportion import proportions_ztest

res = proportions_ztest(count=310,
                        nobs=1126,
                        value=0.3,                # The hypothesized value of population proportion p
                        alternative='two-sided')  # Tests the "not equal to" alternative hypothesis
res   # A tuple; the first entry is the value of the test statistic, and the second is the p-value

OUTPUT:
(-1.8547614674673856, 0.063630296776840831)

Here, we got a test statistic of 𝑧≈−1.85 and a 𝑝-value of ≈0.0636>0.05. We conclude there is not enough statistical evidence to disagree with the website administrator.
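To see where these two numbers come from, here is a hand computation using only the standard library. One assumption is worth flagging: the standard error below is built from the sample proportion rather than from the hypothesized value 0.3. That appears to be what proportions_ztest does when its prop_var argument is left at its default, since it reproduces the output above.

```python
from math import sqrt
from statistics import NormalDist

count, nobs = 310, 1126
p0 = 0.3  # hypothesized population proportion

p_hat = count / nobs
# Standard error based on the sample proportion (assumed statsmodels default)
se = sqrt(p_hat * (1 - p_hat) / nobs)

z = (p_hat - p0) / se                         # test statistic
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value

print(z, p_value)
```

The result also lines up with the 95% confidence interval computed earlier: 0.3 lies inside (0.249, 0.301), which is consistent with not rejecting the null hypothesis at the 0.05 level.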

Testing for Common Proportions.

The website decides to conduct an experiment. One day, the website shows its visitors different versions of an advertisement created by a sponsor. Users are randomly assigned to Version A and Version B. The website tracks how often Version A was clicked and how often Version B was clicked.

On this day, 516 visitors saw Version A of the ad, and 510 saw Version B. Of those who saw Version A, 108 clicked the ad, while 144 clicked Version B when shown.

Which ad generates more clicks?

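Here we test the following hypotheses:

$$H_0: p_A = p_B \qquad \text{versus} \qquad H_A: p_A \neq p_B$$

The test statistic for this test is:

$$z = \frac{\hat{p}_A - \hat{p}_B}{\sqrt{\hat{p}(1-\hat{p})\left(\frac{1}{n_A} + \frac{1}{n_B}\right)}}$$

where 𝑝̂𝐴 and 𝑝̂𝐵 are the sample proportions for group A and group B, 𝑛𝐴 and 𝑛𝐵 are the two group sizes, and 𝑝̂ is the proportion from the pooled sample (grouping A and B together). proportions_ztest() can perform this test.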

import numpy as np

proportions_ztest(count=np.array([108, 144]),
                  nobs=np.array([516, 510]),
                  alternative='two-sided')

OUTPUT:
(-2.7179204953199174, 0.0065693621488401655)

With a p-value of about 0.0066, which is small, we reject the null hypothesis; it appears that the two ads do not have the same proportion of clicks.
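The same numbers can be reproduced by hand with the pooled proportion. This is a minimal standard-library sketch; the variable names are mine and not part of statsmodels.

```python
from math import sqrt
from statistics import NormalDist

clicks_a, views_a = 108, 516  # Version A: clicks and visitors
clicks_b, views_b = 144, 510  # Version B: clicks and visitors

p_a = clicks_a / views_a
p_b = clicks_b / views_b
# Pooled proportion: treat both groups as one sample
p_pool = (clicks_a + clicks_b) / (views_a + views_b)

se = sqrt(p_pool * (1 - p_pool) * (1 / views_a + 1 / views_b))
z = (p_a - p_b) / se
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided p-value

print(p_a, p_b)
print(z, p_value)
```

The negative sign of the test statistic reflects that Version A’s sample click rate (about 20.9%) is lower than Version B’s (about 28.2%), so in this sample Version B is the one drawing more clicks.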

Thanks for reading. If you liked this article please follow me and share.
