Research Methodology Chapter 11.2


Chi-Square Test

A chi-square (χ2) statistic measures how expectations compare to actual observed data (or model results). The data used in calculating a chi-square statistic must be random, raw, mutually exclusive, drawn from independent variables, and drawn from a large enough sample.

That is, chi-square (χ2) tests are statistical hypothesis tests that are valid to perform when the test statistic is chi-squared distributed under the null hypothesis.

In the standard applications of this test, the observations are classified into mutually exclusive classes. If the null hypothesis is true, the test statistic computed from the observations follows a χ2 distribution. The purpose of the test is to evaluate how likely the observed frequencies would be assuming the null hypothesis is true. Test statistics that follow a χ2 distribution arise when the observations are independent and normally distributed, assumptions that are often justified under the central limit theorem. There are also χ2 tests for testing the null hypothesis of independence of a pair of random variables based on observations of the pairs.

The name also covers tests in which the sampling distribution of the test statistic (if the null hypothesis is true) only approximates a chi-squared distribution, doing so more and more closely as sample sizes increase.

Types of chi-square (χ2) tests

There are two types of chi-square tests. Both use the chi-square statistic and distribution for different purposes:

1. A chi-square test for independence compares two variables in a contingency table to see if they are related. In a more general sense, it tests whether distributions of categorical variables differ from one another.

2. A chi-square goodness-of-fit test determines whether sample data match a population.

 

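A very small chi-square test statistic means that your observed data fit your expected data extremely well. In other words, the observations are consistent with the null hypothesis.

A very large chi-square test statistic means that the data do not fit very well. In other words, the observations are not consistent with the null hypothesis; in a test for independence, this points to a relationship between the two variables.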

 

Assumptions

Like many inference procedures, chi-square tests have underlying assumptions that should be met for the results of the calculations to be trustworthy. They include:

1. The data in the cells should be frequencies, or counts of cases, rather than percentages or some other transformation of the data.

2. The levels (or categories) of the variables are mutually exclusive. That is, a particular subject fits into one and only one level of each of the variables.

3.  Each subject may contribute data to one and only one cell in the χ2. If, for example, the same subjects are tested over time such that the comparisons are of the same subjects at Time 1, Time 2, Time 3, etc., then χ2 may not be used.

4. The study groups must be independent. This means that a different test must be used if the two groups are related. For example, a different test must be used if the researcher’s data consists of paired samples, such as in studies in which a parent is paired with his or her child.

5. There are 2 variables, and both are measured as categories, usually at the nominal level. However, data may be ordinal data. Interval or ratio data that have been collapsed into ordinal categories may also be used. While Chi-square has no rule about limiting the number of cells (by limiting the number of categories for each variable), a very large number of cells (over 20) can make it difficult to meet assumption #6 below, and to interpret the meaning of the results. 

6. The expected value of each cell should be 5 or more in at least 80% of the cells, and no cell should have an expected value of less than one (3). This assumption is most likely to be met if the sample size equals at least the number of cells multiplied by 5. Essentially, this assumption specifies the number of cases (sample size) needed to use the χ2 for any number of cells in that χ2.
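Assumption 6 is easy to check mechanically once the expected counts are known. The following is a minimal sketch in plain Python, not a standard library routine: the function name `meets_expected_count_rule` and the example tables are hypothetical and exist only to illustrate the 80% rule and the minimum of one.

```python
def meets_expected_count_rule(expected):
    """Check assumption 6 on a table of expected counts.

    Returns True when at least 80% of the cells have an expected
    count of 5 or more and no cell has an expected count below 1.
    """
    cells = [value for row in expected for value in row]
    share_at_least_5 = sum(value >= 5 for value in cells) / len(cells)
    return share_at_least_5 >= 0.8 and min(cells) >= 1


# Hypothetical expected counts, invented for illustration only.
mostly_large = [[6.0, 8.0, 4.5], [7.0, 9.0, 5.5]]  # 5 of 6 cells are >= 5
too_sparse = [[2.9, 1.1], [2.1, 0.9]]              # no cell reaches 5

print(meets_expected_count_rule(mostly_large))
print(meets_expected_count_rule(too_sparse))
```

The first table passes because one cell in six may fall below 5, as long as none falls below 1; the second fails on both conditions.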

 

Hypotheses:

Null Hypothesis (H0): There is “no change” or “no difference” in the situation.

Alternative Hypothesis (H1): There is a “change” or “difference” in the situation.


I. Test of Independence

When considering student sex and course choice, a χ2 test for independence could be used. To do this test, the researcher would collect data on the two chosen variables (sex and courses picked) and then compare the frequencies at which male and female students select among the offered classes using the χ2 formula and a χ2 statistical table.

If there is no relationship between sex and course selection (that is, if they are independent), then male and female students should select each offered course at approximately the same rate. Equivalently, the proportion of male and female students in any selected course should be approximately equal to the proportion of male and female students in the sample.

A χ2 test for independence can tell us how likely it is that random chance can explain any observed difference between the actual frequencies in the data and these theoretical expectations.

Problem

Imagine you have surveyed 200 individuals to determine if there is a significant association between gender and preference for a particular smartphone brand.

Solution

Step 1: Formulate Hypotheses

Null Hypothesis (H0): There is no association between gender and smartphone brand preference.

Alternative Hypothesis (Ha): There is a significant association between gender and smartphone brand preference.

Step 2: Collect Data

 

|        | iPhone | Samsung | Other |
|--------|--------|---------|-------|
| Male   | 30     | 40      | 10    |
| Female | 20     | 50      | 50    |

Step 3: Set Significance Level

Choose a significance level (commonly 0.05) to determine if the observed association is statistically significant.

Step 4: Create a Contingency Table

Sum the rows and columns and create a contingency table:

 

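|              | iPhone | Samsung | Other | Row Total |
|--------------|--------|---------|-------|-----------|
| Male         | 30     | 40      | 10    | 80        |
| Female       | 20     | 50      | 50    | 120       |
| Column Total | 50     | 90      | 60    | 200       |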

Step 5: Calculate Expected Frequencies

Calculate the expected frequency for each cell using the formula:

Expected Frequency = (Row Total × Column Total) / Grand Total

Calculation for the cell in the first row, first column (Male, iPhone):

Expected Frequency = (80 × 50) / 200

Expected Frequency = 20

Repeat this calculation for each cell in the table.
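Repeating the calculation for every cell is mechanical, so it can be scripted. The sketch below uses plain Python and the observed counts from the survey table; the variable names are illustrative choices, not part of any library.

```python
brands = ["iPhone", "Samsung", "Other"]
observed = {"Male": [30, 40, 10], "Female": [20, 50, 50]}

# Marginal totals of the contingency table.
row_totals = {group: sum(counts) for group, counts in observed.items()}
col_totals = [sum(column) for column in zip(*observed.values())]
grand_total = sum(row_totals.values())

# Expected frequency = (row total * column total) / grand total.
expected = {
    group: [row_totals[group] * col / grand_total for col in col_totals]
    for group in observed
}

for group, values in expected.items():
    print(group, dict(zip(brands, values)))
```

Each expected row sums to its observed row total, which is a quick way to catch arithmetic slips when doing the calculation by hand.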

In the following table, the calculated expected frequencies are given in parentheses in each cell.

 

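|              | iPhone  | Samsung | Other   | Row Total |
|--------------|---------|---------|---------|-----------|
| Male         | 30 (20) | 40 (36) | 10 (24) | 80        |
| Female       | 20 (30) | 50 (54) | 50 (36) | 120       |
| Column Total | 50      | 90      | 60      | 200       |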

Step 6: Calculate Chi-Square (χ2) Statistic

χ2 = Σ (O − E)² / E

Where O is the observed frequency, E is the expected frequency, and the sum runs over all cells of the table.

For the given example, calculate the contribution of each cell and sum the contributions to get the chi-square value. In the table below, the value in parentheses is the contribution of each cell, rounded to two decimal places.

|        | iPhone                 | Samsung                | Other                  |
|--------|------------------------|------------------------|------------------------|
| Male   | (30 − 20)² / 20 (5.00) | (40 − 36)² / 36 (0.44) | (10 − 24)² / 24 (8.17) |
| Female | (20 − 30)² / 30 (3.33) | (50 − 54)² / 54 (0.30) | (50 − 36)² / 36 (5.44) |

χ2 = 5.00 + 0.44 + 8.17 + 3.33 + 0.30 + 5.44

χ2 ≈ 22.7
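The per-cell contributions and their sum can be recomputed as a check on the hand calculation. This is a minimal sketch in plain Python with illustrative variable names. Carried at full precision the total comes to about 22.69, so a total built from hand-rounded or truncated cell values can differ slightly in the second decimal place.

```python
observed = [[30, 40, 10], [20, 50, 50]]
expected = [[20, 36, 24], [30, 54, 36]]

# Contribution of each cell: (O - E)^2 / E.
contributions = [
    [(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
    for obs_row, exp_row in zip(observed, expected)
]

# The chi-square statistic is the sum over all cells.
chi_square = sum(sum(row) for row in contributions)

print([[round(value, 2) for value in row] for row in contributions])
print(round(chi_square, 2))
```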

Step 7: Determine Degrees of Freedom

Degrees of freedom (df) is calculated as df = (r−1) × (c−1),

Where, r is the number of rows and c is the number of columns.

For our example,

df = (2−1) × (3−1) = 2.

Step 8: Find Critical Value or P-value

Using a chi-square distribution table or statistical software, find the critical value or p-value corresponding to the degrees of freedom and the chosen significance level.

For df = 2 at a significance level of 0.05, the critical chi-square value is approximately 5.99.
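When no table is at hand, the df = 2 case can be worked out directly, because a chi-square distribution with 2 degrees of freedom has a simple closed form: the upper-tail probability is exp(−x/2), so the critical value at level α is −2 ln(α). The sketch below relies on that special case only. The helper names are hypothetical, and for other degrees of freedom a table or a statistics package (for example SciPy's `scipy.stats.chi2`) would be needed instead.

```python
import math


def chi2_sf_df2(x):
    """Upper-tail probability P(X >= x) for a chi-square variable with df = 2."""
    return math.exp(-x / 2)


def chi2_critical_df2(alpha):
    """Value exceeded with probability alpha when df = 2."""
    return -2 * math.log(alpha)


rows, cols = 2, 3
df = (rows - 1) * (cols - 1)
if df != 2:
    raise ValueError("these closed forms are valid only for df = 2")

alpha = 0.05
chi_square = 22.685  # statistic for the smartphone table at full precision

critical = chi2_critical_df2(alpha)
p_value = chi2_sf_df2(chi_square)

print(round(critical, 2))
print(p_value)
```

The p-value here is on the order of 0.00001, far below 0.05, which matches the conclusion reached by comparing the statistic with the critical value.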

Step 9: Make a Decision

Compare the calculated chi-square value with the critical value or use the p-value to determine whether to reject the null hypothesis.

If the calculated chi-square value is greater than the critical value or the p-value is less than the significance level, reject the null hypothesis.

In our example, the chi-square value calculated in Step 6 is greater than the critical value (5.99), so we reject the null hypothesis and conclude that there is a significant association between gender and smartphone brand preference.


II. Goodness-of-Fit Test

χ2 provides a way to test how well a sample of data matches the (known or assumed) characteristics of the larger population that the sample is intended to represent. This is known as goodness of fit. If the sample data do not fit the expected properties of the population that we are interested in, then we would not want to use this sample to draw conclusions about the larger population.

Problem

Imagine you are studying the wing colour variation in a population of 120 Monarch butterflies in a local meadow to determine if the wing colour follows the ratio 3:1:2 for orange, black, and white variations, respectively.

Solution

Step 1: Formulate Hypotheses

Null Hypothesis (H0): The proportion of Monarch butterflies with each colour variant follows the 3:1:2 ratio (that is, 60, 20, and 40 butterflies expected out of 120).

Alternative Hypothesis (Ha): The proportion of Monarch butterflies with each colour variant deviates significantly from the suspected ratio.

Step 2: Collect Data

| Colour | Observed Frequency |
|--------|--------------------|
| Orange | 60                 |
| Black  | 25                 |
| White  | 35                 |

Step 3: Set Significance Level

Choose a significance level (commonly 0.05) to determine if the observed deviation from the expected ratio is statistically significant.

Step 4: Calculate Expected Frequencies

Based on the 3:1:2 ratio, the expected frequency for each category is its expected proportion multiplied by the sample size:

Expected Frequency = expected proportion × 120

Thus, for orange the expected frequency is (3/6) × 120 = 60, for black it is (1/6) × 120 = 20, and for white it is (2/6) × 120 = 40.
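The conversion from a ratio to expected counts can be sketched as follows. This is an illustrative snippet in plain Python, with exact fractions used so that shares such as 1/6 are not rounded before being multiplied by the sample size; the variable names are assumptions made for the example.

```python
from fractions import Fraction

ratio = {"orange": 3, "black": 1, "white": 2}
sample_size = 120

total_parts = sum(ratio.values())

# Share of the population expected in each category under the ratio.
proportions = {colour: Fraction(parts, total_parts) for colour, parts in ratio.items()}

# Expected frequency = expected proportion * sample size.
expected = {colour: share * sample_size for colour, share in proportions.items()}

for colour in ratio:
    print(colour, proportions[colour], int(expected[colour]))
```

Note that a 3:1:2 ratio corresponds to shares of 1/2, 1/6, and 1/3, and that the expected counts must add up to the sample size.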

The following table lists the observed and the calculated expected frequencies.

| Colour | Observed Frequency | Expected Frequency |
|--------|--------------------|--------------------|
| Orange | 60                 | 60                 |
| Black  | 25                 | 20                 |
| White  | 35                 | 40                 |

Step 6: Calculate Chi-Square (χ2) Statistic

χ2 = Σ (O − E)² / E

Where O is the observed frequency, E is the expected frequency, and the sum runs over all categories.

For the given example, calculate the contribution of each category and sum the contributions to get the chi-square value. The table below shows the calculation and the resulting contribution for each category.

| Colour | Calculation     | Chi-Square Value |
|--------|-----------------|------------------|
| Orange | (60 − 60)² / 60 | 0                |
| Black  | (25 − 20)² / 20 | 1.25             |
| White  | (35 − 40)² / 40 | 0.625            |

χ2 = 0 + 1.25 + 0.625

χ2 = 1.875
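The same sum can be reproduced in a few lines. This sketch assumes the observed and expected counts from the tables above and uses illustrative variable names; it also derives the degrees of freedom from the number of categories, as the next step describes.

```python
observed = {"orange": 60, "black": 25, "white": 35}
expected = {"orange": 60, "black": 20, "white": 40}

# Contribution of each category: (O - E)^2 / E.
contributions = {
    colour: (observed[colour] - expected[colour]) ** 2 / expected[colour]
    for colour in observed
}

chi_square = sum(contributions.values())
df = len(observed) - 1  # number of categories minus one

print(contributions)
print(chi_square, df)
```

A category whose observed count equals its expected count, such as orange here, contributes nothing to the statistic.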

Step 7: Determine Degrees of Freedom

Degrees of freedom (df) is calculated as df = n − 1,

Where, n is the number of categories.

For our example,

df = 3 – 1 = 2.

Step 8: Find the Critical Value or P-value

Using a chi-square distribution table or statistical software, find the critical value or p-value corresponding to the degrees of freedom and the chosen significance level.

For df = 2 at a significance level of 0.05, the critical chi-square value is approximately 5.99, and the p-value for the calculated statistic is approximately 0.39.

Step 9: Make a Decision

Compare the calculated chi-square value with the critical value or use the p-value to determine whether to reject the null hypothesis.

If the calculated chi-square value is greater than the critical value or the p-value is less than the significance level, reject the null hypothesis.

In our example, the calculated chi-square value (1.875) is less than the critical value (5.99) and the p-value (about 0.39) is greater than the significance level, so we fail to reject the null hypothesis: there is not enough evidence to conclude that the proportion of Monarch butterflies with each colour variant deviates from the suspected 3:1:2 ratio at the 5% significance level.
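The two decision criteria, comparing the statistic with the critical value and comparing the p-value with the significance level, always lead to the same answer. The sketch below illustrates this for the butterfly example. It is restricted to df = 2, where the upper-tail probability of the chi-square distribution is exp(−x/2); the function name `decide_df2` is hypothetical.

```python
import math


def decide_df2(chi_square, alpha=0.05):
    """Apply both decision criteria for a chi-square test with df = 2."""
    critical = -2 * math.log(alpha)      # value exceeded with probability alpha
    p_value = math.exp(-chi_square / 2)  # P(X >= chi_square) when df = 2

    reject_by_critical = chi_square > critical
    reject_by_p_value = p_value < alpha
    # Both rules describe the same event, so they must agree.
    assert reject_by_critical == reject_by_p_value

    decision = "reject H0" if reject_by_p_value else "fail to reject H0"
    return decision, p_value


decision, p_value = decide_df2(1.875)
print(decision, round(p_value, 3))
```

For a statistic of 1.875 the function reports that the null hypothesis is not rejected, in line with the conclusion above.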
