Running a Chi-Square Test

This is part of my coursework for Data Analysis Tools.

I am using Python to analyse the data available from Gapminder. I want to compare the suicide rates and the polity scores for the countries in the data set.

Categorizing the variables

Both the response variable (polity score) and the explanatory variable (suicide rate per 100000) are quantitative. A Chi-square test compares categorical variables, so I had to categorize both of them.

I created 5 roughly equal categories for suicide rate:

  1. Very low: 0 to 4.5 suicides per 100000 (36 countries)
  2. Low: 4.5 to 7 suicides per 100000 (38 countries)
  3. Medium: 7 to 10 suicides per 100000 (40 countries)
  4. High: 10 to 14 suicides per 100000 (42 countries)
  5. Very high: >14 suicides per 100000 (35 countries)

I created a binary categorical variable for polity score, dividing countries between “democratic” and “not democratic” (based on the Polity IV project):

  • 1: Democratic: 6 to 10 (88 countries)
  • 0: Not democratic: -10 to 5 (71 countries)
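
As a rough illustration of how these cut points behave, here is a minimal sketch using pandas.cut on a few made-up values (the numbers in `toy` are hypothetical and are not Gapminder data). By default pandas.cut treats each bin as closed on the right, so a rate of exactly 4.5 lands in "very low" and a polity score of exactly 5 lands in "not democratic". The upper edge of 40 for "very high" is taken from the program further down.

```python
import pandas

# Hypothetical values chosen only to show where the bin edges fall;
# these are NOT rows from the Gapminder data set.
toy = pandas.DataFrame({
    'suicideper100th': [2.0, 4.5, 6.9, 10.0, 13.2, 25.0],
    'polityscore': [-7, 5, 6, 10, 0, 8],
})

# Bins are right-inclusive: (0, 4.5], (4.5, 7], (7, 10], (10, 14], (14, 40]
toy['suicide_cat'] = pandas.cut(
    toy['suicideper100th'],
    bins=[0, 4.5, 7, 10, 14, 40],
    labels=['very low', 'low', 'medium', 'high', 'very high'],
)

# (-11, 5] -> 0 (not democratic), (5, 11] -> 1 (democratic)
toy['democracy'] = pandas.cut(toy['polityscore'], bins=[-11, 5, 11], labels=[0, 1])

print(toy)
```

Values outside the outermost edges (for example a rate of exactly 0) would come back as missing rather than being placed in a category.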

Contingency tables




Column percentages:


Chi square

The Chi-square test revealed that suicide rate was not significantly associated with whether a country was a democracy. The Chi-square value was 3.17 with 4 degrees of freedom and the p-value was 0.53, so we fail to reject the null hypothesis of no association in this instance.
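
To make the decision rule concrete, here is a small sketch of the same kind of test on a 2 x 5 table. The observed counts are invented for illustration (the real contingency table is not reproduced here), and the 0.05 significance level is an assumption rather than something stated in the coursework. It also recomputes the statistic by hand as the sum of (observed - expected)^2 / expected, which should agree with scipy for a table of this shape.

```python
import numpy
import scipy.stats

# Hypothetical observed counts, NOT the Gapminder counts.
# Rows: not democratic, democratic. Columns: five suicide-rate categories.
observed = numpy.array([
    [10, 14, 16, 15, 16],
    [18, 14, 18, 22, 16],
])

chi2, p, dof, expected = scipy.stats.chi2_contingency(observed)

# Same statistic computed directly from its definition
chi2_by_hand = ((observed - expected) ** 2 / expected).sum()

alpha = 0.05  # assumed significance level
print('chi-square = %.3f, dof = %d, p = %.3f' % (chi2, dof, p))
print('by hand    = %.3f' % chi2_by_hand)
print('reject the null' if p < alpha else 'fail to reject the null')
```

The degrees of freedom come out as (2 - 1) x (5 - 1) = 4, the same value that appears in the read-out for the real data.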

Program read-out:

chi-square value, p value, expected counts
(3.1732413013999596, 0.52926374689586475, 4, array([[ 12.50314465, 12.50314465, 15.18238994, 16.52201258,
[ 15.49685535, 15.49685535, 18.81761006, 20.47798742,
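
The expected counts in the read-out can be checked by hand: under the null hypothesis each one is (row total x column total) / N. The sketch below assumes the row totals are the 71 "not democratic" and 88 "democratic" countries listed earlier, and infers the column totals by adding the two expected counts shown in each visible column (for example 12.503 + 15.497 = 28). The fifth column is cut off in the read-out, so it is left out.

```python
import numpy

# Row totals from the democracy categories above: not democratic, democratic
row_totals = numpy.array([71, 88])
n = row_totals.sum()  # 159 countries with a polity score

# Column totals inferred from the read-out (sum of the two expected counts
# in each visible column); the truncated fifth column is omitted.
col_totals = numpy.array([28, 28, 34, 37])

# Expected count for each cell = row total * column total / N
expected = numpy.outer(row_totals, col_totals) / n
print(expected)
```

The first row of the result should line up with the first row of the array printed by the program.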

Python program

#import data analysis packages
import pandas
import numpy
import scipy.stats
import seaborn
import matplotlib.pyplot as plt

# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%f' % x)

#import the entire data set to memory
data = pandas.read_csv('mynewgapminder.csv', low_memory=False)

#ensure that variables are numeric (non-numeric entries become NaN)
data['suicideper100th'] = pandas.to_numeric(data['suicideper100th'], errors='coerce')
data['alcconsumption'] = pandas.to_numeric(data['alcconsumption'], errors='coerce')
data['polityscore'] = pandas.to_numeric(data['polityscore'], errors='coerce')

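##I only want to look at countries where stats exist for suicide rate
##get subset where suicide rates exist
# assumed reconstruction of the subset step described by the comments above
sub1 = data[data['suicideper100th'].notnull()]
sub2 = sub1.copy()

#Number of observations (rows)
print('Number of countries')
print(len(sub2))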

# categorize suicide rates into 'very low', 'low', 'medium', 'high', and 'very high'
#sub2['suicide_cat'] = pandas.cut(sub1.suicideper100th, [0, 4.5, 7, 10, 14, 40], labels=['very low', 'low', 'medium', 'high', 'very high'])
sub2['suicide_cat'] = pandas.cut(sub1.suicideper100th, [0, 4.5, 7, 10, 14, 40], labels=[1,2,3,4,5])
sub2['suicide_cat'] = sub2.suicide_cat.astype(object)

# categorize polity score into 'democracy' (1) and 'not democracy' (0)
#sub2['democracy'] = pandas.cut(sub1.polityscore, [-11, 5, 11], labels=['non-democratic', 'democratic'])
sub2['democracy'] = pandas.cut(sub1.polityscore, [-11, 5, 11], labels=[0,1])
sub2['democracy'] = sub2.democracy.astype(object)

#show frequency table for democracy
print('Frequency table for democracy')
c1 = sub2['democracy'].value_counts(sort=False, dropna=False)
print(c1)

#show frequency table for suicide rate
print('Frequency table for suicide rate')
c2 = sub2['suicide_cat'].value_counts(sort=False, dropna=False)
print(c2)

# contingency table of observed counts
ct1 = pandas.crosstab(sub2['democracy'], sub2['suicide_cat'])
print(ct1)

# column percentages (each column divided by its total)
colsum = ct1.sum(axis=0)
colpct = ct1 / colsum
print(colpct)

# chi-square
print('chi-square value, p value, expected counts')
cs1 = scipy.stats.chi2_contingency(ct1)
print(cs1)

