*This is part of my coursework for Data Analysis Tools.*

I am using Python to analyse the data available from Gapminder. I want to compare the suicide rates and the polity scores for the countries in the data set.

## Categorizing the variables

Both the response variable (polity score) and the explanatory variable (suicide rate per 100,000) are quantitative. To perform a Chi-square test of independence, I had to categorize both variables.

I created 5 roughly equal categories for suicide rate:

- Very low: 0 to 4.5 suicides per 100,000 (36 countries)
- Low: 4.5 to 7 suicides per 100,000 (38 countries)
- Medium: 7 to 10 suicides per 100,000 (40 countries)
- High: 10 to 14 suicides per 100,000 (42 countries)
- Very high: >14 suicides per 100,000 (35 countries)
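The binning above can be sketched with `pandas.cut`. This is a minimal illustration on made-up rates (the values in `rates` are hypothetical, not Gapminder data); the bin edges are the ones listed above, with 40 assumed as an upper cap for the "very high" group. By default `pandas.cut` closes each interval on the right, so a rate of exactly 4.5 lands in "very low" and exactly 10 lands in "medium".

```python
import pandas as pd

# Hypothetical suicide rates per 100,000 (illustrative only, not Gapminder data)
rates = pd.Series([2.1, 4.5, 6.3, 9.9, 10.0, 13.2, 25.0])

# Bin edges from the category list; intervals are (0, 4.5], (4.5, 7], ...
bins = [0, 4.5, 7, 10, 14, 40]
labels = ["very low", "low", "medium", "high", "very high"]

suicide_cat = pd.cut(rates, bins=bins, labels=labels)
print(suicide_cat.tolist())
```

Rates above the last edge (40 here) would become missing values, so the cap should sit above the largest rate in the data.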

I created a binary categorical variable for polity score, dividing countries between “democratic” and “not democratic” (based on the Polity IV project):

- 1: Democratic: 6 to 10 (88 countries)
- 0: Not democratic: -10 to 5 (71 countries)
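The same split can be sketched with `pandas.cut` on a handful of made-up polity scores (the `polity` values below are hypothetical). The edges -11, 5 and 11 give the right-closed intervals (-11, 5] and (5, 11], which match the -10 to 5 and 6 to 10 ranges above for integer scores.

```python
import pandas as pd

# Hypothetical polity scores, including one missing value
polity = pd.Series([-10, -3, 5, 6, 10, None])

# 0 = not democratic (-10 to 5), 1 = democratic (6 to 10)
democracy = pd.cut(polity, bins=[-11, 5, 11], labels=[0, 1])
print(democracy.tolist())
```

One reason to prefer `pandas.cut` over a plain comparison such as `polity >= 6` is that a missing score stays missing instead of being silently counted as "not democratic".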

## Contingency tables

Counts:

Column percentages:
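Both kinds of table can be built with `pandas.crosstab`. The sketch below uses a small toy data frame (the eight rows in `toy` are invented for illustration and are not the Gapminder results): the response variable goes in the rows, the explanatory variable in the columns, and each column is divided by its own total to get column percentages.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "democracy":   [0, 0, 1, 1, 1, 0, 1, 1],
    "suicide_cat": [1, 2, 1, 2, 2, 1, 1, 2],
})

# Observed counts: rows = response (democracy), columns = explanatory (suicide_cat)
counts = pd.crosstab(toy["democracy"], toy["suicide_cat"])

# Column proportions: each column sums to 1
col_pct = counts / counts.sum(axis=0)

print(counts)
print(col_pct)
```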

## Chi square

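The Chi-square test of independence showed no significant association between suicide rate category and whether a country was a democracy. The Chi-square value was 3.17 with 4 degrees of freedom and the p-value was 0.53, so we fail to reject the null hypothesis of no association. The expected counts in the read-out sum to 71 and 88 by row, so the test covers the 159 countries that have both a suicide rate and a polity score.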

Program read-out:

    chi-square value, p value, expected counts
    (3.1732413013999596, 0.52926374689586475, 4, array([[ 12.50314465, 12.50314465, 15.18238994, 16.52201258, 14.28930818],
           [ 15.49685535, 15.49685535, 18.81761006, 20.47798742, 17.71069182]]))
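As a cross-check on the read-out, the p-value and the expected counts can be recomputed by hand. This is a sketch under two assumptions: the row totals are the 71 and 88 countries from the democracy categories, and the column totals (28, 28, 34, 37, 32) are inferred by summing each column of expected counts in the read-out, since they are not stated elsewhere. The degrees of freedom are (rows - 1) × (columns - 1), and each expected count is row total × column total / grand total.

```python
import numpy as np
from scipy import stats

# Statistic reported in the read-out
chi2_value = 3.1732413013999596

# 2 rows (democracy) by 5 columns (suicide categories)
dof = (2 - 1) * (5 - 1)

# Upper-tail probability of the chi-square distribution
p_value = stats.chi2.sf(chi2_value, dof)

# Marginal totals; column totals are inferred from the read-out (assumption)
row_totals = np.array([71, 88])
col_totals = np.array([28, 28, 34, 37, 32])

# Expected count under independence: row total * column total / grand total
expected = np.outer(row_totals, col_totals) / row_totals.sum()

print(dof, p_value)
print(expected)
```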

## Python program

    # import data analysis packages
    import pandas
    import numpy
    import scipy.stats
    import seaborn
    import matplotlib.pyplot as plt

    # bug fix for display formats to avoid run time errors
    pandas.set_option('display.float_format', lambda x: '%f' % x)

    # import the entire data set to memory
    data = pandas.read_csv('mynewgapminder.csv', low_memory=False)

    # ensure that variables are numeric
    # (pandas.to_numeric replaces convert_objects, which is gone from current pandas)
    data['suicideper100th'] = pandas.to_numeric(data['suicideper100th'], errors='coerce')
    data['alcconsumption'] = pandas.to_numeric(data['alcconsumption'], errors='coerce')
    data['polityscore'] = pandas.to_numeric(data['polityscore'], errors='coerce')

    # I only want to look at countries where stats exist for suicide rate
    # get subset where suicide rates exist
    sub1 = data[data['suicideper100th'] > 0]
    sub2 = sub1.copy()

    # number of observations (rows)
    print('Number of countries')
    print(len(sub2))

    # categorize suicide rates into 'very low', 'low', 'medium', 'high', and 'very high'
    # sub2['suicide_cat'] = pandas.cut(sub2.suicideper100th, [0, 4.5, 7, 10, 14, 40], labels=['very low', 'low', 'medium', 'high', 'very high'])
    sub2['suicide_cat'] = pandas.cut(sub2.suicideper100th, [0, 4.5, 7, 10, 14, 40], labels=[1, 2, 3, 4, 5])
    # plain Python object replaces numpy.object, which newer NumPy no longer provides
    sub2['suicide_cat'] = sub2.suicide_cat.astype(object)

    # categorize polity score into 'democracy' (1) and 'not democracy' (0)
    # sub2['democracy'] = pandas.cut(sub2.polityscore, [-11, 5, 11], labels=['non-democratic', 'democratic'])
    sub2['democracy'] = pandas.cut(sub2.polityscore, [-11, 5, 11], labels=[0, 1])
    sub2['democracy'] = sub2.democracy.astype(object)

    # show frequency table for democracy
    print('Frequency table for democracy')
    c1 = sub2['democracy'].value_counts(sort=False, dropna=False)
    print(c1)
    print()

    # show frequency table for suicide rate
    print('Frequency table for suicide rate')
    c2 = sub2['suicide_cat'].value_counts(sort=False, dropna=False)
    print(c2)
    print()

    # contingency table of observed counts
    # (crosstab drops countries with a missing value in either variable)
    ct1 = pandas.crosstab(sub2['democracy'], sub2['suicide_cat'])
    print(ct1)

    # column percentages
    colsum = ct1.sum(axis=0)
    colpct = ct1 / colsum
    print(colpct)

    # chi-square
    print('chi-square value, p value, expected counts')
    cs1 = scipy.stats.chi2_contingency(ct1)
    print(cs1)
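For the numeric-conversion step in the program, `pandas.to_numeric` with `errors="coerce"` is the documented way in current pandas to turn entries that cannot be parsed into missing values (the older `convert_objects` method is no longer available in recent releases). A small sketch on made-up strings; the contents of `raw` are hypothetical and only stand in for a column read from a CSV file:

```python
import pandas as pd

# Hypothetical raw column as it might arrive from a CSV file
raw = pd.Series(["12.5", "n/a", "7"])

# Unparseable entries become NaN instead of raising an error
numeric = pd.to_numeric(raw, errors="coerce")
print(numeric)
```

Rows that end up as NaN are then excluded by the `> 0` filter and by `pandas.crosstab`, which is why the analysis only covers countries with usable values.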