Running a Chi-Square Test

This is part of my coursework for Data Analysis Tools.

I am using Python to analyse the data available from Gapminder. I want to compare the suicide rates and the polity scores for the countries in the data set.

Categorizing the variables

Both the response variable (polity score) and the explanatory variable (suicide rate per 100,000) are quantitative variables. In order to perform a Chi-square test, I had to categorize both of these variables.

I created 5 roughly equal categories for suicide rate:

  1. Very low: 0 to 4.5 suicides per 100,000 (36 countries)
  2. Low: 4.5 to 7 suicides per 100,000 (38 countries)
  3. Medium: 7 to 10 suicides per 100,000 (40 countries)
  4. High: 10 to 14 suicides per 100,000 (42 countries)
  5. Very high: more than 14 suicides per 100,000 (35 countries)

I created a binary categorical variable for polity, dividing countries into “democratic” and “not democratic” (based on the Polity IV Project); a minimal sketch of both binnings follows the list:

  • 1: Democratic: 6 to 10 (88 countries)
  • 0: Not democratic: -10 to 5 (71 countries)
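Here is a minimal sketch of how both variables can be binned with pandas.cut. It assumes the same CSV file and column names as the full program further down the page; the text labels are only for illustration (the program itself uses numeric labels).

#minimal binning sketch (assumes the same file and column names as the full program below)
import pandas

data = pandas.read_csv('mynewgapminder.csv', low_memory=False)
data['suicideper100th'] = pandas.to_numeric(data['suicideper100th'], errors='coerce')
data['polityscore'] = pandas.to_numeric(data['polityscore'], errors='coerce')

#five suicide-rate bands: (0, 4.5], (4.5, 7], (7, 10], (10, 14], (14, 40]
data['suicide_cat'] = pandas.cut(data['suicideper100th'], [0, 4.5, 7, 10, 14, 40],
                                 labels=['very low', 'low', 'medium', 'high', 'very high'])

#binary polity variable: (-11, 5] = not democratic, (5, 11] = democratic
data['democracy'] = pandas.cut(data['polityscore'], [-11, 5, 11],
                               labels=['not democratic', 'democratic'])

print(data['suicide_cat'].value_counts(sort=False, dropna=False))
print(data['democracy'].value_counts(sort=False, dropna=False))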

Contingency tables

Counts:

 

[Image: contingency table of observed counts, suicide rate category by democracy]

Column percentages:

[Image: contingency table of column percentages, suicide rate category by democracy]

Chi-square

The Chi-square test revealed that suicide rate was not significantly associated with whether a country was a democracy. The Chi-square value was 3.17 and the p-value was 0.53, so we fail to reject the null hypothesis of no association in this instance.

Program read-out:

chi-square value, p value, expected counts
(3.1732413013999596, 0.52926374689586475, 4,
 array([[ 12.50314465, 12.50314465, 15.18238994, 16.52201258, 14.28930818],
        [ 15.49685535, 15.49685535, 18.81761006, 20.47798742, 17.71069182]]))
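The tuple printed above can also be unpacked into named values, which makes the read-out easier to interpret. A minimal sketch, assuming the contingency table ct1 built in the program below:

#unpack chi2_contingency's return value: statistic, p-value, degrees of freedom, expected counts
import scipy.stats

chi2, p, dof, expected = scipy.stats.chi2_contingency(ct1)
print('chi-square statistic:', chi2)
print('p-value:', p)
print('degrees of freedom:', dof)
print('expected counts:')
print(expected)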

Python program

#import data analysis packages
import pandas
import numpy
import scipy.stats
import seaborn
import matplotlib.pyplot as plt

# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%f' % x)

#import the entire data set to memory
data = pandas.read_csv('mynewgapminder.csv', low_memory=False)

#ensure that variables are numeric (pandas.to_numeric replaces the older convert_objects)
data['suicideper100th'] = pandas.to_numeric(data['suicideper100th'], errors='coerce')
data['alcconsumption'] = pandas.to_numeric(data['alcconsumption'], errors='coerce')
data['polityscore'] = pandas.to_numeric(data['polityscore'], errors='coerce')

##I only want to look at countries where stats exist for suicide rate
##get subset where suicide rates exist
sub1 = data[data['suicideper100th'] > 0]
sub2 = sub1.copy()

#Number of observations (rows)
print('Number of countries')
print(len(sub2))

# categorize suicide rates into 'very low', 'low', 'medium', 'high', and 'very high'
#sub2['suicide_cat'] = pandas.cut(sub1.suicideper100th, [0, 4.5, 7, 10, 14, 40], labels=['very low', 'low', 'medium', 'high', 'very high'])
sub2['suicide_cat'] = pandas.cut(sub1.suicideper100th, [0, 4.5, 7, 10, 14, 40], labels=[1, 2, 3, 4, 5])
sub2['suicide_cat'] = sub2.suicide_cat.astype(object)

# categorize polity score into 'democracy' (1) and 'not democracy' (0)
#sub2['democracy'] = pandas.cut(sub1.polityscore, [-11, 5, 11], labels=['non-democratic', 'democratic'])
sub2['democracy'] = pandas.cut(sub1.polityscore, [-11, 5, 11], labels=[0, 1])
sub2['democracy'] = sub2.democracy.astype(object)

#show frequency table for democracy
print('Frequency table for democracy')
c2 = sub2['democracy'].value_counts(sort=False, dropna=False)
print(c2)
print()

#show frequency table for suicide rate
print('Frequency table for suicide rate')
c2 = sub2['suicide_cat'].value_counts(sort=False, dropna=False)
print(c2)
print()

# contingency table of observed counts
ct1 = pandas.crosstab(sub2['democracy'], sub2['suicide_cat'])
print(ct1)

# column percentages
colsum = ct1.sum(axis=0)
colpct = ct1 / colsum
print(colpct)

# chi-square
print('chi-square value, p value, expected counts')
cs1 = scipy.stats.chi2_contingency(ct1)
print(cs1)


Running an ANOVA test

This is part of my coursework for Data Analysis Tools.

I am using Python to analyse the data available from Gapminder. I want to compare the suicide rates and the alcohol consumption for the countries in the data set.

Categorizing the explanatory variable

Both the response variable (suicide rate per 100,000) and the explanatory variable (alcohol consumption per adult in litres) are quantitative variables. In order to perform an Analysis of Variance (ANOVA) test, I had to categorize the explanatory variable (alcohol consumption) into the following groups (a short sketch of an alternative, automatic binning follows the frequency table below):

  • Very low: 0 to 2 litres
  • Low: 2 to 5 litres
  • Medium: 5 to 8.5 litres
  • High: 8.5 to 12 litres
  • Very high: >12 litres

There are 191 countries, split roughly equally into these groups:

Frequency table for alcohol consumption
NaN 6
low 40
medium 39
very low 40
very high 32
high 34
dtype: int64
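The hand-picked bin edges above give five roughly equal groups. As an aside (my own suggestion, not part of the assignment), pandas.qcut can choose such edges automatically; a minimal sketch, assuming the same CSV file and column name as the full program below:

#alternative binning sketch: let qcut pick edges that give five roughly equal groups
import pandas

data = pandas.read_csv('mynewgapminder.csv', low_memory=False)
data['alcconsumption'] = pandas.to_numeric(data['alcconsumption'], errors='coerce')

data['alcohol_q5'] = pandas.qcut(data['alcconsumption'], 5,
                                 labels=['very low', 'low', 'medium', 'high', 'very high'])
print(data['alcohol_q5'].value_counts(sort=False, dropna=False))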

ANOVA

The ANOVA revealed that suicide rate and alcohol consumption were significantly associated, with an F-statistic of 5.750 and a p-value of 0.000223.

OLS Regression Results
==============================================================================
Dep. Variable:     suicideper100th    R-squared:             0.113
Model:             OLS                Adj. R-squared:        0.094
Method:            Least Squares      F-statistic:           5.750
Date:              Tue, 10 Nov 2015   Prob (F-statistic):    0.000223
Time:              22:05:06           Log-Likelihood:        -591.49
No. Observations:  185                AIC:                   1193.
Df Residuals:      180                BIC:                   1209.
Df Model:          4
Covariance Type:   nonrobust
===============================================================================================
                                coef    std err        t      P>|t|    [95.0% Conf. Int.]
-----------------------------------------------------------------------------------------------
Intercept                     9.8907      1.029    9.610      0.000      7.860    11.922
C(alcohol_cat)[T.low]        -1.5358      1.400   -1.097      0.274     -4.298     1.227
C(alcohol_cat)[T.medium]     -0.9225      1.408   -0.655      0.513     -3.701     1.856
C(alcohol_cat)[T.very high]   4.0614      1.478    2.748      0.007      1.145     6.978
C(alcohol_cat)[T.very low]   -2.1734      1.400   -1.553      0.122     -4.936     0.589
==============================================================================
Omnibus:               60.043    Durbin-Watson:          0.250
Prob(Omnibus):          0.000    Jarque-Bera (JB):     143.003
Skew:                   1.436    Prob(JB):            8.86e-32
Kurtosis:               6.209    Cond. No.                6.06
==============================================================================
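The F-statistic and its p-value can also be read directly from the fitted results object instead of the printed summary. A minimal sketch, assuming the model2 object fitted in the program below:

#read the overall F-test straight from the fitted OLS results (model2 comes from the program below)
print('F-statistic:', model2.fvalue)
print('Prob (F-statistic):', model2.f_pvalue)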

Post-hoc Analysis

Post-hoc comparisons (Tukey HSD) revealed that countries in the “very high” alcohol consumption group, consuming more than 12 litres of alcohol per adult, had a significantly higher suicide rate than countries in the “medium”, “low”, and “very low” groups. No other comparisons were significant.

Multiple Comparison of Means - Tukey HSD, FWER=0.05
===========================================================
  group1     group2     meandiff    lower      upper   reject
-----------------------------------------------------------
    high        low      -1.5358   -5.3935    2.3219   False
    high     medium      -0.9225   -4.8029    2.9579   False
    high  very high       4.0614   -0.0119    8.1346   False
    high   very low      -2.1734   -6.0311    1.6843   False
     low     medium       0.6133   -3.1083    4.3349   False
     low  very high       5.5972    1.6748    9.5195    True
     low   very low      -0.6376   -4.3356    3.0604   False
  medium  very high       4.9839    1.0392    8.9285    True
  medium   very low      -1.2509   -4.9725    2.4707   False
very high  very low      -6.2348  -10.1571   -2.3124    True
-----------------------------------------------------------

Python program

#import data analysis packages
import pandas
import numpy
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%f' % x)

#import the entire data set to memory
data = pandas.read_csv('mynewgapminder.csv', low_memory=False)

#ensure that variables are numeric (pandas.to_numeric replaces the older convert_objects)
data['suicideper100th'] = pandas.to_numeric(data['suicideper100th'], errors='coerce')
data['alcconsumption'] = pandas.to_numeric(data['alcconsumption'], errors='coerce')

#I only want to look at countries where stats exist for suicide rate
#get subset where suicide rates exist
sub1 = data[data['suicideper100th'] > 0]
sub2 = sub1.copy()

#Number of observations (rows)
print('Number of countries')
print(len(sub2))

# categorize alcohol consumption into 'very low', 'low', 'medium', 'high', and 'very high'
sub2['alcohol_cat'] = pandas.cut(sub1.alcconsumption, [0, 2, 5, 8.5, 12, 25], labels=['very low', 'low', 'medium', 'high', 'very high'])
sub2['alcohol_cat'] = sub2.alcohol_cat.astype(object)

#show frequency table for alcohol consumption categories
print('Frequency table for alcohol consumption')
c2 = sub2['alcohol_cat'].value_counts(sort=False, dropna=False)
print(c2)
print()

#create new subgroup for suicide and alcohol
sub3 = sub2[['suicideper100th', 'alcohol_cat']].dropna()

# using ols function for calculating the F-statistic and associated p value
model2 = smf.ols(formula='suicideper100th ~ C(alcohol_cat)', data=sub3).fit()
print(model2.summary())

#print means
print('means for suicideper100th by alcohol consumption')
m1 = sub3.groupby('alcohol_cat').mean()
print(m1)

#print standard deviations
print('standard deviations for suicideper100th by alcohol consumption')
sd1 = sub3.groupby('alcohol_cat').std()
print(sd1)

#perform post-hoc analysis
mc1 = multi.MultiComparison(sub3['suicideper100th'], sub3['alcohol_cat'])
res1 = mc1.tukeyhsd()
print(res1.summary())

Creating Graphs for Data

This is part of my coursework for Data Management and Visualization.

I am using Python to analyse the data available from Gapminder. Following on from last week’s assignment, I had to create graphs to display the variables and the relationships between them.

Python program

#import data analysis packages
import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt

# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%f' % x)

#import the entire data set to memory
data = pandas.read_csv('mynewgapminder.csv', low_memory=False)

#ensure that variables are numeric (pandas.to_numeric replaces the older convert_objects)
data['suicideper100th'] = pandas.to_numeric(data['suicideper100th'], errors='coerce')
data['alcconsumption'] = pandas.to_numeric(data['alcconsumption'], errors='coerce')
data['polityscore'] = pandas.to_numeric(data['polityscore'], errors='coerce')

#I only want to look at countries where stats exist for suicide rate
#get subset where suicide rates exist
sub1 = data[data['suicideper100th'] > 0]
sub2 = sub1.copy()

#first look at suicide rates
print("Suicide rate per 100,000 population, age adjusted")
print('-------------------------')
#print a description, including count, mean, standard deviation, min, max, and percentiles
desc1 = sub2['suicideper100th'].describe()
print(desc1)
print()

#Univariate histogram for suicide rate
#give the figure a unique number so it will display as a separate graph
plt.figure(101)
#plot suicide rate as a histogram
seaborn.distplot(sub2['suicideper100th'].dropna(), kde=False)
plt.xlabel('Suicides per 100,000 population')
plt.ylabel('Number of countries')
plt.title('Histogram for suicide rate')

#next look at alcohol consumption
print("Alcohol consumption per adult (age 15+), in litres")
print('-------------------------')
#print a description, including count, mean, standard deviation, min, max, and percentiles
desc2 = sub2['alcconsumption'].describe()
print(desc2)
print()

#Univariate histogram for alcohol consumption
#give the figure a unique number so it will display as a separate graph
plt.figure(102)
#plot alcohol consumption as a histogram
seaborn.distplot(sub2['alcconsumption'].dropna(), kde=False)
plt.xlabel('Alcohol consumption per adult in litres')
plt.ylabel('Number of countries')
plt.title('Alcohol consumption')

##next look at polity score
print("Polity (democracy) score")
print('-------------------------')
#print a description, including count, mean, standard deviation, min, max, and percentiles
desc3 = sub2['polityscore'].describe()
print(desc3)
print()
#Univariate histogram for polity score
#give the figure a unique number so it will display as a separate graph
plt.figure(103)
#plot polity score as a histogram
seaborn.distplot(sub2['polityscore'].dropna(), kde=False)
plt.xlabel('Polity score')
plt.ylabel('Number of countries')
plt.title('Polity score')

#Now look at relationship between suicide and alcohol
#basic scatterplot: Q->Q
#give the figure a unique number so it will display as a separate graph
plt.figure(104)
#plot alcohol consumption and suicide rate as a scatterplot
scat1 = seaborn.regplot(x="alcconsumption", y="suicideper100th", data=data)
plt.xlabel('Alcohol consumption per adult in litres')
plt.ylabel('Suicide rate per 100,000 population')
plt.title('Scatterplot for the Association Between Alcohol Consumption and Suicide Rate')

# quartile split (use qcut function & ask for 4 groups - gives you quartile split)
print('Alcohol rate - 4 categories - quartiles')
data['AlcoholGRP'] = pandas.qcut(data.alcconsumption, 4, labels=["25th%tile", "50%tile", "75%tile", "100%tile"])
c10 = data['AlcoholGRP'].value_counts(sort=False, dropna=True)
print(c10)
print()

# bivariate bar graph C->Q
#give the figure a unique number so it will display as a separate graph
plt.figure(205)
#plot alcohol consumption as quartiles and display against suicide rate
seaborn.factorplot(x='AlcoholGRP', y='suicideper100th', data=data, kind="bar", ci=None)
plt.xlabel('Alcohol consumption per adult in litres')
plt.ylabel('Suicide rate per 100,000 population')
plt.title('Bar Chart for the Association Between Alcohol Consumption and Suicide Rate')

Suicide rate

Description of the suicideper100th variable:

Suicide rate per 100,000 population, age adjusted
-------------------------
count 191.000000
mean 9.640839
std 6.300178
min 0.201449
25% 4.988449
50% 8.262893
75% 12.328551
max 35.752872
Name: suicideper100th, dtype: float64

The univariate graph for suicide rate:

[Image: univariate histogram of suicide rate]

This graph is unimodal, with its highest peak at around 7.5 to 10 suicides per 100,000 of the population, which is close to the mean of 9.64. The distribution is skewed to the right: most countries have low suicide rates, with a long tail of countries at higher rates.

Alcohol consumption

Description of the alcconsumption variable:

Alcohol consumption per adult (age 15+), in litres
-------------------------
count 185.000000
mean 6.689730
std 4.908841
min 0.030000
25% 2.560000
50% 5.920000
75% 9.860000
max 23.010000
Name: alcconsumption, dtype: float64

The univariate graph for alcohol consumption:

[Image: univariate histogram of alcohol consumption]

This graph is bimodal, with one peak at 0 to 2.5 litres consumed per adult, and a second peak at 7.5 to 10 litres. The mean is 6.69 litres. The distribution is skewed to the right: more countries have low alcohol intake, with a tail of countries at higher consumption levels.

Polity score

Description of the polityscore variable:

Polity (democracy) score
-------------------------
count 159.000000
mean 3.616352
std 6.320350
min -10.000000
25% -2.000000
50% 6.000000
75% 9.000000
max 10.000000
Name: polityscore, dtype: float64

The univariate graph for polity score:

[Image: univariate histogram of polity score]

This graph is unimodal, with a peak at 10. It is highly skewed to the left: more countries have high polity scores (democracies) than low ones (autocracies), so the long tail lies at the autocratic end of the scale.

Relationship between suicide rate and alcohol intake

My initial hypothesis was that a high alcohol rate would be correlated with a high suicide rate. In this case, the alcohol consumption rate is the explanatory variable that goes on the X axis, while the suicide rate is the response variable that goes on the Y axis.

Scatter plot for alcohol consumption and suicide:

[Image: scatterplot of alcohol consumption against suicide rate]

This scatterplot shows a weak positive relationship between alcohol consumption and suicide rate. In other words, countries that consume high quantities of alcohol show a slight tendency to have a high suicide rate.
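To put a number on “weak positive”, the Pearson correlation coefficient can be calculated directly. This is a minimal sketch and my own addition (the graphing program above does not compute it), assuming the same CSV file and column names:

#quantify the scatterplot: Pearson correlation between alcohol consumption and suicide rate
import pandas
import scipy.stats

data = pandas.read_csv('mynewgapminder.csv', low_memory=False)
data['suicideper100th'] = pandas.to_numeric(data['suicideper100th'], errors='coerce')
data['alcconsumption'] = pandas.to_numeric(data['alcconsumption'], errors='coerce')

#pearsonr needs rows where both values are present
both = data[['alcconsumption', 'suicideper100th']].dropna()
r, p = scipy.stats.pearsonr(both['alcconsumption'], both['suicideper100th'])
print('Pearson r:', r)
print('p-value:', p)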

Bar chart for alcohol consumption (in quartiles) and suicide:

[Image: bar chart of mean suicide rate by alcohol consumption quartile]

The bar chart shows the relationship in further detail. In the lower three quartiles, there is little relationship between alcohol consumption and suicide rate. However, there is a definite trend for countries in the highest quartile for alcohol consumption to have a high rate of suicide.

Further research is required before any definite conclusions can be drawn.

Making Data Management Decisions

This is part of my coursework for Data Management and Visualization.

I am using Python to analyse the data available from Gapminder. Following on from last week’s assignment, I had to decide how I wanted to manage the variables suicideper100th, alcconsumption, polityscore, and region.

Here is the Python program:

#import data analysis packages
import pandas
import numpy

# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%f' % x)

#import the entire data set to memory
data = pandas.read_csv('mynewgapminder.csv', low_memory=False)

#ensure that variables are numeric (pandas.to_numeric replaces the older convert_objects)
data['suicideper100th'] = pandas.to_numeric(data['suicideper100th'], errors='coerce')
data['alcconsumption'] = pandas.to_numeric(data['alcconsumption'], errors='coerce')
data['polityscore'] = pandas.to_numeric(data['polityscore'], errors='coerce')
data['region'] = pandas.to_numeric(data['region'], errors='coerce')

#I only want to look at countries where stats exist for suicide rate
#get subset where suicide rates exist
sub1 = data[data['suicideper100th'] > 0]
sub2 = sub1.copy()

#Number of observations (rows)
print("Number of countries:")
print(len(sub2))

#add spacing
print()

#first look at suicide rates
print("Suicide rate per 100,000 population, age adjusted")
print('-------------------------')
#find lowest value
minSuicide = min(sub2.suicideper100th)
print('Minimum:', minSuicide)
#find highest value
maxSuicide = max(sub2.suicideper100th)
print('Maximum:', maxSuicide)
#find median
medianSuicide = numpy.median(sub2.suicideper100th)
print('Average (median):', medianSuicide)
print()
# categorize suicide rate based on customized splits using cut function
# put the suicide rate into "bins" for values between 0 and 40
print('Frequency table for suicide rate')
sub2['suicideSplit'] = pandas.cut(sub2.suicideper100th, [0, 5, 10, 15, 20, 25, 30, 35, 40])
c1 = sub2['suicideSplit'].value_counts(sort=False, dropna=True)
print(c1)
print()
#percentage distribution for suicide rate
print("Percentage table for suicide rate")
p1 = sub2['suicideSplit'].value_counts(sort=False) * 100 / len(sub2)
print(p1)
print()

#next look at alcohol consumption
print("Alcohol consumption per adult (age 15+), in litres")
print('-------------------------')
#find lowest value
minAlcohol = min(sub2.alcconsumption)
print('Minimum:', minAlcohol)
#find highest value
maxAlcohol = max(sub2.alcconsumption)
print('Maximum:', maxAlcohol)
#find median
medianAlcohol = numpy.median(sub2.alcconsumption)
print('Average (median):', medianAlcohol)
print()
# categorize alcohol consumption rate based on customized splits using cut function
#put the alcohol consumption rate into "bins" for values between 0 and 25
#also show blank (NaN) values
print('Frequency table for alcohol consumption')
sub2['alcoholSplit'] = pandas.cut(sub2.alcconsumption, [0, 2.5, 5, 7.5, 10, 12.5, 15, 17.5, 20, 22.5, 25])
c2 = sub2['alcoholSplit'].value_counts(sort=False, dropna=False)
print(c2)
print()
#percentage distribution for alcohol consumption rate
print("Percentage table for alcohol consumption")
p2 = sub2['alcoholSplit'].value_counts(sort=False, dropna=False) * 100 / len(sub2)
print(p2)
print()

#next look at polity score
print("Polity (democracy) score")
print('-------------------------')
#describe categories from the Polity IV Project
print("-10 to -6 = Autocracy, -5 to 0 = Closed Anocracy, 1 to 5 = Open Anocracy, 6 to 10 = Democracy, 10 = Full Democracy")
print()
#find lowest value
minPolity = min(sub2.polityscore)
print('Minimum:', minPolity)
#find highest value
maxPolity = max(sub2.polityscore)
print('Maximum:', maxPolity)
#find median
medianPolity = numpy.median(sub2.polityscore)
print('Average (median):', medianPolity)
print()
##display counts and percentages for polity score categories
#also show blank (NaN) values
print('Frequency table for polity score')
sub2['politySplit'] = pandas.cut(sub2.polityscore, [-11, -6, 0, 5, 9, 10])
c3 = sub2['politySplit'].value_counts(sort=False, dropna=False)
print(c3)
print()
#percentage distribution for polity score
print("Percentage table for polity score")
p3 = sub2['politySplit'].value_counts(sort=False, dropna=False) * 100 / len(sub2)
print(p3)
print()

#next look at regions
print("Geographical regions")
print('-------------------------')
#describe categories
print("Asia=1, Europe=2, Africa=3, Middle East=4, North and Central America=5,South America=6, Oceania=7")
print()
print('Frequency table for region')
c4 = sub2['region'].value_counts(sort=False, dropna=False)
print(c4)
print()
print("Percentage table for region")
p4 = sub2['region'].value_counts(sort=False) * 100 / len(sub2)
print(p4)
print()

#Finally, look at suicide rates only in Asia
print("Suicide rate per 100,000 population, age adjusted (Asia only)")
print('-------------------------')
# get subset of values for Asia
sub3 = sub2[sub2['region'] == 1]
AsiaStats = sub3.copy()
#find lowest value
minSuicide = min(AsiaStats.suicideper100th)
print('Minimum:', minSuicide)
#find highest value
maxSuicide = max(AsiaStats.suicideper100th)
print('Maximum:', maxSuicide)
#find median
medianSuicide = numpy.median(AsiaStats.suicideper100th)
print('Average (median):', medianSuicide)
print()
# categorize suicide rate based on customized splits using cut function
# put the suicide rate into "bins" for values between 0 and 40
print('Frequency table for suicide rate (Asia only)')
AsiaStats['suicideSplit'] = pandas.cut(AsiaStats.suicideper100th, [0, 5, 10, 15, 20, 25, 30, 35, 40])
c5 = AsiaStats['suicideSplit'].value_counts(sort=False, dropna=True)
print(c5)
print()
#percentage distribution for suicide rate
print("Percentage table for suicide rate (Asia only)")
p5 = AsiaStats['suicideSplit'].value_counts(sort=False) * 100 / len(AsiaStats)
print(p5)

When this program is run, it prints content to the IPython console, as described in the following paragraphs.

First, I got a subset of the Gapminder data set, including only those countries that had a value for the suicide rate:

Number of countries:
191

The first variable I looked at was the suicide rate:

Suicide rate per 100,000 population, age adjusted
-------------------------
Minimum: 0.2014487237
Maximum: 35.752872467
Average (median): 8.2628927231

Frequency table for suicide rate
(0, 5] 49
(5, 10] 65
(10, 15] 51
(15, 20] 12
(20, 25] 6
(25, 30] 6
(30, 35] 1
(35, 40] 1
dtype: int64

Percentage table for suicide rate
(0, 5] 25.654450
(5, 10] 34.031414
(10, 15] 26.701571
(15, 20] 6.282723
(20, 25] 3.141361
(25, 30] 3.141361
(30, 35] 0.523560
(35, 40] 0.523560
dtype: float64

The lowest reported suicide rate is 0.2 (to 2 decimal places) and the highest is 35.75. The median value is 8.26.

Because the suicide rate is a continuous variable (a typical value is something like “4.2170763016”), it is useful to put these values into bins with a range of 5 before running a frequency distribution. As we can see, the most common range is between 5 and 10 reported suicides per 100,000 of the population: 65 countries, or 34.03% of the total, fall into this band. Only two countries have more than 30 reported suicides per 100,000 of the population. There are no “not a number” (NaN) values, because I have already stripped out any countries without a recorded suicide rate.

I next looked at the alcohol consumption variable:

Alcohol consumption per adult (age 15+), in litres
-------------------------
Minimum: 0.03
Maximum: 23.01
Average (median): 6.16

Frequency table for alcohol consumption
(0, 2.5] 45
(2.5, 5] 35
(5, 7.5] 31
(7.5, 10] 30
(10, 12.5] 20
(12.5, 15] 13
(15, 17.5] 8
(17.5, 20] 2
(20, 22.5] 0
(22.5, 25] 1
NaN 6
dtype: int64

Percentage table for alcohol consumption
(0, 2.5] 23.560209
(2.5, 5] 18.324607
(5, 7.5] 16.230366
(7.5, 10] 15.706806
(10, 12.5] 10.471204
(12.5, 15] 6.806283
(15, 17.5] 4.188482
(17.5, 20] 1.047120
(20, 22.5] 0.000000
(22.5, 25] 0.523560
NaN 3.141361
dtype: float64

The lowest alcohol consumption rate is 0.03 litres per adult, and the highest is 23.01. The median value is 6.16 litres.

Again, the alcohol consumption figure is a continuous variable, so I put these values into bins with a range of 2.5 litres. As we can see, the most common range is between 0 and 2.5 litres. In 45 countries, or 23.56% of the total, each person over the age of 15 drank an average of between 0 and 2.5 litres of pure alcohol. Only one country has an alcohol consumption rate of over 20 litres. No data is available for 6 countries (3.14%).

I next created a frequency table for the polity (democracy) score:

Polity (democracy) score
-------------------------
-10 to -6 = Autocracy, -5 to 0 = Closed Anocracy, 1 to 5 = Open Anocracy, 6 to 10 = Democracy, 10 = Full Democracy

Minimum: -10.0
Maximum: 10.0
Average (median): 8.0

Frequency table for polity score
(-11, -6] 23
(-6, 0] 29
(0, 5] 19
(5, 9] 56
(9, 10] 32
NaN 32
dtype: int64

Percentage table for polity score
(-11, -6] 12.041885
(-6, 0] 15.183246
(0, 5] 9.947644
(5, 9] 29.319372
(9, 10] 16.753927
NaN 16.753927
dtype: float64

The polity score is based on the 2009 Polity IV Project, which is sponsored by the Political Instability Task Force (PITF). PITF defines countries as democracies, anocracies, or autocracies depending on this score (see http://www.systemicpeace.org/polity/polity4x.htm). I thought it useful to bin the polity scores to correspond with the PITF categories.

We can see that 32 countries, or 16.75% of the total, score as Full Democracies. The most common category is Democracy, with 56 countries or 29.32% of the total. 19 countries (9.95%) scored as Open Anocracies, 29 (15.18%) as Closed Anocracies, and 23 (12.04%) as Autocracies. No data was available for 32 countries (16.75%).

I next created a frequency table for region:

Geographical regions
-------------------------
Asia=1, Europe=2, Africa=3, Middle East=4, North and Central America=5,South America=6, Oceania=7

Frequency table for region
1 31
2 46
3 49
4 16
5 22
6 12
7 15
dtype: int64

Percentage table for region
1 16.230366
2 24.083770
3 25.654450
4 8.376963
5 11.518325
6 6.282723
7 7.853403
dtype: float64

Because these are “dummy” codes for categories rather than measurements, there is no need to calculate medians or to put the values into bins.

The “region” variable did not exist in the original Gapminder data set, but I created it as a potentially useful category for grouping countries. As an example of this, here is a frequency distribution of suicide rates for only Asian countries:

Suicide rate per 100,000 population, age adjusted (Asia only)
-------------------------
Minimum: 1.3809646368
Maximum: 28.1040458679
Average (median): 11.3961114883

Frequency table for suicide rate (Asia only)
(0, 5] 6
(5, 10] 8
(10, 15] 8
(15, 20] 4
(20, 25] 2
(25, 30] 3
(30, 35] 0
(35, 40] 0
dtype: int64

Percentage table for suicide rate (Asia only)
(0, 5] 19.354839
(5, 10] 25.806452
(10, 15] 25.806452
(15, 20] 12.903226
(20, 25] 6.451613
(25, 30] 9.677419
(30, 35] 0.000000
(35, 40] 0.000000
dtype: float64

We can see some differences between the Asian and the worldwide data. Asian countries have a median of 11.4 suicides per 100,000 of the population, much higher than the worldwide median of 8.26. The frequency distribution is also different: a lower percentage of Asian countries have a suicide rate under 10 per 100,000, while a much higher percentage fall in the 15 to 30 range. However, no Asian country has a suicide rate of over 30 per 100,000.

A similar breakdown could be done for other regions or other variables.
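As an illustration, here is a minimal sketch of a helper that repeats the Asia-only breakdown for any region code. The function name and structure are my own, and it assumes the sub2 frame and region coding from the program above:

#hypothetical helper: repeat the Asia-only suicide-rate breakdown for any region code
import pandas

REGION_NAMES = {1: 'Asia', 2: 'Europe', 3: 'Africa', 4: 'Middle East',
                5: 'North and Central America', 6: 'South America', 7: 'Oceania'}

def suicide_breakdown(sub2, region_code):
    #subset of countries (with a recorded suicide rate) in the chosen region
    subset = sub2[sub2['region'] == region_code].copy()
    print('Suicide rate per 100,000 population, age adjusted (%s only)' % REGION_NAMES[region_code])
    print('Minimum:', subset['suicideper100th'].min())
    print('Maximum:', subset['suicideper100th'].max())
    print('Median:', subset['suicideper100th'].median())
    #same bins of width 5 as the worldwide breakdown
    bins = pandas.cut(subset['suicideper100th'], [0, 5, 10, 15, 20, 25, 30, 35, 40])
    print(bins.value_counts(sort=False))
    print(bins.value_counts(sort=False) * 100 / len(subset))

#example: the European countries
#suicide_breakdown(sub2, 2)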

Running my first Python program

This is part of my coursework for Data Management and Visualization.

I am using Python to analyse the data available from Gapminder.

Here is my Python program:

#import data analysis package
import pandas

#import the entire data set to memory
data = pandas.read_csv('mynewgapminder.csv', low_memory=False)

print("Number of countries:")
print(len(data))  # Number of observations (rows)
print("Number of variables:")
print(len(data.columns))  # Number of variables (columns)

#ensure that variables are numeric (pandas.to_numeric replaces the older convert_objects)
data['suicideper100th'] = pandas.to_numeric(data['suicideper100th'], errors='coerce')
data['polityscore'] = pandas.to_numeric(data['polityscore'], errors='coerce')
data['region'] = pandas.to_numeric(data['region'], errors='coerce')

#display counts and percentages for suicide rates
print("counts for suicide rate per 100,000 population, age adjusted")
ct1 = data.groupby('suicideper100th').size()
print(ct1)
print("percentages for suicide rate per 100,000 population, age adjusted")
pt1 = data.groupby('suicideper100th').size() * 100 / len(data)
print(pt1)

#I only want to look at countries where stats exist for suicide rate
#get subset where suicide rates exist
sub1 = data[data['suicideper100th'] > 0]
sub2 = sub1.copy()

#Total number of countries where suicide stats exist
print("Total number of countries where suicide stats exist")
print(len(sub2))

#display counts and percentages for polity score
print("counts for polity score")
ct2 = sub2.groupby('polityscore').size()
print(ct2)
print("percentages for polity score")
pt2 = sub2.groupby('polityscore').size() * 100 / len(data)
print(pt2)

#display counts and percentages for regions
print('counts for region')
print("Asia=1, Europe=2, Africa=3, Middle East=4, North and Central America=5,South America=6, Oceania=7")
ct3 = sub2.groupby('region').size()
print(ct3)
print("percentages for region")
pt3 = sub2.groupby('region').size() * 100 / len(data)
print(pt3)

When this program is run, it prints content to the IPython console, as described in the following paragraphs.

I first looked at the number of countries and variables:

Number of countries:
213
Number of variables:
17

The rows include all the countries from the Gapminder data set. The columns include the 16 variables from the Gapminder data set, plus an extra variable called “region”.

I then produced frequency tables for three variables. Because my research project is looking at factors that influence suicide, the first variable is suicideper100th:

counts for suicide rate per 100,000 population, age adjusted
suicideper100th
0.201449 1
0.523528 1
1.370002 1
1.380965 1
1.392951 1
1.498057 1
1.519248 1
1.574350 1
1.658908 1
1.799904 1
1.922485 1
2.034178 1
2.109414 1
2.161843 1
2.206169 1
2.234896 1
2.515721 1
2.648981 1
2.816705 1
3.108603 1
3.146814 1
3.374416 1
3.563325 1
3.576478 1
3.716739 1
3.741588 1
3.940259 1
4.079525 1
4.119620 1
4.217076 1
..
14.554677 1
14.680936 1
14.713020 1
14.776250 1
15.538490 1
15.542603 1
15.714571 1
15.953850 1
16.234370 1
16.913248 1
16.959240 1
17.032646 1
18.583826 1
18.946930 1
18.954570 1
19.422610 1
20.162010 1
20.317930 1
20.369590 1
20.747431 1
22.353479 1
22.404560 1
25.404600 1
26.219198 1
26.874690 1
27.874160 1
28.104046 1
29.864164 1
33.341860 1
35.752872 1
dtype: int64
percentages for suicide rate per 100,000 population, age adjusted
suicideper100th
0.201449 0.469484
0.523528 0.469484
1.370002 0.469484
1.380965 0.469484
1.392951 0.469484
1.498057 0.469484
1.519248 0.469484
1.574350 0.469484
1.658908 0.469484
1.799904 0.469484
1.922485 0.469484
2.034178 0.469484
2.109414 0.469484
2.161843 0.469484
2.206169 0.469484
2.234896 0.469484
2.515721 0.469484
2.648981 0.469484
2.816705 0.469484
3.108603 0.469484
3.146814 0.469484
3.374416 0.469484
3.563325 0.469484
3.576478 0.469484
3.716739 0.469484
3.741588 0.469484
3.940259 0.469484
4.079525 0.469484
4.119620 0.469484
4.217076 0.469484

14.554677 0.469484
14.680936 0.469484
14.713020 0.469484
14.776250 0.469484
15.538490 0.469484
15.542603 0.469484
15.714571 0.469484
15.953850 0.469484
16.234370 0.469484
16.913248 0.469484
16.959240 0.469484
17.032646 0.469484
18.583826 0.469484
18.946930 0.469484
18.954570 0.469484
19.422610 0.469484
20.162010 0.469484
20.317930 0.469484
20.369590 0.469484
20.747431 0.469484
22.353479 0.469484
22.404560 0.469484
25.404600 0.469484
26.219198 0.469484
26.874690 0.469484
27.874160 0.469484
28.104046 0.469484
29.864164 0.469484
33.341860 0.469484
35.752872 0.469484
dtype: float64

The suicide rate is the mortality due to self-inflicted injury, per 100 000 standard population, adjusted for age. It is taken from a combination of time series from WHO Violence and Injury Prevention (VIP) and data from WHO Global Burden of Disease 2002 and 2004. [Further information about worldwide suicide rates can be found at the WHO web site.]

As you can see, the suicide rate is a continuous variable, so it is hard to see patterns at this stage. I may get a better picture at a later stage, when I learn to put continuous variables into ranges.
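As a preview of that idea, here is a minimal sketch of what the binning might look like with pandas.cut, assuming the same CSV file and column name as the program above (this is my own illustration, not part of this week's assignment):

#group the continuous suicide rate into bins of width 5 so patterns are easier to see
import pandas

data = pandas.read_csv('mynewgapminder.csv', low_memory=False)
data['suicideper100th'] = pandas.to_numeric(data['suicideper100th'], errors='coerce')

bins = pandas.cut(data['suicideper100th'], [0, 5, 10, 15, 20, 25, 30, 35, 40])
print(bins.value_counts(sort=False, dropna=False))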

Suicide rates are not available for all countries in the Gapminder data set, so I decided to look only at those countries where the suicide rate was available:

Total number of countries where suicide stats exist
191

I next created a frequency table for polityscore, using the subset of countries:

counts for polity score
polityscore
-10 2
-9 4
-8 2
-7 12
-6 3
-5 2
-4 6
-3 6
-2 5
-1 4
0 6
1 3
2 3
3 2
4 4
5 7
6 10
7 13
8 19
9 14
10 32
dtype: int64
percentages for polity score
polityscore
-10 0.938967
-9 1.877934
-8 0.938967
-7 5.633803
-6 1.408451
-5 0.938967
-4 2.816901
-3 2.816901
-2 2.347418
-1 1.877934
0 2.816901
1 1.408451
2 1.408451
3 0.938967
4 1.877934
5 3.286385
6 4.694836
7 6.103286
8 8.920188
9 6.572770
10 15.023474
dtype: float64

The polity score, or democracy score, is based on the 2009 Polity IV Project. It is calculated by subtracting an autocracy score from a democracy score, giving a summary measure of a country’s democratic and free nature. -10 is the lowest value, and 10 the highest. [Further information about the Polity Project can be found at the Center for Systemic Peace web site.]

We can see that most countries scored highly on this measure. 32 countries, or 15% of the total, scored the highest possible score of 10, which means that they have open elections, checks on executive authority, and other measures of democracy. 23 countries (12% of the total) have a score below -5, which would make them autocracies.

I next created a frequency table for region, again using the subset of countries:

counts for region
Asia=1, Europe=2, Africa=3, Middle East=4, North and Central America=5,South America=6, Oceania=7
region
1 31
2 46
3 49
4 16
5 22
6 12
7 15
dtype: int64
percentages for region
region
1 14.553991
2 21.596244
3 23.004695
4 7.511737
5 10.328638
6 5.633803
7 7.042254
dtype: float64

The region variable did not exist in the original Gapminder data set, but I thought it would be useful to see patterns across the geographical regions of the world. I assigned a number to each region:

  1. Asia
  2. Europe
  3. Africa
  4. Middle East
  5. North and Central America
  6. South America
  7. Oceania

As we can see, Africa has the largest number of countries in this data set (49 countries, or 23% of the total), followed closely by Europe (46 countries, or 21.59% of the total). The smallest number of countries is found in South America (12, or 5.63% of the total).

Selecting a data set

This is part of my coursework for Data Management and Visualization.

I decided to use the Gapminder codebook, and I am going to look at factors that affect the suicide rates of different countries.

I am particularly interested in the relationship between alcohol consumption and the suicide rate. My hypothesis is that alcohol consumption will be positively correlated to suicide rates. Alcohol can lower inhibitions and therefore increase the likelihood that someone will follow through on suicidal thoughts. Also, factors that increase the likelihood of alcoholism (for example, depression) might also be associated with suicide rates.

Some previous studies have found that alcohol dependence is an important risk factor for suicidal behaviour, for example, Sher 2005. However, the link has not been shown conclusively, for example, Isacsson 2000 found no correlation between alcohol consumption and suicide. Lester 1995, in a study of 13 nations, found that “suicide and homicide rates were usually, but not always, positively associated with the per capita consumption of alcohol”. However, Lester’s study uses old data, and I’m curious to know if this correlation holds up with 21st century data. 

It’s likely that other factors moderate the influence of alcoholism and suicide. These factors include income per person, employment rate, democracy score, and urban rate.

There may be other confounding factors, such as under-reporting of suicide or cultural factors around alcohol consumption. However, these are beyond the scope of this study.

I have created a personal codebook from the Gapminder data. My personal codebook includes variables for suicideper100TH (mortality due to self-inflicted injury, per 100 000 standard population, age adjusted) and alcconsumption (recorded and estimated average alcohol consumption, adult (15+) per capita consumption in litres pure alcohol). It also includes variables for income per person, polity (democracy) score, employment rate, and urbanization rate.

 

Eight deadly words

 I don’t care what happens to these people.

The “eight deadly words” were probably not first uttered by writer Dorothy J. Heydt on Usenet in 1991, but she was the first to name them as such. They are the words that no writer or producer or actor wants to hear about their work. They are the words that readers utter when they put down a book, or viewers proclaim when they switch off a film or stop watching a television series.

[Image: a toadstool. Caption: Deadly like a toadstool]

Those words are “I don’t care what happens to these people”. If you don’t care, why should you read (or watch) on?

Now, “caring” about the characters is not necessarily the same as liking them. Humbert Humbert from Lolita is a vile excuse for a human being. If you find yourself liking him, please get help now. Eva Khatchadourian, J.R. Ewing, Joffrey Baratheon: these are not nice people. However, you care about what happens to them. You might hope that really bad things happen to them, but you want to know what those bad things are.

“These people” might not even be people. In Watership Down, you come to care about rabbits; in Wall-E, you are emotionally invested in the fate of a robot.

Think of the fiction that has remained with you. The characters are real, and you want to know what happens to them. That’s the reason why readers clamored for more stories about Sherlock Holmes and Anne Shirley; that’s why viewers keep coming back for Tyrion Lannister and Walter White.

I get frustrated by clunky writing or unrealistic plotting or stunted dialogue. But the one thing that will make me stick with a story is that I want to find out what happens. The one thing that will make me put it away is when I just don’t care.