*This is part of my coursework for Data Analysis Tools.*

I am using Python to analyse the data available from Gapminder. I want to compare the suicide rates and the alcohol consumption for the countries in the data set.

## Categorizing the explanatory variable

Both response variable (suicide rate per 100000) and explanatory variable (alcohol consumption per adult in litres) are quantitative variables. In order to perform an Analysis of Variance (ANOVA) test, I had to categorize the explanatory variable (alcohol consumption):

- Very low: 0 to 2 litres
- Low: 2 to 5 litres
- Medium: 5 to 8.5 litres
- High: 8.5 to 12 litres
- Very high: >12 litres

There are 191 countries, split roughly equally into these groups:

Frequency table for alcohol consumption

NaN 6

low 40

medium 39

very low 40

very high 32

high 34

dtype: int64

## ANOVA

The ANOVA revealed that suicide rate and alcohol consumption were significantly associated, with an F-statistic of 5.750 and a p-value of 0.000223.

OLS Regression Results

==============================================================================

Dep. Variable: suicideper100th R-squared: 0.113

Model: OLS Adj. R-squared: 0.094

Method: Least Squares F-statistic: 5.750

Date: Tue, 10 Nov 2015 Prob (F-statistic): 0.000223

Time: 22:05:06 Log-Likelihood: -591.49

No. Observations: 185 AIC: 1193.

Df Residuals: 180 BIC: 1209.

Df Model: 4

Covariance Type: nonrobust

===============================================================================================

coef std err t P>|t| [95.0% Conf. Int.]

———————————————————————————————–

Intercept 9.8907 1.029 9.610 0.000 7.860 11.922

C(alcohol_cat)[T.low] -1.5358 1.400 -1.097 0.274 -4.298 1.227

C(alcohol_cat)[T.medium] -0.9225 1.408 -0.655 0.513 -3.701 1.856

C(alcohol_cat)[T.very high] 4.0614 1.478 2.748 0.007 1.145 6.978

C(alcohol_cat)[T.very low] -2.1734 1.400 -1.553 0.122 -4.936 0.589

==============================================================================

Omnibus: 60.043 Durbin-Watson: 0.250

Prob(Omnibus): 0.000 Jarque-Bera (JB): 143.003

Skew: 1.436 Prob(JB): 8.86e-32

Kurtosis: 6.209 Cond. No. 6.06

==============================================================================

## Post-hoc Analysis

Post-hoc comparisons (Tukey HSD) revealed that countries in the “very high” alcohol consumption group, consuming more than 12 litres of alcohol per adult, had a significantly higher suicide rate that countries in the “medium”, “low”, and “very low” groups. No other comparisons were significant.

Multiple Comparison of Means – Tukey HSD,FWER=0.05

====================================================

group1 group2 meandiff lower upper reject

—————————————————-

high low -1.5358 -5.3935 2.3219 False

high medium -0.9225 -4.8029 2.9579 False

high very high 4.0614 -0.0119 8.1346 False

high very low -2.1734 -6.0311 1.6843 False

low medium 0.6133 -3.1083 4.3349 False

low very high 5.5972 1.6748 9.5195 True

low very low -0.6376 -4.3356 3.0604 False

medium very high 4.9839 1.0392 8.9285 True

medium very low -1.2509 -4.9725 2.4707 False

very high very low -6.2348 -10.1571 -2.3124 True

—————————————————-

## Python program

#import data analysis package

import pandas

import numpy

import statsmodels.formula.api as smf

import statsmodels.stats.multicomp as multi# bug fix for display formats to avoid run time errors

pandas.set_option(‘display.float_format’, lambda x:’%f’%x)#import the entire data set to memory

data = pandas.read_csv(‘mynewgapminder.csv’, low_memory=False)#ensure that variables are numeric

data[‘suicideper100th’] = data[‘suicideper100th’].convert_objects(convert_numeric=True)

data[‘alcconsumption’] = data[‘alcconsumption’].convert_objects(convert_numeric=True)#I only want to look at countries where stats exist for suicide rate

#get subset where suicide rates exist

sub1=data[data[‘suicideper100th’]>0]

sub2=sub1.copy()#Number of observations (rows)

print (‘Number of countries’)

print(len(sub2))# categorize alcohol consumption into ‘very low’, ‘low’, ‘medium’, ‘high’, and ‘very high’

sub2[‘alcohol_cat’] = pandas.cut(sub1.alcconsumption, [0, 2, 5, 8.5, 12, 25], labels=[‘very low’, ‘low’, ‘medium’, ‘high’, ‘very high’])

sub2[‘alcohol_cat’] = sub2.alcohol_cat.astype(numpy.object)#show frequency table for alcohol consumption categories

print (‘Frequency table for alcohol consumption’)

c2 = sub2[‘alcohol_cat’].value_counts(sort=False, dropna=False)

print(c2)

print ()#create new subgroup for suicide and alcohol

sub3 = sub2[[‘suicideper100th’, ‘alcohol_cat’]].dropna()# using ols function for calculating the F-statistic and associated p value

model2 = smf.ols(formula=’suicideper100th ~ C(alcohol_cat)’, data=sub3).fit()

print (model2.summary())#print means

print (‘means for suicideper100th by alcohol consumption’)

m1= sub3.groupby(‘alcohol_cat’).mean()

print (m1)#print standard deviations

print (‘standard deviations for suicideper100th by alcohol consumption’)

sd1 = sub3.groupby(‘alcohol_cat’).std()

print (sd1)#perform post-hoc analysis

mc1 = multi.MultiComparison(sub3[‘suicideper100th’], sub3[‘alcohol_cat’])

res1 = mc1.tukeyhsd()

print(res1.summary())