Running an ANOVA test

This is part of my coursework for Data Analysis Tools.

I am using Python to analyse the data available from Gapminder. I want to compare the suicide rates and the alcohol consumption for the countries in the data set.

Categorizing the explanatory variable

Both response variable (suicide rate per 100000) and explanatory variable (alcohol consumption per adult in litres) are quantitative variables. In order to perform an Analysis of Variance (ANOVA) test, I had to categorize the explanatory variable (alcohol consumption):

  • Very low: 0 to 2 litres
  • Low: 2 to 5 litres
  • Medium: 5 to 8.5 litres
  • High: 8.5 to 12 litres
  • Very high: >12 litres

There are 191 countries, split roughly equally into these groups:

Frequency table for alcohol consumption
NaN 6
low 40
medium 39
very low 40
very high 32
high 34
dtype: int64

ANOVA

The ANOVA revealed that suicide rate and alcohol consumption were significantly associated, with an F-statistic of 5.750 and a p-value of 0.000223.

OLS Regression Results
==============================================================================
Dep. Variable: suicideper100th R-squared: 0.113
Model: OLS Adj. R-squared: 0.094
Method: Least Squares F-statistic: 5.750
Date: Tue, 10 Nov 2015 Prob (F-statistic): 0.000223
Time: 22:05:06 Log-Likelihood: -591.49
No. Observations: 185 AIC: 1193.
Df Residuals: 180 BIC: 1209.
Df Model: 4
Covariance Type: nonrobust
===============================================================================================
coef std err t P>|t| [95.0% Conf. Int.]
———————————————————————————————–
Intercept 9.8907 1.029 9.610 0.000 7.860 11.922
C(alcohol_cat)[T.low] -1.5358 1.400 -1.097 0.274 -4.298 1.227
C(alcohol_cat)[T.medium] -0.9225 1.408 -0.655 0.513 -3.701 1.856
C(alcohol_cat)[T.very high] 4.0614 1.478 2.748 0.007 1.145 6.978
C(alcohol_cat)[T.very low] -2.1734 1.400 -1.553 0.122 -4.936 0.589
==============================================================================
Omnibus: 60.043 Durbin-Watson: 0.250
Prob(Omnibus): 0.000 Jarque-Bera (JB): 143.003
Skew: 1.436 Prob(JB): 8.86e-32
Kurtosis: 6.209 Cond. No. 6.06
==============================================================================

Post-hoc Analysis

Post-hoc comparisons (Tukey HSD) revealed that countries in the “very high” alcohol consumption group, consuming more than 12 litres of alcohol per adult, had a significantly higher suicide rate that countries in the “medium”, “low”, and “very low” groups. No other comparisons were significant.

Multiple Comparison of Means – Tukey HSD,FWER=0.05
====================================================
group1 group2 meandiff lower upper reject
—————————————————-
high low -1.5358 -5.3935 2.3219 False
high medium -0.9225 -4.8029 2.9579 False
high very high 4.0614 -0.0119 8.1346 False
high very low -2.1734 -6.0311 1.6843 False
low medium 0.6133 -3.1083 4.3349 False
low very high 5.5972 1.6748 9.5195 True
low very low -0.6376 -4.3356 3.0604 False
medium very high 4.9839 1.0392 8.9285 True
medium very low -1.2509 -4.9725 2.4707 False
very high very low -6.2348 -10.1571 -2.3124 True
—————————————————-

Python program

#import data analysis package
import pandas
import numpy
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

# bug fix for display formats to avoid run time errors
pandas.set_option(‘display.float_format’, lambda x:’%f’%x)

#import the entire data set to memory
data = pandas.read_csv(‘mynewgapminder.csv’, low_memory=False)

#ensure that variables are numeric
data[‘suicideper100th’] = data[‘suicideper100th’].convert_objects(convert_numeric=True)
data[‘alcconsumption’] = data[‘alcconsumption’].convert_objects(convert_numeric=True)

#I only want to look at countries where stats exist for suicide rate
#get subset where suicide rates exist
sub1=data[data[‘suicideper100th’]>0]
sub2=sub1.copy()

#Number of observations (rows)
print (‘Number of countries’)
print(len(sub2))

# categorize alcohol consumption into ‘very low’, ‘low’, ‘medium’, ‘high’, and ‘very high’
sub2[‘alcohol_cat’] = pandas.cut(sub1.alcconsumption, [0, 2, 5, 8.5, 12, 25], labels=[‘very low’, ‘low’, ‘medium’, ‘high’, ‘very high’])
sub2[‘alcohol_cat’] = sub2.alcohol_cat.astype(numpy.object)

#show frequency table for alcohol consumption categories
print (‘Frequency table for alcohol consumption’)
c2 = sub2[‘alcohol_cat’].value_counts(sort=False, dropna=False)
print(c2)
print ()

#create new subgroup for suicide and alcohol
sub3 = sub2[[‘suicideper100th’, ‘alcohol_cat’]].dropna()

# using ols function for calculating the F-statistic and associated p value
model2 = smf.ols(formula=’suicideper100th ~ C(alcohol_cat)’, data=sub3).fit()
print (model2.summary())

#print means
print (‘means for suicideper100th by alcohol consumption’)
m1= sub3.groupby(‘alcohol_cat’).mean()
print (m1)

#print standard deviations
print (‘standard deviations for suicideper100th by alcohol consumption’)
sd1 = sub3.groupby(‘alcohol_cat’).std()
print (sd1)

#perform post-hoc analysis
mc1 = multi.MultiComparison(sub3[‘suicideper100th’], sub3[‘alcohol_cat’])
res1 = mc1.tukeyhsd()
print(res1.summary())

Advertisements

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s