Testing a Potential Moderator

This is part of my coursework for Data Analysis Tools.

I am using Python to analyse the data available from Gapminder. Using Pearson’s correlation coefficient (r), I found that alcohol consumption has a significant positive correlation with suicide rate for the whole data set (r=0.35, p=7.31e-07). However, I want to see if this correlation is dependent on the geographical region of the world.

I created a new variable called “region” and assigned a number for each geographical area:

  1. Asia (34 countries)
  2. Europe (50 countries)
  3. Africa (50 countries)
  4. Middle East (17 countries)
  5. Americas (43 countries)
  6. Oceania (19 countries)

I then tested the correlation coefficient for each region:

Association between alcohol consumption and suicide rate for ASIA
(0.10174016798317355, 0.59947647029652795)

Association between alcohol consumption and suicide rate for EUROPE
(0.60505083925984826, 1.7244830114896035e-05)

Association between alcohol consumption and suicide rate for AFRICA
(0.22021753857669515, 0.12839491207675466)

Association between alcohol consumption and suicide rate for MIDDLE EAST
(-0.056124327060797188, 0.83644085383899258)

Association between alcohol consumption and suicide rate for AMERICAS
(0.070065617449753828, 0.69376393774656941)

Association between alcohol consumption and suicide rate for OCEANIA
(0.33344561639190951, 0.24400145879606708)

In Europe, the correlation between alcohol consumption and suicide rate is significant, with a p-value of 1.72e-05, or 0.0000172.  The correlation coefficient is 0.6, indicating a strong positive linear correlation. So within Europe, a high rate of alcohol consumption is associated with a high rate of suicide.

For all other regions, the correlation between alcohol consumption and suicide rate is not significant, with a p-value greater than 0.05. For these regions, we must accept the null hypothesis that there is no relationship between alcohol consumption and suicide rate.

This is further illustrated by looking at the scatterplots for the different regions:

Scatterplots

Python program

#import data analysis package
import pandas
import numpy
import seaborn
import scipy
import matplotlib.pyplot as plt

# bug fix for display formats to avoid run time errors
pandas.set_option(‘display.float_format’, lambda x:’%f’%x)

#import the entire data set to memory
data = pandas.read_csv(‘mynewgapminder.csv’, low_memory=False)

#ensure that variables are numeric
data[‘suicideper100th’] = data[‘suicideper100th’].convert_objects(convert_numeric=True)
data[‘alcconsumption’] = data[‘alcconsumption’].convert_objects(convert_numeric=True)
data[‘region’] = data[‘region’].convert_objects(convert_numeric=True)
data[‘region’] = data[‘region’].replace(”,numpy.nan)

#clean NAs
data_clean=data.dropna()

#get the correlation coefficient for all regions
print (‘Association between alcohol consumption and suicide rate for ALL REGIONS’)
print (scipy.stats.pearsonr(data_clean[‘alcconsumption’], data_clean
[‘suicideper100th’]))
print(”)

#display counts for regions
print(‘counts for region’)
print (“Asia=1, Europe=2, Africa=3, Middle East=4, Americas=5, Oceania=6″)
ct3 = data.groupby(‘region’).size()
print (ct3)
print(”)

#break into subgroups depending on region
subAsia=data_clean[(data_clean[‘region’])==1]
subEurope=data_clean[(data_clean[‘region’])==2]
subAfrica=data_clean[(data_clean[‘region’])==3]
subMiddleEast=data_clean[(data_clean[‘region’])==4]
subAmericas=data_clean[(data_clean[‘region’])==5]
subOceania=data_clean[(data_clean[‘region’])==6]

#get the correlation coefficient for Asia
print (‘Association between alcohol consumption and suicide rate for ASIA’)
print (scipy.stats.pearsonr(subAsia[‘alcconsumption’], subAsia
[‘suicideper100th’]))
print(”)

#plot alcohol consumption and suicide rate as a scatterplot for Asia
plt.figure(401)
scat1 = seaborn.regplot(x=”alcconsumption”, y=”suicideper100th”, data=subAsia)
plt.xlabel(‘Alcohol consumption per adult in litres’)
plt.ylabel(‘Suicide rate per 100,000 population’)
plt.title(‘Association Between Alcohol Consumption and Suicide Rate in ASIA’)

#get the correlation coefficient for Europe
print (‘Association between alcohol consumption and suicide rate for EUROPE’)
print (scipy.stats.pearsonr(subEurope[‘alcconsumption’], subEurope
[‘suicideper100th’]))
print(”)

#plot alcohol consumption and suicide rate as a scatterplot for Europe
plt.figure(402)
scat1 = seaborn.regplot(x=”alcconsumption”, y=”suicideper100th”, data=subEurope)
plt.xlabel(‘Alcohol consumption per adult in litres’)
plt.ylabel(‘Suicide rate per 100,000 population’)
plt.title(‘Association Between Alcohol Consumption and Suicide Rate in EUROPE’)

#get the correlation coefficient for Africa
print (‘Association between alcohol consumption and suicide rate for AFRICA’)
print (scipy.stats.pearsonr(subAfrica[‘alcconsumption’], subAfrica
[‘suicideper100th’]))
print(”)

#plot alcohol consumption and suicide rate as a scatterplot for Africa
plt.figure(403)
scat1 = seaborn.regplot(x=”alcconsumption”, y=”suicideper100th”, data=subAfrica)
plt.xlabel(‘Alcohol consumption per adult in litres’)
plt.ylabel(‘Suicide rate per 100,000 population’)
plt.title(‘Association Between Alcohol Consumption and Suicide Rate in AFRICA’)

#get the correlation coefficient for Middle East
print (‘Association between alcohol consumption and suicide rate for MIDDLE EAST’)
print (scipy.stats.pearsonr(subMiddleEast[‘alcconsumption’], subMiddleEast
[‘suicideper100th’]))
print(”)

#plot alcohol consumption and suicide rate as a scatterplot for the Middle East
plt.figure(404)
scat1 = seaborn.regplot(x=”alcconsumption”, y=”suicideper100th”, data=subMiddleEast)
plt.xlabel(‘Alcohol consumption per adult in litres’)
plt.ylabel(‘Suicide rate per 100,000 population’)
plt.title(‘Association Between Alcohol Consumption and Suicide Rate in MIDDLE EAST’)

#get the correlation coefficient for Americas
print (‘Association between alcohol consumption and suicide rate for AMERICAS’)
print (scipy.stats.pearsonr(subAmericas[‘alcconsumption’], subAmericas
[‘suicideper100th’]))
print(”)

#plot alcohol consumption and suicide rate as a scatterplot for the Americas
plt.figure(405)
scat1 = seaborn.regplot(x=”alcconsumption”, y=”suicideper100th”, data=subAmericas)
plt.xlabel(‘Alcohol consumption per adult in litres’)
plt.ylabel(‘Suicide rate per 100,000 population’)
plt.title(‘Association Between Alcohol Consumption and Suicide Rate in AMERICAS’)

#get the correlation coefficient for Oceania
print (‘Association between alcohol consumption and suicide rate for OCEANIA’)
print (scipy.stats.pearsonr(subOceania[‘alcconsumption’], subOceania
[‘suicideper100th’]))
print(”)

#plot alcohol consumption and suicide rate as a scatterplot for Oceania
plt.figure(406)
scat1 = seaborn.regplot(x=”alcconsumption”, y=”suicideper100th”, data=subOceania)
plt.xlabel(‘Alcohol consumption per adult in litres’)
plt.ylabel(‘Suicide rate per 100,000 population’)
plt.title(‘Association Between Alcohol Consumption and Suicide Rate in OCEANIA’)

Generating a Correlation Coefficient

This is part of my coursework for Data Analysis Tools.

I am using Python to analyse the data available from Gapminder. I want to compare the suicide rates against alcohol consumption for the countries in the data set. Both response variable (suicide rate per 100000) and explanatory variable (alcohol consumption per adult in litres) are quantitative variables, and so Pearson correlation coefficient (r) can be used.

The scatterplot for the two variables seems to show a positive linear correlation:

Scatterplot_alcohol_suicide

The correlation coefficient is 0.35, indicating a weak positive linear correlation.

The r2 value is 0.12, indicating that only 12% of the variability in the suicide rate is due to alcohol consumption.

The p-value is 7.31e-07, or 0.000000731, indicating that the correlation is highly significant.

This suggests that an increased alcohol consumption in a country is correlated with an increase in the recorded suicide rate, although the relationship is not very strong.

It’s impossible to say whether this is causal (high consumption of alcohol leads to an increase in suicide) or other factors are involved (for example, a high rate of depression could cause both alcoholism and suicide).

Python program

#import data analysis package
import pandas
import numpy
import seaborn
import scipy
import matplotlib.pyplot as plt

# bug fix for display formats to avoid run time errors
pandas.set_option(‘display.float_format’, lambda x:’%f’%x)

#import the entire data set to memory
data = pandas.read_csv(‘mynewgapminder.csv’, low_memory=False)

#ensure that variables are numeric
data[‘suicideper100th’] = data[‘suicideper100th’].convert_objects(convert_numeric=True)
data[‘alcconsumption’] = data[‘alcconsumption’].convert_objects(convert_numeric=True)

#plot alcohol consumption and suicide rate as a scatterplot
scat1 = seaborn.regplot(x=”alcconsumption”, y=”suicideper100th”, data=data)
plt.xlabel(‘Alcohol consumption per adult in litres’)
plt.ylabel(‘Suicide rate per 100,000 population’)
plt.title(‘Scatterplot for the Association Between Alcohol Consumption and Suicide Rate’)

#clean NAs
data_clean=data.dropna()

#get the correlation coefficient
print (‘association between alcohol consumption and suicide rate’)
print (scipy.stats.pearsonr(data_clean[‘alcconsumption’], data_clean
[‘suicideper100th’]))

Running a Chi-Square Test

This is part of my coursework for Data Analysis Tools.

I am using Python to analyse the data available from Gapminder. I want to compare the suicide rates and the polity scores for the countries in the data set.

Categorizing the variables

Both response variable (polity score) and explanatory variable (suicide rate per 100000) are quantitative variables. In order to perform a Chi-square test, I had to categorize both these variables.

I created 5 roughly equal categories for suicide rate:

  1. Very low: 0 to 4.5 suicides per 100000 (36 countries)
  2. Low: 4.5 to 7 suicides per 100000 ( 38 countries)
  3. Medium: 7 to 10 suicides per 100000 ( 40 countries)
  4. High: 10 to 14 suicides per 100000 (42 countries)
  5. Very high: >14 suicides per 100000 (35 countries)

I created a binary categorical variable for policy, dividing countries between “democratic” and “not democratic” (based on the Polity IV project):

  • 1: Democratic: 6 to 10 (88 countries)
  • 0: Not democratic: -10 to 5 (71 countries)

Contingency tables

Counts:

 

Column_count_suicide_polity

Column percentages:

Column_percent_suicide_polity

Chi square

The Chi-square revealed that suicide rate was not significantly associated with whether a country was a democracy. The Chi-square value was 3.17 and the p-value was 0.53, indicating that we should accept the null hypothesis in this instance.

Program read-out:

chi-square value, p value, expected counts
(3.1732413013999596, 0.52926374689586475, 4, array([[ 12.50314465, 12.50314465, 15.18238994, 16.52201258,
14.28930818],
[ 15.49685535, 15.49685535, 18.81761006, 20.47798742,
17.71069182]]))

 Python program

#import data analysis packages
import pandas
import numpy
import scipy.stats
import seaborn
import matplotlib.pyplot as plt

# bug fix for display formats to avoid run time errors
pandas.set_option(‘display.float_format’, lambda x:’%f’%x)

#import the entire data set to memory
data = pandas.read_csv(‘mynewgapminder.csv’, low_memory=False)

#ensure that variables are numeric
data[‘suicideper100th’] = data[‘suicideper100th’].convert_objects(convert_numeric=True)
data[‘alcconsumption’] = data[‘alcconsumption’].convert_objects(convert_numeric=True)
data[‘polityscore’] = data[‘polityscore’].convert_objects(convert_numeric=True)

##I only want to look at countries where stats exist for suicide rate
##get subset where suicide rates exist
sub1=data[data[‘suicideper100th’]>0]
sub2=sub1.copy()

#Number of observations (rows)
print (‘Number of countries’)
print(len(sub2))

# categorize suicide rates into ‘very low’, ‘low’, ‘medium’, ‘high’, and ‘very high’
#sub2[‘suicide_cat’] = pandas.cut(sub1.suicideper100th, [0, 4.5, 7, 10, 14, 40], labels=[‘very low’, ‘low’, ‘medium’, ‘high’, ‘very high’])
sub2[‘suicide_cat’] = pandas.cut(sub1.suicideper100th, [0, 4.5, 7, 10, 14, 40], labels=[1,2,3,4,5])
sub2[‘suicide_cat’] = sub2.suicide_cat.astype(numpy.object)

# categorize polity score into ‘democracy’ (1) and ‘not democracy’ (0)
#sub2[‘democracy’] = pandas.cut(sub1.polityscore, [-11, 5, 11], labels=[‘non-democratic’,’democratic’])
sub2[‘democracy’] = pandas.cut(sub1.polityscore, [-11, 5, 11], labels=[0,1])
sub2[‘democracy’] = sub2.democracy.astype(numpy.object)

#show frequency table for democracy
print (‘Frequency table for democracy’)
c2 = sub2[‘democracy’].value_counts(sort=False, dropna=False)
print(c2)
print ()

#show frequency table for suicide rate
print (‘Frequency table for suicide rate’)
c2 = sub2[‘suicide_cat’].value_counts(sort=False, dropna=False)
print(c2)
print ()

# contingency table of observed counts
ct1=pandas.crosstab(sub2[‘democracy’], sub2[‘suicide_cat’])
print (ct1)

# column percentages
colsum=ct1.sum(axis=0)
colpct=ct1/colsum
print(colpct)

# chi-square
print (‘chi-square value, p value, expected counts’)
cs1= scipy.stats.chi2_contingency(ct1)
print (cs1)

Running an ANOVA test

This is part of my coursework for Data Analysis Tools.

I am using Python to analyse the data available from Gapminder. I want to compare the suicide rates and the alcohol consumption for the countries in the data set.

Categorizing the explanatory variable

Both response variable (suicide rate per 100000) and explanatory variable (alcohol consumption per adult in litres) are quantitative variables. In order to perform an Analysis of Variance (ANOVA) test, I had to categorize the explanatory variable (alcohol consumption):

  • Very low: 0 to 2 litres
  • Low: 2 to 5 litres
  • Medium: 5 to 8.5 litres
  • High: 8.5 to 12 litres
  • Very high: >12 litres

There are 191 countries, split roughly equally into these groups:

Frequency table for alcohol consumption
NaN 6
low 40
medium 39
very low 40
very high 32
high 34
dtype: int64

ANOVA

The ANOVA revealed that suicide rate and alcohol consumption were significantly associated, with an F-statistic of 5.750 and a p-value of 0.000223.

OLS Regression Results
==============================================================================
Dep. Variable: suicideper100th R-squared: 0.113
Model: OLS Adj. R-squared: 0.094
Method: Least Squares F-statistic: 5.750
Date: Tue, 10 Nov 2015 Prob (F-statistic): 0.000223
Time: 22:05:06 Log-Likelihood: -591.49
No. Observations: 185 AIC: 1193.
Df Residuals: 180 BIC: 1209.
Df Model: 4
Covariance Type: nonrobust
===============================================================================================
coef std err t P>|t| [95.0% Conf. Int.]
———————————————————————————————–
Intercept 9.8907 1.029 9.610 0.000 7.860 11.922
C(alcohol_cat)[T.low] -1.5358 1.400 -1.097 0.274 -4.298 1.227
C(alcohol_cat)[T.medium] -0.9225 1.408 -0.655 0.513 -3.701 1.856
C(alcohol_cat)[T.very high] 4.0614 1.478 2.748 0.007 1.145 6.978
C(alcohol_cat)[T.very low] -2.1734 1.400 -1.553 0.122 -4.936 0.589
==============================================================================
Omnibus: 60.043 Durbin-Watson: 0.250
Prob(Omnibus): 0.000 Jarque-Bera (JB): 143.003
Skew: 1.436 Prob(JB): 8.86e-32
Kurtosis: 6.209 Cond. No. 6.06
==============================================================================

Post-hoc Analysis

Post-hoc comparisons (Tukey HSD) revealed that countries in the “very high” alcohol consumption group, consuming more than 12 litres of alcohol per adult, had a significantly higher suicide rate that countries in the “medium”, “low”, and “very low” groups. No other comparisons were significant.

Multiple Comparison of Means – Tukey HSD,FWER=0.05
====================================================
group1 group2 meandiff lower upper reject
—————————————————-
high low -1.5358 -5.3935 2.3219 False
high medium -0.9225 -4.8029 2.9579 False
high very high 4.0614 -0.0119 8.1346 False
high very low -2.1734 -6.0311 1.6843 False
low medium 0.6133 -3.1083 4.3349 False
low very high 5.5972 1.6748 9.5195 True
low very low -0.6376 -4.3356 3.0604 False
medium very high 4.9839 1.0392 8.9285 True
medium very low -1.2509 -4.9725 2.4707 False
very high very low -6.2348 -10.1571 -2.3124 True
—————————————————-

Python program

#import data analysis package
import pandas
import numpy
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

# bug fix for display formats to avoid run time errors
pandas.set_option(‘display.float_format’, lambda x:’%f’%x)

#import the entire data set to memory
data = pandas.read_csv(‘mynewgapminder.csv’, low_memory=False)

#ensure that variables are numeric
data[‘suicideper100th’] = data[‘suicideper100th’].convert_objects(convert_numeric=True)
data[‘alcconsumption’] = data[‘alcconsumption’].convert_objects(convert_numeric=True)

#I only want to look at countries where stats exist for suicide rate
#get subset where suicide rates exist
sub1=data[data[‘suicideper100th’]>0]
sub2=sub1.copy()

#Number of observations (rows)
print (‘Number of countries’)
print(len(sub2))

# categorize alcohol consumption into ‘very low’, ‘low’, ‘medium’, ‘high’, and ‘very high’
sub2[‘alcohol_cat’] = pandas.cut(sub1.alcconsumption, [0, 2, 5, 8.5, 12, 25], labels=[‘very low’, ‘low’, ‘medium’, ‘high’, ‘very high’])
sub2[‘alcohol_cat’] = sub2.alcohol_cat.astype(numpy.object)

#show frequency table for alcohol consumption categories
print (‘Frequency table for alcohol consumption’)
c2 = sub2[‘alcohol_cat’].value_counts(sort=False, dropna=False)
print(c2)
print ()

#create new subgroup for suicide and alcohol
sub3 = sub2[[‘suicideper100th’, ‘alcohol_cat’]].dropna()

# using ols function for calculating the F-statistic and associated p value
model2 = smf.ols(formula=’suicideper100th ~ C(alcohol_cat)’, data=sub3).fit()
print (model2.summary())

#print means
print (‘means for suicideper100th by alcohol consumption’)
m1= sub3.groupby(‘alcohol_cat’).mean()
print (m1)

#print standard deviations
print (‘standard deviations for suicideper100th by alcohol consumption’)
sd1 = sub3.groupby(‘alcohol_cat’).std()
print (sd1)

#perform post-hoc analysis
mc1 = multi.MultiComparison(sub3[‘suicideper100th’], sub3[‘alcohol_cat’])
res1 = mc1.tukeyhsd()
print(res1.summary())