Generating a Correlation Coefficient

This is part of my coursework for Data Analysis Tools.

I am using Python to analyse the data available from Gapminder. I want to compare suicide rates against alcohol consumption for the countries in the data set. Both the response variable (suicide rate per 100,000) and the explanatory variable (alcohol consumption per adult in litres) are quantitative, so the Pearson correlation coefficient (r) can be used.
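For reference, this is the standard definition of Pearson's r for n paired observations. The symbols are my own notation: x_i is a country's alcohol consumption, y_i is its suicide rate, and the barred values are the sample means.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The value always lies between -1 and 1, with the sign giving the direction of the linear relationship.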

The scatterplot for the two variables seems to show a positive linear correlation:

[Figure: scatterplot of alcohol consumption against suicide rate]

The correlation coefficient is 0.35, indicating a weak positive linear correlation.

The r² value is 0.12, indicating that only 12% of the variability in the suicide rate can be explained by its linear relationship with alcohol consumption.

The p-value is 7.31e-07, or 0.000000731, indicating that the correlation is statistically significant (p < 0.001).
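As a rough sketch of how these three numbers relate, the snippet below computes r, r² and the p-value in one go. The data is synthetic and generated only for illustration (the variable names, slope and noise level are my assumptions), so the output will not match the Gapminder figures.

```python
import numpy as np
from scipy import stats

# Synthetic data for illustration only; these are NOT the Gapminder values
rng = np.random.default_rng(seed=0)
alcohol = rng.uniform(0, 20, size=200)                     # hypothetical litres per adult
suicide = 5 + 0.3 * alcohol + rng.normal(0, 5, size=200)   # hypothetical rate per 100,000

# pearsonr returns the correlation coefficient and the two-sided p-value
r, p_value = stats.pearsonr(alcohol, suicide)

# r squared is the share of variance accounted for by a straight-line fit
r_squared = r ** 2

print(f"r = {r:.2f}, r^2 = {r_squared:.2f}, p = {p_value:.3g}")
```

With the values reported here, 0.35 squared is 0.1225, which rounds to the 0.12 quoted for r².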

This suggests that higher alcohol consumption in a country is associated with a higher recorded suicide rate, although the relationship is not very strong.

From the correlation alone, it's impossible to say whether the relationship is causal (high consumption of alcohol leads to an increase in suicide) or whether other factors are involved (for example, a high rate of depression could cause both alcoholism and suicide).
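To illustrate the second possibility, here is a small simulation, assuming a made-up third variable that drives two others. Everything in it is synthetic and hypothetical. It only shows that two variables can be correlated without either one affecting the other, and it says nothing about what is actually happening in the Gapminder data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
n = 500

# Hypothetical confounder (for example, a depression index); purely synthetic
confounder = rng.normal(0, 1, size=n)

# Both variables depend on the confounder, but not on each other
x = confounder + rng.normal(0, 1, size=n)
y = confounder + rng.normal(0, 1, size=n)

r_xy, p_xy = stats.pearsonr(x, y)

# Remove the confounder's linear contribution from each variable, then
# correlate what is left over (a partial correlation)
x_resid = x - np.polyval(np.polyfit(confounder, x, 1), confounder)
y_resid = y - np.polyval(np.polyfit(confounder, y, 1), confounder)
r_partial, p_partial = stats.pearsonr(x_resid, y_resid)

print(f"raw r = {r_xy:.2f}, partial r = {r_partial:.2f}")
```

The raw correlation comes out clearly positive, while the correlation of the residuals should sit close to zero once the shared driver is accounted for.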

Python program

# import data analysis packages
import pandas
import numpy
import seaborn
import scipy.stats  # import the stats submodule explicitly for pearsonr
import matplotlib.pyplot as plt

# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%f' % x)

# import the entire data set to memory
data = pandas.read_csv('mynewgapminder.csv', low_memory=False)

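# ensure that variables are numeric (convert_objects has been removed from pandas;
# to_numeric with errors='coerce' turns unparseable values into NaN instead)
data['suicideper100th'] = pandas.to_numeric(data['suicideper100th'], errors='coerce')
data['alcconsumption'] = pandas.to_numeric(data['alcconsumption'], errors='coerce')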

# plot alcohol consumption and suicide rate as a scatterplot
scat1 = seaborn.regplot(x="alcconsumption", y="suicideper100th", data=data)
plt.xlabel('Alcohol consumption per adult in litres')
plt.ylabel('Suicide rate per 100,000 population')
plt.title('Scatterplot for the Association Between Alcohol Consumption and Suicide Rate')

# clean NAs (dropna() removes every row with a missing value in any column)
data_clean = data.dropna()

# get the correlation coefficient and p-value
print('association between alcohol consumption and suicide rate')
print(scipy.stats.pearsonr(data_clean['alcconsumption'], data_clean['suicideper100th']))
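One thing worth checking is how many rows dropna() removes. The sketch below uses a tiny made-up table (the third column, othercolumn, is hypothetical) to compare dropping on all columns with dropping only on the two columns used in the correlation. Whether this makes any difference for mynewgapminder.csv depends on which columns that file holds.

```python
import numpy as np
import pandas as pd

# Tiny made-up table; 'othercolumn' is a hypothetical extra variable
df = pd.DataFrame({
    "alcconsumption": [1.0, 5.0, 9.0, 12.0, np.nan],
    "suicideper100th": [4.0, 8.0, 7.0, 15.0, 10.0],
    "othercolumn": [np.nan, 2.0, np.nan, 3.0, 1.0],
})

# dropna() with no arguments removes every row that has a gap anywhere
all_columns = df.dropna()

# subset= only checks the two columns used in the correlation
two_columns = df.dropna(subset=["alcconsumption", "suicideper100th"])

print(len(all_columns), len(two_columns))  # prints: 2 4
```

If the file contains other variables with gaps, the subset form would keep more countries in the calculation, which could change r slightly.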
