Creating Graphs for Data

This is part of my coursework for Data Management and Visualization.

I am using Python to analyse the data available from Gapminder. Following on from last week’s assignment, I had to create graphs to display the variables and the relationships between them.

Python program

#import data analysis package
import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt

# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%f' % x)

#import the entire data set to memory
data = pandas.read_csv('mynewgapminder.csv', low_memory=False)

#ensure that variables are numeric
#(pandas.to_numeric replaces convert_objects, which has been removed from pandas;
# errors='coerce' turns values that cannot be parsed into NaN)
data['suicideper100th'] = pandas.to_numeric(data['suicideper100th'], errors='coerce')
data['alcconsumption'] = pandas.to_numeric(data['alcconsumption'], errors='coerce')
data['polityscore'] = pandas.to_numeric(data['polityscore'], errors='coerce')

#I only want to look at countries where stats exist for suicide rate
#get subset where suicide rates exist
sub1 = data[data['suicideper100th'] > 0]
sub2 = sub1.copy()

#first look at suicide rates
print("Suicide rate per 100,000 population, age adjusted")
print('-------------------------')
#print a description, including count, mean, standard deviation, min, max, and percentiles
desc1 = sub2['suicideper100th'].describe()
print(desc1)
print()

#Univariate histogram for suicide rate
#give the figure a unique number so it will display as a separate graph
plt.figure(101)
#plot suicide rate as a histogram
#(histplot replaces the deprecated distplot(..., kde=False); needs seaborn 0.11 or later,
# and its default bins may differ slightly from those in the graph shown below)
seaborn.histplot(sub2['suicideper100th'].dropna(), kde=False)
plt.xlabel('Suicides per 100,000 population')
plt.ylabel('Number of countries')
plt.title('Histogram for suicide rate')

#next look at alcohol consumption
print("Alcohol consumption per adult (age 15+), in litres")
print('-------------------------')
#print a description, including count, mean, standard deviation, min, max, and percentiles
desc2 = sub2['alcconsumption'].describe()
print(desc2)
print()

#Univariate histogram for alcohol consumption
#give the figure a unique number so it will display as a separate graph
plt.figure(102)
#plot alcohol consumption as a histogram
#(histplot replaces the deprecated distplot(..., kde=False); needs seaborn 0.11 or later)
seaborn.histplot(sub2['alcconsumption'].dropna(), kde=False)
plt.xlabel('Alcohol consumption per adult in litres')
plt.ylabel('Number of countries')
plt.title('Alcohol consumption')

#next look at polity score
print("Polity (democracy) score")
print('-------------------------')
#print a description, including count, mean, standard deviation, min, max, and percentiles
desc3 = sub2['polityscore'].describe()
print(desc3)
print()

#Univariate histogram for polity score
#give the figure a unique number so it will display as a separate graph
plt.figure(103)
#plot polity score as a histogram
#(histplot replaces the deprecated distplot(..., kde=False); needs seaborn 0.11 or later)
seaborn.histplot(sub2['polityscore'].dropna(), kde=False)
plt.xlabel('Polity score')
plt.ylabel('Number of countries')
plt.title('Polity score')

#Now look at relationship between suicide and alcohol
#basic scatterplot: Q->Q
#give the figure a unique number so it will display as a separate graph
plt.figure(104)
#plot alcohol consumption and suicide rate as a scatterplot
scat1 = seaborn.regplot(x="alcconsumption", y="suicideper100th", data=data)
plt.xlabel('Alcohol consumption per adult in litres')
plt.ylabel('Suicide rate per 100,000 population')
plt.title('Scatterplot for the Association Between Alcohol Consumption and Suicide Rate')

# quartile split (use qcut function & ask for 4 groups - gives you quartile split)
print('Alcohol rate - 4 categories - quartiles')
data['AlcoholGRP'] = pandas.qcut(data.alcconsumption, 4, labels=["25th%tile", "50%tile", "75%tile", "100%tile"])
c10 = data['AlcoholGRP'].value_counts(sort=False, dropna=True)
print(c10)
print()

# bivariate bar graph C->Q
#give the figure a unique number so it will display as a separate graph
plt.figure(205)
#plot alcohol consumption as quartiles and display against suicide rate
#(barplot draws on the figure created above; the old factorplot has been removed from seaborn
# and, being figure-level, opened its own figure and left this one blank.
# errorbar=None needs seaborn 0.12 or later; older versions use ci=None)
seaborn.barplot(x='AlcoholGRP', y='suicideper100th', data=data, errorbar=None)
plt.xlabel('Alcohol consumption per adult in litres')
plt.ylabel('Suicide rate per 100,000 population')
plt.title('Bar Chart for the Association Between Alcohol Consumption and Suicide Rate')

#display all figures when the program is run as a script
plt.show()

Suicide rate

Description of the suicideper100th variable:

Suicide rate per 100,000 population, age adjusted
-------------------------
count   191.000000
mean      9.640839
std       6.300178
min       0.201449
25%       4.988449
50%       8.262893
75%      12.328551
max      35.752872
Name: suicideper100th, dtype: float64

The univariate graph for suicide rate:

UnivariateSuicide

This graph is unimodal, with its highest peak at around 7.5 to 10 suicides per 100,000 of the population, which is close to the mean of 9.64. It is skewed to the right (positively skewed): most countries have relatively low suicide rates, and a long tail extends towards the few countries with high rates (the maximum is 35.75, against a median of 8.26).
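The direction of skew can also be checked numerically rather than by eye. The following is a minimal sketch, not part of the original program: the helper name describe_skew and the toy values are made up for illustration and are not the Gapminder data. A mean above the median together with a positive sample skewness indicates a long tail towards high values.

```python
import pandas

def describe_skew(values):
    """Return (mean, median, sample skewness) of a numeric Series, ignoring NaN."""
    clean = values.dropna()
    return clean.mean(), clean.median(), clean.skew()

# Made-up illustrative values (NOT the Gapminder data):
# most values are small and one is much larger, giving a long right tail
toy = pandas.Series([2.0, 3.0, 4.0, 5.0, 6.0, 8.0, 12.0, 30.0])

mean, median, skewness = describe_skew(toy)
print('mean:', mean)
print('median:', median)
print('skewness:', skewness)
```

Applied to the real data, the call would presumably be describe_skew(sub2['suicideper100th']).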

Alcohol consumption

Description of the alcconsumption variable:

Alcohol consumption per adult (age 15+), in litres
-------------------------
count   185.000000
mean      6.689730
std       4.908841
min       0.030000
25%       2.560000
50%       5.920000
75%       9.860000
max      23.010000
Name: alcconsumption, dtype: float64

The univariate graph for alcohol consumption:

UnivariateAlcohol

This graph is bimodal, with one peak at 0 to 2.5 litres consumed per adult, and a second peak at 7.5 to 10 litres. The mean is 6.69 litres. It is skewed to the right (positively skewed): most countries have a fairly low alcohol intake, and a tail extends towards the few countries with very high consumption (the maximum is 23.01 litres).

Polity score

Description of the polityscore variable:

Polity (democracy) score
-------------------------
count   159.000000
mean      3.616352
std       6.320350
min     -10.000000
25%      -2.000000
50%       6.000000
75%       9.000000
max      10.000000
Name: polityscore, dtype: float64

The univariate graph for polity score:

UnivariatePolity

This graph is unimodal, with a peak at 10. It is highly skewed to the left (negatively skewed): most countries have high polity scores (democracies), and a tail extends towards the lower scores (autocracies). This is also why the mean of 3.62 sits well below the median of 6.

Relationship between suicide rate and alcohol intake

My initial hypothesis was that high alcohol consumption would be correlated with a high suicide rate. In this case, alcohol consumption is the explanatory variable that goes on the X axis, while the suicide rate is the response variable that goes on the Y axis.

Scatter plot for alcohol consumption and suicide:

Scatterplot_alcohol_suicide

This scatterplot shows a weak positive relationship between alcohol consumption and suicide rate. In other words, countries that consume high quantities of alcohol show a slight tendency to have a high suicide rate.
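The strength of a relationship like this can be put into a number with the Pearson correlation coefficient, which runs from -1 to 1. The sketch below is an illustration only: the helper name pearson_r and the toy values are assumptions and are not taken from the Gapminder data, so the result says nothing about the real correlation. It uses the pandas Series.corr method, whose default is the Pearson coefficient.

```python
import pandas

def pearson_r(frame, x_col, y_col):
    """Pearson correlation between two columns, using only rows where both exist."""
    pairs = frame[[x_col, y_col]].dropna()
    return pairs[x_col].corr(pairs[y_col])

# Made-up illustrative values (NOT the Gapminder data);
# the last row has a missing x value and is dropped before the calculation
toy = pandas.DataFrame({
    'alcconsumption': [1.0, 2.0, 3.0, 4.0, 5.0, None],
    'suicideper100th': [5.0, 3.0, 6.0, 4.0, 7.0, 9.0],
})

r = pearson_r(toy, 'alcconsumption', 'suicideper100th')
print('Pearson r:', r)
```

With the real data, the call would presumably be pearson_r(data, 'alcconsumption', 'suicideper100th'). A value close to 0 would support calling the relationship weak.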

Bar chart for alcohol consumption (in quartiles) and suicide:

Barchart_alcohol_suicide

The bar chart shows the relationship in further detail. In the lower three quartiles, there is little relationship between alcohol consumption and suicide rate. However, there is a definite trend for countries in the highest quartile for alcohol consumption to have a high rate of suicide.
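The heights of the bars are simply the mean suicide rate within each alcohol quartile, so they can also be printed as a table. This is a rough sketch under stated assumptions: the helper name mean_by_quartile, the labels Q1 to Q4 and the toy values are all hypothetical and are not the Gapminder data.

```python
import pandas

def mean_by_quartile(frame, group_col, value_col):
    """Split group_col into quartiles and return the mean of value_col in each."""
    binned = frame[[group_col, value_col]].dropna().copy()
    # qcut with 4 groups gives a quartile split; the labels are hypothetical names
    binned['quartile'] = pandas.qcut(binned[group_col], 4,
                                     labels=['Q1', 'Q2', 'Q3', 'Q4'])
    return binned.groupby('quartile', observed=False)[value_col].mean()

# Made-up illustrative values (NOT the Gapminder data):
# the response is flat in the lower three quartiles and higher in the top one
toy = pandas.DataFrame({
    'alcconsumption': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    'suicideper100th': [4.0, 6.0, 5.0, 5.0, 6.0, 4.0, 12.0, 14.0],
})

means = mean_by_quartile(toy, 'alcconsumption', 'suicideper100th')
print(means)
```

Comparing the printed means side by side makes it easier to judge whether only the top quartile stands out, as the bar chart suggests.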

Further research is required before any definite conclusions can be drawn.
