Creating Graphs for Data

This is part of my coursework for Data Management and Visualization.

I am using Python to analyse the data available from Gapminder. Following on from last week’s assignment, I had to create graphs to display the variables and the relationships between them.

Python program

#import data analysis packages
import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt

# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%f' % x)

#import the entire data set to memory
data = pandas.read_csv('mynewgapminder.csv', low_memory=False)

#ensure that variables are numeric
#(convert_objects was removed from pandas; to_numeric with errors='coerce' turns non-numeric entries into NaN)
data['suicideper100th'] = pandas.to_numeric(data['suicideper100th'], errors='coerce')
data['alcconsumption'] = pandas.to_numeric(data['alcconsumption'], errors='coerce')
data['polityscore'] = pandas.to_numeric(data['polityscore'], errors='coerce')

#I only want to look at countries where stats exist for suicide rate
#get subset where suicide rates exist
sub1 = data[data['suicideper100th'] > 0]
sub2 = sub1.copy()

#first look at suicide rates
print("Suicide rate per 100,000 population, age adjusted")
print('-------------------------')
#print a description, including count, mean, standard deviation, min, max, and percentiles
desc1 = sub2['suicideper100th'].describe()
print(desc1)
print()

#Univariate histogram for suicide rate
#give the figure a unique number so it will display as a separate graph
plt.figure(101)
#plot suicide rate as a histogram
#(distplot is deprecated in recent seaborn releases; seaborn.histplot is its replacement)
seaborn.distplot(sub2['suicideper100th'].dropna(), kde=False)
plt.xlabel('Suicides per 100,000 population')
plt.ylabel('Number of countries')
plt.title('Histogram for suicide rate')

#next look at alcohol consumption
print("Alcohol consumption per adult (age 15+), in litres")
print('-------------------------')
#print a description, including count, mean, standard deviation, min, max, and percentiles
desc2 = sub2['alcconsumption'].describe()
print(desc2)
print()

#Univariate histogram for alcohol consumption
#give the figure a unique number so it will display as a separate graph
plt.figure(102)
#plot alcohol consumption as a histogram
seaborn.distplot(sub2['alcconsumption'].dropna(), kde=False)
plt.xlabel('Alcohol consumption per adult in litres')
plt.ylabel('Number of countries')
plt.title('Alcohol consumption')

#next look at polity score
print("Polity (democracy) score")
print('-------------------------')
#print a description, including count, mean, standard deviation, min, max, and percentiles
desc3 = sub2['polityscore'].describe()
print(desc3)
print()
#Univariate histogram for polity score
#give the figure a unique number so it will display as a separate graph
plt.figure(103)
#plot polity score as a histogram
seaborn.distplot(sub2['polityscore'].dropna(), kde=False)
plt.xlabel('Polity score')
plt.ylabel('Number of countries')
plt.title('Polity score')

#Now look at relationship between suicide and alcohol
#basic scatterplot: Q->Q
#give the figure a unique number so it will display as a separate graph
plt.figure(104)
#plot alcohol consumption and suicide rate as a scatterplot with a fitted regression line
scat1 = seaborn.regplot(x="alcconsumption", y="suicideper100th", data=data)
plt.xlabel('Alcohol consumption per adult in litres')
plt.ylabel('Suicide rate per 100,000 population')
plt.title('Scatterplot for the Association Between Alcohol Consumption and Suicide Rate')

# quartile split (use qcut function & ask for 4 groups - gives you quartile split)
print('Alcohol rate - 4 categories - quartiles')
data['AlcoholGRP'] = pandas.qcut(data.alcconsumption, 4, labels=["25th%tile", "50%tile", "75%tile", "100%tile"])
c10 = data['AlcoholGRP'].value_counts(sort=False, dropna=True)
print(c10)
print()

# bivariate bar graph C->Q
#give the figure a unique number so it will display as a separate graph
plt.figure(205)
#plot alcohol consumption as quartiles and display against suicide rate
#(factorplot was renamed catplot in seaborn 0.9 and has since been removed;
#barplot draws the same chart of group means into the current figure.
#On seaborn versions before 0.12, use ci=None instead of errorbar=None.)
seaborn.barplot(x='AlcoholGRP', y='suicideper100th', data=data, errorbar=None)
plt.xlabel('Alcohol consumption per adult in litres')
plt.ylabel('Suicide rate per 100,000 population')
plt.title('Bar Chart for the Association Between Alcohol Consumption and Suicide Rate')

Suicide rate

Description of the suicideper100th variable:

Suicide rate per 100,000 population, age adjusted
-------------------------
count 191.000000
mean 9.640839
std 6.300178
min 0.201449
25% 4.988449
50% 8.262893
75% 12.328551
max 35.752872
Name: suicideper100th, dtype: float64

The univariate graph for suicide rate:

UnivariateSuicide

This graph is unimodal, with its highest peak at around 7.5 to 10 suicides per 100,000 of the population, which is close to the mean of 9.64. It is skewed to the right (positively skewed): most countries have low suicide rates, with a long tail of countries with higher rates.

Alcohol consumption

Description of the alcconsumption variable:

Alcohol consumption per adult (age 15+), in litres
-------------------------
count 185.000000
mean 6.689730
std 4.908841
min 0.030000
25% 2.560000
50% 5.920000
75% 9.860000
max 23.010000
Name: alcconsumption, dtype: float64

The univariate graph for alcohol consumption:

UnivariateAlcohol

This graph is bimodal, with one peak at 0 to 2.5 litres consumed per adult, and a second peak at 7.5 to 10 litres. The mean is 6.69 litres. It is skewed to the right, as there are more countries with low alcohol intake than with high intake.

Polity score

Description of the polityscore variable:

Polity (democracy) score
-------------------------
count 159.000000
mean 3.616352
std 6.320350
min -10.000000
25% -2.000000
50% 6.000000
75% 9.000000
max 10.000000
Name: polityscore, dtype: float64

The univariate graph for polity score:

UnivariatePolity

This graph is unimodal, with a peak at 10. It is highly skewed to the left (negatively skewed): most countries have high polity scores (democracies), with a long tail of lower scores (autocracies).
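A quick way to sanity-check the direction of skew is to compare the mean with the median (the 50% value) from each describe() output: the mean tends to be pulled towards the longer tail. The sketch below applies that rule of thumb to the three variables. The function name `skew_direction` is my own, and the rule is only a heuristic, not a formal skewness statistic.

```python
# (mean, median) pairs copied from the describe() output for each variable
summary = {
    "suicideper100th": (9.640839, 8.262893),
    "alcconsumption": (6.689730, 5.920000),
    "polityscore": (3.616352, 6.000000),
}


def skew_direction(mean, median):
    """Rule of thumb: the mean is pulled towards the longer tail."""
    if mean > median:
        return "right (positive) skew"
    if mean < median:
        return "left (negative) skew"
    return "roughly symmetric"


directions = {
    name: skew_direction(mean, median)
    for name, (mean, median) in summary.items()
}

for name, direction in directions.items():
    print(name, "->", direction)
```

For a stricter check on the real data, pandas also provides a `skew()` method on a Series, which returns a positive value for right skew and a negative value for left skew.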

Relationship between suicide rate and alcohol intake

My initial hypothesis was that high alcohol consumption would be correlated with a high suicide rate. In this case, alcohol consumption is the explanatory variable that goes on the X axis, while the suicide rate is the response variable that goes on the Y axis.

Scatter plot for alcohol consumption and suicide:

Scatterplot_alcohol_suicide

This scatterplot shows a weak positive relationship between alcohol consumption and suicide rate. In other words, countries that consume high quantities of alcohol show a slight tendency to have a high suicide rate.

Bar chart for alcohol consumption (in quartiles) and suicide:

Barchart_alcohol_suicide

The bar chart shows the relationship in further detail. In the lower three quartiles, there is little relationship between alcohol consumption and suicide rate. However, there is a definite trend for countries in the highest quartile for alcohol consumption to have a high rate of suicide.

Further research is required before any definite conclusions can be drawn.


Making Data Management Decisions

This is part of my coursework for Data Management and Visualization.

I am using Python to analyse the data available from Gapminder. Following on from last week’s assignment, I had to decide how I wanted to manage the variables suicideper100th, alcconsumption, polityscore, and region.

Here is the Python program:

#import data analysis packages
import pandas
import numpy

# bug fix for display formats to avoid run time errors
pandas.set_option('display.float_format', lambda x: '%f' % x)

#import the entire data set to memory
data = pandas.read_csv('mynewgapminder.csv', low_memory=False)

#ensure that variables are numeric
#(convert_objects was removed from pandas; to_numeric with errors='coerce' turns non-numeric entries into NaN)
data['suicideper100th'] = pandas.to_numeric(data['suicideper100th'], errors='coerce')
data['alcconsumption'] = pandas.to_numeric(data['alcconsumption'], errors='coerce')
data['polityscore'] = pandas.to_numeric(data['polityscore'], errors='coerce')
data['region'] = pandas.to_numeric(data['region'], errors='coerce')

#I only want to look at countries where stats exist for suicide rate
#get subset where suicide rates exist
sub1 = data[data['suicideper100th'] > 0]
sub2 = sub1.copy()

#Number of observations (rows)
print("Number of countries:")
print(len(sub2))

#add spacing
print()

#first look at suicide rates
print("Suicide rate per 100,000 population, age adjusted")
print('-------------------------')
#find lowest value
minSuicide = sub2.suicideper100th.min()
print('Minimum:', minSuicide)
#find highest value
maxSuicide = sub2.suicideper100th.max()
print('Maximum:', maxSuicide)
#find median (no NaN values in this column, so numpy.median is safe here)
medianSuicide = numpy.median(sub2.suicideper100th)
print('Average (median):', medianSuicide)
print()
# categorize suicide rate based on customized splits using cut function
# put the suicide rate into "bins" for values between 0 and 40
print('Frequency table for suicide rate')
sub2['suicideSplit'] = pandas.cut(sub2.suicideper100th, [0, 5, 10, 15, 20, 25, 30, 35, 40])
c1 = sub2['suicideSplit'].value_counts(sort=False, dropna=True)
print(c1)
print()
#percentage distribution for suicide rate
print("Percentage table for suicide rate")
p1 = sub2['suicideSplit'].value_counts(sort=False)*100/len(sub2)
print(p1)
print()

#next look at alcohol consumption
print("Alcohol consumption per adult (age 15+), in litres")
print('-------------------------')
#find lowest value (the pandas min/max methods skip NaN values)
minAlcohol = sub2.alcconsumption.min()
print('Minimum:', minAlcohol)
#find highest value
maxAlcohol = sub2.alcconsumption.max()
print('Maximum:', maxAlcohol)
#find median; this column has NaN values, which numpy.median does not skip,
#so use numpy.nanmedian instead
medianAlcohol = numpy.nanmedian(sub2.alcconsumption)
print('Average (median):', medianAlcohol)
print()
# categorize alcohol consumption rate based on customized splits using cut function
#put the alcohol consumption rate into "bins" for values between 0 and 25
#also show blank (NaN) values
print('Frequency table for alcohol consumption')
sub2['alcoholSplit'] = pandas.cut(sub2.alcconsumption, [0, 2.5, 5, 7.5, 10, 12.5, 15, 17.5, 20, 22.5, 25])
c2 = sub2['alcoholSplit'].value_counts(sort=False, dropna=False)
print(c2)
print()
#percentage distribution for alcohol consumption rate
print("Percentage table for alcohol consumption")
p2 = sub2['alcoholSplit'].value_counts(sort=False, dropna=False)*100/len(sub2)
print(p2)
print()

#next look at polity score
print("Polity (democracy) score")
print('-------------------------')
#describe categories from the Polity IV Project
print("-10 to -6 = Autocracy, -5 to 0 = Closed Anocracy, 1 to 5 = Open Anocracy, 6 to 9 = Democracy, 10 = Full Democracy")
print()
#find lowest value (the pandas min/max methods skip NaN values)
minPolity = sub2.polityscore.min()
print('Minimum:', minPolity)
#find highest value
maxPolity = sub2.polityscore.max()
print('Maximum:', maxPolity)
#find median; this column has NaN values, which numpy.median does not skip,
#so use numpy.nanmedian instead
medianPolity = numpy.nanmedian(sub2.polityscore)
print('Average (median):', medianPolity)
print()
#display counts and percentages for polity score categories
#also show blank (NaN) values
print('Frequency table for polity score')
sub2['politySplit'] = pandas.cut(sub2.polityscore, [-11, -6, 0, 5, 9, 10])
c3 = sub2['politySplit'].value_counts(sort=False, dropna=False)
print(c3)
print()
#percentage distribution for polity score
print("Percentage table for polity score")
p3 = sub2['politySplit'].value_counts(sort=False, dropna=False)*100/len(sub2)
print(p3)
print()

#next look at regions
print("Geographical regions")
print('-------------------------')
#describe categories
print("Asia=1, Europe=2, Africa=3, Middle East=4, North and Central America=5,South America=6, Oceania=7")
print()
print('Frequency table for region')
#sort_index keeps the regions in code order (1 to 7)
c4 = sub2['region'].value_counts(sort=False, dropna=False).sort_index()
print(c4)
print()
print("Percentage table for region")
p4 = sub2['region'].value_counts(sort=False).sort_index()*100/len(sub2)
print(p4)
print()

#Finally, look at suicide rates only in Asia
print("Suicide rate per 100,000 population, age adjusted (Asia only)")
print('-------------------------')
# get subset of values for Asia
sub3 = sub2[sub2['region'] == 1]
AsiaStats = sub3.copy()
#find lowest value
minSuicide = AsiaStats.suicideper100th.min()
print('Minimum:', minSuicide)
#find highest value
maxSuicide = AsiaStats.suicideper100th.max()
print('Maximum:', maxSuicide)
#find median
medianSuicide = numpy.median(AsiaStats.suicideper100th)
print('Average (median):', medianSuicide)
print()
# categorize suicide rate based on customized splits using cut function
# put the suicide rate into "bins" for values between 0 and 40
print('Frequency table for suicide rate (Asia only)')
AsiaStats['suicideSplit'] = pandas.cut(AsiaStats.suicideper100th, [0, 5, 10, 15, 20, 25, 30, 35, 40])
c5 = AsiaStats['suicideSplit'].value_counts(sort=False, dropna=True)
print(c5)
print()
#percentage distribution for suicide rate
print("Percentage table for suicide rate (Asia only)")
p5 = AsiaStats['suicideSplit'].value_counts(sort=False)*100/len(AsiaStats)
print(p5)

When this program is run, it prints content to the IPython console, as described in the following paragraphs.

First, I got a subset of the Gapminder data set, including only those countries that had a value for the suicide rate:

Number of countries:
191

The first variable I looked at was the suicide rate:

Suicide rate per 100,000 population, age adjusted
-------------------------
Minimum: 0.2014487237
Maximum: 35.752872467
Average (median): 8.2628927231

Frequency table for suicide rate
(0, 5] 49
(5, 10] 65
(10, 15] 51
(15, 20] 12
(20, 25] 6
(25, 30] 6
(30, 35] 1
(35, 40] 1
dtype: int64

Percentage table for suicide rate
(0, 5] 25.654450
(5, 10] 34.031414
(10, 15] 26.701571
(15, 20] 6.282723
(20, 25] 3.141361
(25, 30] 3.141361
(30, 35] 0.523560
(35, 40] 0.523560
dtype: float64

The lowest reported suicide rate is 0.20 (to 2 decimal places) and the highest is 35.75. The median value is 8.26.

Because the suicide rate is a continuous variable (a typical value is something like “4.2170763016”), it is useful to put these values into bins with a width of 5 before running a frequency distribution. As we can see, the most common range is between 5 and 10 reported suicides per 100,000 of the population: 65 countries, or 34.03% of the total, fall into it. Only two countries have more than 30 reported suicides per 100,000 of the population. There are no “not a number” (NaN) values, because I have already stripped out any countries without a recorded suicide rate.
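To make the bin labels in the frequency table easier to read: each label such as (5, 10] is an interval that excludes its lower edge and includes its upper edge. The following is a plain-Python sketch of that rule, not the pandas implementation. The helper name `bin_label` is my own, and the sample values are the minimum, typical value, median, and maximum quoted in this post.

```python
from bisect import bisect_left

# Bin edges used for the suicide rate; each bin is (low, high], closed on the right
edges = [0, 5, 10, 15, 20, 25, 30, 35, 40]


def bin_label(value, edges):
    """Return the right-closed interval containing value, or None if out of range."""
    i = bisect_left(edges, value) - 1
    if i < 0 or i >= len(edges) - 1:
        return None
    return "({}, {}]".format(edges[i], edges[i + 1])


# Values quoted in this post: minimum, a typical value, median, maximum
rates = [0.2014487237, 4.2170763016, 8.2628927231, 35.752872467]
labels = [bin_label(rate, edges) for rate in rates]

for rate, label in zip(rates, labels):
    print(rate, "->", label)
```

Under this rule, a value of exactly 5 lands in (0, 5] rather than (5, 10], and a value of exactly 0 falls outside every bin.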

I next looked at the alcohol consumption variable:

Alcohol consumption per adult (age 15+), in litres
-------------------------
Minimum: 0.03
Maximum: 23.01
Average (median): 6.16

Frequency table for alcohol consumption
(0, 2.5] 45
(2.5, 5] 35
(5, 7.5] 31
(7.5, 10] 30
(10, 12.5] 20
(12.5, 15] 13
(15, 17.5] 8
(17.5, 20] 2
(20, 22.5] 0
(22.5, 25] 1
NaN 6
dtype: int64

Percentage table for alcohol consumption
(0, 2.5] 23.560209
(2.5, 5] 18.324607
(5, 7.5] 16.230366
(7.5, 10] 15.706806
(10, 12.5] 10.471204
(12.5, 15] 6.806283
(15, 17.5] 4.188482
(17.5, 20] 1.047120
(20, 22.5] 0.000000
(22.5, 25] 0.523560
NaN 3.141361
dtype: float64

The lowest alcohol consumption rate is 0.03 litres per adult, and the highest is 23.01. The median printed above is 6.16 litres, but that figure was calculated without excluding the 6 missing values; with missing values skipped, the median is 5.92 litres (the 50% value that describe() reports for alcconsumption).

Again, the alcohol consumption figure is a continuous variable, so I put these values into bins with a width of 2.5 litres. As we can see, the most common range is between 0 and 2.5 litres. In 45 countries, or 23.56% of the total, each person over the age of 15 drank an average of between 0 and 2.5 litres of pure alcohol. Only one country has an alcohol consumption rate of over 20 litres. No data is available for 6 countries (3.14%).
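The median printed for alcohol consumption (6.16) is higher than the 50% value of 5.92 that describe() reports for the same variable in the graphing post. One plausible explanation is that the 6 missing values were kept in the calculation and ended up after the real values, which shifts the middle position upwards. The sketch below illustrates the effect with made-up numbers; the values are hypothetical, and I am assuming this is what happened rather than having confirmed it.

```python
import math
import statistics

# Hypothetical consumption values; NaN marks a country with no data
values = [2.5, 9.0, float("nan"), 0.5, 6.0, float("nan"), 4.0]

# Median over the countries that actually have a value
present = sorted(v for v in values if not math.isnan(v))
median_present = statistics.median(present)

# If the missing values are kept and placed after the real ones,
# the "middle" position moves towards the larger values
padded = present + [float("nan")] * (len(values) - len(present))
median_padded = padded[len(padded) // 2]

print("median of present values:", median_present)
print("middle element with NaN kept at the end:", median_padded)
```

In pandas, the Series `median()` method skips missing values by default, and numpy provides `nanmedian` for the same purpose.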

I next created a frequency table for the polity (democracy) score:

Polity (democracy) score
-------------------------
-10 to -6 = Autocracy, -5 to 0 = Closed Anocracy, 1 to 5 = Open Anocracy, 6 to 9 = Democracy, 10 = Full Democracy

Minimum: -10.0
Maximum: 10.0
Average (median): 8.0

Frequency table for polity score
(-11, -6] 23
(-6, 0] 29
(0, 5] 19
(5, 9] 56
(9, 10] 32
NaN 32
dtype: int64

Percentage table for polity score
(-11, -6] 12.041885
(-6, 0] 15.183246
(0, 5] 9.947644
(5, 9] 29.319372
(9, 10] 16.753927
NaN 16.753927
dtype: float64

The polity score is based on the 2009 Polity IV Project, which is sponsored by the Political Instability Task Force (PITF). PITF defines countries as democracies, anocracies, or autocracies depending on this score (see http://www.systemicpeace.org/polity/polity4x.htm). I thought it useful to bin the polity scores to correspond with the PITF categories.

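We can see that 32 countries, or 16.75% of the total, score as Full Democracies. The most common category is Democracy, with 56 countries or 29.32% of the total. 19 countries (9.95%) scored as Open Anocracies, 29 (15.18%) as Closed Anocracies, and 23 (12.04%) as Autocracies. No data was available for 32 countries (16.75%). Because of these missing values, the median of 8.0 printed above is overstated; with missing values skipped, the median is 6.0 (the 50% value that describe() reports for polityscore).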
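As a cross-check on the binned counts, the category totals can be rebuilt from the per-score counts listed in the "Running my first Python program" post. This is a small sketch in plain Python; the function name `polity_category` is my own, and the category boundaries follow the description printed by the program, with Democracy taken as 6 to 9 so that it does not overlap Full Democracy.

```python
# Counts per polity score, copied from the frequency table for the
# subset of countries with a recorded suicide rate
score_counts = {
    -10: 2, -9: 4, -8: 2, -7: 12, -6: 3, -5: 2, -4: 6, -3: 6, -2: 5, -1: 4,
    0: 6, 1: 3, 2: 3, 3: 2, 4: 4, 5: 7, 6: 10, 7: 13, 8: 19, 9: 14, 10: 32,
}


def polity_category(score):
    """Map a polity score (-10 to 10) to its category name."""
    if score <= -6:
        return "Autocracy"
    if score <= 0:
        return "Closed Anocracy"
    if score <= 5:
        return "Open Anocracy"
    if score <= 9:
        return "Democracy"
    return "Full Democracy"


category_counts = {}
for score, count in score_counts.items():
    category = polity_category(score)
    category_counts[category] = category_counts.get(category, 0) + count

for category, count in category_counts.items():
    print(category, count)
```

The totals agree with the frequency table above: 23, 29, 19, 56, and 32 countries, which add up to the 159 countries that have a polity score.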

I next created a frequency table for region:

Geographical regions
-------------------------
Asia=1, Europe=2, Africa=3, Middle East=4, North and Central America=5,South America=6, Oceania=7

Frequency table for region
1 31
2 46
3 49
4 16
5 22
6 12
7 15
dtype: int64

Percentage table for region
1 16.230366
2 24.083770
3 25.654450
4 8.376963
5 11.518325
6 6.282723
7 7.853403
dtype: float64

Because these are “dummy” values, there’s no need to calculate medians or to put values into bins.

The “region” variable did not exist in the original Gapminder data set, but I created it as a potentially useful category for grouping countries. As an example of this, here is a frequency distribution of suicide rates for only Asian countries:

Suicide rate per 100,000 population, age adjusted (Asia only)
-------------------------
Minimum: 1.3809646368
Maximum: 28.1040458679
Average (median): 11.3961114883

Frequency table for suicide rate (Asia only)
(0, 5] 6
(5, 10] 8
(10, 15] 8
(15, 20] 4
(20, 25] 2
(25, 30] 3
(30, 35] 0
(35, 40] 0
dtype: int64

Percentage table for suicide rate (Asia only)
(0, 5] 19.354839
(5, 10] 25.806452
(10, 15] 25.806452
(15, 20] 12.903226
(20, 25] 6.451613
(25, 30] 9.677419
(30, 35] 0.000000
(35, 40] 0.000000
dtype: float64

We can see some differences between the Asian and the worldwide data. Asian countries have a median of 11.4 suicides per 100,000 of the population, much higher than the worldwide median of 8.26. The frequency distribution is also different; a lower percentage of Asian countries have a suicide rate under 10 per 100,000, while a much higher percentage of Asian countries are in the 15 to 30 categories. However, no Asian country has a suicide rate of over 30 per 100,000.
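The comparison can be made concrete by adding up the bin counts from the two frequency tables. This is a minimal sketch using only the counts shown above; the helper name `share` is my own.

```python
# Bin counts copied from the frequency tables, in bin order:
# (0, 5], (5, 10], (10, 15], (15, 20], (20, 25], (25, 30], (30, 35], (35, 40]
world = [49, 65, 51, 12, 6, 6, 1, 1]
asia = [6, 8, 8, 4, 2, 3, 0, 0]


def share(counts, start, stop):
    """Percentage of countries whose bin index is in the range start..stop-1."""
    return 100 * sum(counts[start:stop]) / sum(counts)


# Rates of 10 or under: the first two bins
under_10 = {"asia": share(asia, 0, 2), "world": share(world, 0, 2)}
# Rates from 15 to 30: the fourth, fifth, and sixth bins
from_15_to_30 = {"asia": share(asia, 3, 6), "world": share(world, 3, 6)}

print("10 or under:", {k: round(v, 1) for k, v in under_10.items()})
print("15 to 30:", {k: round(v, 1) for k, v in from_15_to_30.items()})
```

About 45.2% of Asian countries fall in the two lowest bins, compared with 59.7% worldwide, while 29.0% of Asian countries fall in the 15 to 30 bins, compared with 12.6% worldwide.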

A similar breakdown could be done for other regions or other variables.

Running my first Python program

This is part of my coursework for Data Management and Visualization.

I am using Python to analyse the data available from Gapminder.

Here is my Python program:

#import data analysis package
import pandas

#import the entire data set to memory
data = pandas.read_csv('mynewgapminder.csv', low_memory=False)

print("Number of countries:")
print(len(data)) # Number of observations (rows)
print("Number of variables:")
print(len(data.columns)) # Number of variables (columns)

#ensure that variables are numeric
#(convert_objects was removed from pandas; to_numeric with errors='coerce' turns non-numeric entries into NaN)
data['suicideper100th'] = pandas.to_numeric(data['suicideper100th'], errors='coerce')
data['polityscore'] = pandas.to_numeric(data['polityscore'], errors='coerce')
data['region'] = pandas.to_numeric(data['region'], errors='coerce')

#display counts and percentages for suicide rates
print("counts for suicide rate per 100,000 population, age adjusted")
ct1 = data.groupby('suicideper100th').size()
print(ct1)
print("percentages for suicide rate per 100,000 population, age adjusted")
pt1 = data.groupby('suicideper100th').size()*100/len(data)
print(pt1)

#I only want to look at countries where stats exist for suicide rate
#get subset where suicide rates exist
sub1 = data[data['suicideper100th'] > 0]
sub2 = sub1.copy()

#Total number of countries where suicide stats exist
print("Total number of countries where suicide stats exist")
print(len(sub2))

#display counts and percentages for polity score
#note: counts come from the subset, but percentages are relative to
#all countries in the data set (len(data)), not just the subset
print("counts for polity score")
ct2 = sub2.groupby('polityscore').size()
print(ct2)
print("percentages for polity score")
pt2 = sub2.groupby('polityscore').size()*100/len(data)
print(pt2)

#display counts and percentages for regions
print('counts for region')
print("Asia=1, Europe=2, Africa=3, Middle East=4, North and Central America=5,South America=6, Oceania=7")
ct3 = sub2.groupby('region').size()
print(ct3)
print("percentages for region")
pt3 = sub2.groupby('region').size()*100/len(data)
print(pt3)

When this program is run, it prints content to the IPython console, as described in the following paragraphs.

I first looked at the number of countries and variables:

Number of countries:
213
Number of variables:
17

The rows include all the countries from the Gapminder data set. The columns include the 16 variables from the Gapminder data set, plus an extra variable called “region”.

I then produced frequency tables for three variables. Because my research project is looking at factors that influence suicide, the first variable is suicideper100th:

counts for suicide rate per 100,000 population, age adjusted
suicideper100th
0.201449 1
0.523528 1
1.370002 1
1.380965 1
1.392951 1
1.498057 1
1.519248 1
1.574350 1
1.658908 1
1.799904 1
1.922485 1
2.034178 1
2.109414 1
2.161843 1
2.206169 1
2.234896 1
2.515721 1
2.648981 1
2.816705 1
3.108603 1
3.146814 1
3.374416 1
3.563325 1
3.576478 1
3.716739 1
3.741588 1
3.940259 1
4.079525 1
4.119620 1
4.217076 1
..
14.554677 1
14.680936 1
14.713020 1
14.776250 1
15.538490 1
15.542603 1
15.714571 1
15.953850 1
16.234370 1
16.913248 1
16.959240 1
17.032646 1
18.583826 1
18.946930 1
18.954570 1
19.422610 1
20.162010 1
20.317930 1
20.369590 1
20.747431 1
22.353479 1
22.404560 1
25.404600 1
26.219198 1
26.874690 1
27.874160 1
28.104046 1
29.864164 1
33.341860 1
35.752872 1
dtype: int64
percentages for suicide rate per 100,000 population, age adjusted
suicideper100th
0.201449 0.469484
0.523528 0.469484
1.370002 0.469484
1.380965 0.469484
1.392951 0.469484
1.498057 0.469484
1.519248 0.469484
1.574350 0.469484
1.658908 0.469484
1.799904 0.469484
1.922485 0.469484
2.034178 0.469484
2.109414 0.469484
2.161843 0.469484
2.206169 0.469484
2.234896 0.469484
2.515721 0.469484
2.648981 0.469484
2.816705 0.469484
3.108603 0.469484
3.146814 0.469484
3.374416 0.469484
3.563325 0.469484
3.576478 0.469484
3.716739 0.469484
3.741588 0.469484
3.940259 0.469484
4.079525 0.469484
4.119620 0.469484
4.217076 0.469484

14.554677 0.469484
14.680936 0.469484
14.713020 0.469484
14.776250 0.469484
15.538490 0.469484
15.542603 0.469484
15.714571 0.469484
15.953850 0.469484
16.234370 0.469484
16.913248 0.469484
16.959240 0.469484
17.032646 0.469484
18.583826 0.469484
18.946930 0.469484
18.954570 0.469484
19.422610 0.469484
20.162010 0.469484
20.317930 0.469484
20.369590 0.469484
20.747431 0.469484
22.353479 0.469484
22.404560 0.469484
25.404600 0.469484
26.219198 0.469484
26.874690 0.469484
27.874160 0.469484
28.104046 0.469484
29.864164 0.469484
33.341860 0.469484
35.752872 0.469484
dtype: float64

The suicide rate is the mortality due to self-inflicted injury, per 100 000 standard population, adjusted for age. It is taken from a combination of time series from WHO Violence and Injury Prevention (VIP) and data from WHO Global Burden of Disease 2002 and 2004. [Further information about worldwide suicide rates can be found at the WHO web site.]

As you can see, the suicide rate is a continuous variable, so it is hard to see patterns at this stage. I may get a better picture at a later stage, when I learn to put continuous variables into ranges.

Suicide rates are not available for all countries in the Gapminder data set, so I decided to look only at those countries where the suicide rate was available:

Total number of countries where suicide stats exist
191

I next created a frequency table for polityscore, using the subset of countries:

counts for polity score
polityscore
-10 2
-9 4
-8 2
-7 12
-6 3
-5 2
-4 6
-3 6
-2 5
-1 4
0 6
1 3
2 3
3 2
4 4
5 7
6 10
7 13
8 19
9 14
10 32
dtype: int64
percentages for polity score
polityscore
-10 0.938967
-9 1.877934
-8 0.938967
-7 5.633803
-6 1.408451
-5 0.938967
-4 2.816901
-3 2.816901
-2 2.347418
-1 1.877934
0 2.816901
1 1.408451
2 1.408451
3 0.938967
4 1.877934
5 3.286385
6 4.694836
7 6.103286
8 8.920188
9 6.572770
10 15.023474
dtype: float64

The polity score, or democracy score, is based on the 2009 Polity IV Project. It is calculated by subtracting an autocracy score from a democracy score, and gives a summary measure of a country’s democratic and free nature. -10 is the lowest value, and 10 the highest. [Further information about the Polity Project can be found at the Center for Systemic Peace web site.]

We can see that most countries scored highly on this measure. 32 countries, or 15% of all 213 countries in the data set, scored the highest possible score of 10, which means that they have open elections, checks on executive authority, and other measures of democracy. 23 countries (about 11%) have a score below -5, which would make them autocracies.
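The percentages in the polity table are calculated with len(data) as the denominator, so they are relative to all 213 countries rather than to the 191 countries in the subset. The sketch below shows how much the choice of denominator matters, using counts from the table above. The helper name `pct` is my own.

```python
total_countries = 213   # all rows in the data set
subset_countries = 191  # countries with a recorded suicide rate


def pct(count, denominator):
    """Percentage rounded to two decimal places."""
    return round(100 * count / denominator, 2)


full_democracies = 32              # countries with a polity score of 10
autocracies = 2 + 4 + 2 + 12 + 3   # countries with scores from -10 to -6

results = {
    "full democracies / all countries": pct(full_democracies, total_countries),
    "full democracies / subset": pct(full_democracies, subset_countries),
    "autocracies / all countries": pct(autocracies, total_countries),
    "autocracies / subset": pct(autocracies, subset_countries),
}

for label, value in results.items():
    print(label, value)
```

Whichever denominator is used, it helps to state it next to each percentage so that figures from different tables can be compared.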

I next created a frequency table for region, again using the subset of countries:

counts for region
Asia=1, Europe=2, Africa=3, Middle East=4, North and Central America=5,South America=6, Oceania=7
region
1 31
2 46
3 49
4 16
5 22
6 12
7 15
dtype: int64
percentages for region
region
1 14.553991
2 21.596244
3 23.004695
4 7.511737
5 10.328638
6 5.633803
7 7.042254
dtype: float64

The region variable did not exist in the original Gapminder data set, but I thought it would be useful to see patterns across the geographical regions of the world. I assigned a number to each region:

  1. Asia
  2. Europe
  3. Africa
  4. Middle East
  5. North and Central America
  6. South America
  7. Oceania

As we can see, Africa has the largest number of countries in this data set (49 countries, or 23.00% of all 213 countries), followed closely by Europe (46 countries, or 21.60%). The smallest number of countries is found in South America (12, or 5.63%).

Selecting a data set

This is part of my coursework for Data Management and Visualization.

I decided to use the Gapminder codebook to look at factors that affect the suicide rates of different countries.

I am particularly interested in the relationship between alcohol consumption and the suicide rate. My hypothesis is that alcohol consumption will be positively correlated to suicide rates. Alcohol can lower inhibitions and therefore increase the likelihood that someone will follow through on suicidal thoughts. Also, factors that increase the likelihood of alcoholism (for example, depression) might also be associated with suicide rates.

Some previous studies have found that alcohol dependence is an important risk factor for suicidal behaviour (for example, Sher 2005). However, the link has not been shown conclusively: Isacsson 2000 found no correlation between alcohol consumption and suicide, while Lester 1995, in a study of 13 nations, found that “suicide and homicide rates were usually, but not always, positively associated with the per capita consumption of alcohol”. Lester’s study uses old data, and I’m curious to know if this correlation holds up with 21st century data.

It’s likely that other factors moderate the relationship between alcohol consumption and suicide. These factors include income per person, employment rate, democracy score, and urban rate.

There may be other confounding factors, such as under-reporting of suicide or cultural factors around alcohol consumption. However, these are beyond the scope of this study.

I have created a personal codebook from the Gapminder data. My personal codebook includes variables for suicideper100th (mortality due to self-inflicted injury, per 100 000 standard population, age adjusted) and alcconsumption (recorded and estimated average alcohol consumption, adult (15+) per capita consumption in litres pure alcohol). It also includes variables for income per person, polity (democracy) score, employment rate, and urbanization rate.