Making Data Management Decisions

This is part of my coursework for Data Management and Visualization.

I am using Python to analyse the data available from Gapminder. Following on from last week’s assignment, I had to decide how I wanted to manage the variables suicideper100thalcconsumptionpolityscore, and region.

Here is the Python program:

#import data analysis package
import pandas
import numpy

# bug fix for display formats to avoid run time errors
pandas.set_option(‘display.float_format’, lambda x:’%f’%x)

#import the entire data set to memory
data = pandas.read_csv(‘mynewgapminder.csv’, low_memory=False)

#ensure that variables are numeric
data[‘suicideper100th’] = data[‘suicideper100th’].convert_objects(convert_numeric=True)
data[‘alcconsumption’] = data[‘alcconsumption’].convert_objects(convert_numeric=True)
data[‘polityscore’] = data[‘polityscore’].convert_objects(convert_numeric=True)
data[‘region’] = data[‘region’].convert_objects(convert_numeric=True)

#I only want to look at countries where stats exist for suicide rate
#get subset where suicide rates exist
sub1=data[data[‘suicideper100th’]>0]
sub2=sub1.copy()

#Number of observations (rows)
print (“Number of countries:”)
print(len(sub2))

#add spacing
print ()

#first look at suicide rates
print (“Suicide rate per 100,000 population, age adjusted”)
print (‘————————-‘)
#find lowest value
minSuicide = min(sub2.suicideper100th)
print (‘Minimum:’,minSuicide )
#find highest value
maxSuicide = max(sub2.suicideper100th)
print (‘Maximum:’,maxSuicide )
#find median
medianSuicide = numpy.median(sub2.suicideper100th)
print (‘Average (median):’,medianSuicide)
print ()
# categorize suicide rate based on customized splits using cut function
# put the suicide rate into “bins” for values between 0 and 40
print (‘Frequency table for suicide rate’)
sub2[‘suicideSplit’] = pandas.cut(sub2.suicideper100th, [0, 5, 10, 15, 20, 25, 30, 35, 40 ])
c1 = sub2[‘suicideSplit’].value_counts(sort=False, dropna=True)
print(c1)
print ()
#percentage distribution for suicide rate
print (“Percentage table for suicide rate”)
p1 = sub2[‘suicideSplit’].value_counts(sort=False)*100/len(sub2)
print(p1)
print ()

#next look at alcohol consumption
print (“Alcohol consumption per adult (age 15+), in litres”)
print (‘————————-‘)
#find lowest value
minAlcohol = min(sub2.alcconsumption)
print (‘Minimum:’,minAlcohol )
#find highest value
maxAlcohol = max(sub2.alcconsumption)
print (‘Maximum:’,maxAlcohol )
#find median
medianAlcohol = numpy.median(sub2.alcconsumption)
print (‘Average (median):’,medianAlcohol)
print ()
# categorize alcohol consumption rate based on customized splits using cut function
#put the alcohol consumption rate into “bins” for values between 0 and 25
#also show blank (NaN) values
print (‘Frequency table for alcohol consumption’)
sub2[‘alcoholSplit’] = pandas.cut(sub2.alcconsumption, [0, 2.5, 5, 7.5, 10, 12.5, 15, 17.5, 20, 22.5, 25])
c2 = sub2[‘alcoholSplit’].value_counts(sort=False, dropna=False)
print(c2)
print ()
#percentage distribution for alcohol consumption rate
print (“Percentage table for alcohol consumption”)
p2 = sub2[‘alcoholSplit’].value_counts(sort=False, dropna=False)*100/len(sub2)
print(p2)
print ()

#next look at polity score
print (“Polity (democracy) score”)
print (‘————————-‘)
#describe categories from the Polity IV Project
print(“-10 to -6 = Autocracy, -5 to 0 = Closed Anocracy, 1 to 5 = Open Anocracy, 6 to 10 = Democracy, 10 = Full Democracy”)
print ()
#find lowest value
minPolity = min(sub2.polityscore)
print (‘Minimum:’,minPolity )
#find highest value
maxPolity = max(sub2.polityscore)
print (‘Maximum:’,maxPolity )
#find median
medianPolity = numpy.median(sub2.polityscore)
print (‘Average (median):’,medianPolity)
print ()
##display counts and percentages for polity score categories
#also show blank (NaN) values
print (‘Frequency table for polity score’)
sub2[‘politySplit’] = pandas.cut(sub2.polityscore, [-11, -6, 0, 5, 9, 10])
c3 = sub2[‘politySplit’].value_counts(sort=False, dropna=False)
print(c3)
print ()
#percentage distribution for polity score
print (“Percentage table for polity score”)
p3 = sub2[‘politySplit’].value_counts(sort=False, dropna=False)*100/len(sub2)
print(p3)
print ()

#next look at regions
print (“Geographical regions”)
print (‘————————-‘)
#describe categories
print (“Asia=1, Europe=2, Africa=3, Middle East=4, North and Central America=5,South America=6, Oceania=7”)
print()
print(‘Frequency table for region’)
c4 = sub2[‘region’].value_counts(sort=False, dropna=False)
print (c4)
print ()
print (“Percentage table for region”)
p4 = sub2[‘region’].value_counts(sort=False)*100/len(sub2)
print (p4)
print ()

#Finally, look at suicide rates only in Asia
print (“Suicide rate per 100,000 population, age adjusted (Asia only)”)
print (‘————————-‘)
# get subset of values for Asia
sub3=sub2[sub2[‘region’]==1]
AsiaStats=sub3.copy()
#find lowest value
minSuicide = min(AsiaStats.suicideper100th)
print (‘Minimum:’,minSuicide )
#find highest value
maxSuicide = max(AsiaStats.suicideper100th)
print (‘Maximum:’,maxSuicide )
#find median
medianSuicide = numpy.median(AsiaStats.suicideper100th)
print (‘Average (median):’,medianSuicide)
print ()
# categorize suicide rate based on customized splits using cut function
# put the suicide rate into “bins” for values between 0 and 40
print (‘Frequency table for suicide rate (Asia only)’)
AsiaStats[‘suicideSplit’] = pandas.cut(AsiaStats.suicideper100th, [0, 5, 10, 15, 20, 25, 30, 35, 40 ])
c5 = AsiaStats[‘suicideSplit’].value_counts(sort=False, dropna=True)
print(c5)
print ()
#percentage distribution for suicide rate
print (“Percentage table for suicide rate (Asia only)”)
p5 = AsiaStats[‘suicideSplit’].value_counts(sort=False)*100/len(AsiaStats)
print(p5)

When this program is run, it prints content to the IPython console, as described in the following paragraphs.

First, I got a subset of the Gapminder data set, including only those countries that had a value for the suicide rate:

Number of countries:
191

The first variable I looked at was the suicide rate:

Suicide rate per 100,000 population, age adjusted
————————-
Minimum: 0.2014487237
Maximum: 35.752872467
Average (median): 8.2628927231

Frequency table for suicide rate
(0, 5] 49
(5, 10] 65
(10, 15] 51
(15, 20] 12
(20, 25] 6
(25, 30] 6
(30, 35] 1
(35, 40] 1
dtype: int64

Percentage table for suicide rate
(0, 5] 25.654450
(5, 10] 34.031414
(10, 15] 26.701571
(15, 20] 6.282723
(20, 25] 3.141361
(25, 30] 3.141361
(30, 35] 0.523560
(35, 40] 0.523560
dtype: float64

The lowest reported suicide rate is 0.2 (to 2 decimal places) and the highest is 35.75. The median value is 8.26.

Because the suicide rate is a continuous variable (a typical value is something like “4.2170763016”), it is useful to put these values into bins with a range of 5 before running a frequency distribution. As we can see, the most common range is between 5 and 10 reported suicides per 100000 of the population. 65 countries, or 34.03% of the total, have between 5 and 10 suicides per 100000 of the population. Only two countries have more than 30 reported suicides per 100000 of the population. There are no “not a number”(NaN) values, because I have already stripped out any countries without a recorded suicide rate.

I next looked at the alcohol consumption variable:

Alcohol consumption per adult (age 15+), in litres
————————-
Minimum: 0.03
Maximum: 23.01
Average (median): 6.16

Frequency table for alcohol consumption
(0, 2.5] 45
(2.5, 5] 35
(5, 7.5] 31
(7.5, 10] 30
(10, 12.5] 20
(12.5, 15] 13
(15, 17.5] 8
(17.5, 20] 2
(20, 22.5] 0
(22.5, 25] 1
NaN 6
dtype: int64

Percentage table for alcohol consumption
(0, 2.5] 23.560209
(2.5, 5] 18.324607
(5, 7.5] 16.230366
(7.5, 10] 15.706806
(10, 12.5] 10.471204
(12.5, 15] 6.806283
(15, 17.5] 4.188482
(17.5, 20] 1.047120
(20, 22.5] 0.000000
(22.5, 25] 0.523560
NaN 3.141361
dtype: float64

The lowest alcohol consumption rate is 0.03 litres per adult, and the highest is 23.01. The median value is 6.16 litres.

Again, the alcohol consumption figure is a continuous variable, so I put these values into bins with a range of 2.5 litres. As we can see, the most common range is between 0 and 2.5 litres. In 45 countries, or 23.56% of the total, each person over the age of 15 drank an average of between 0 and 2.5 litres of pure alcohol. Only one country has an alcohol consumption rate of over 20 litres. No data is available for 6 countries (2.81%).

I next created a frequency table for the polity (democracy) score:

Polity (democracy) score
————————-
-10 to -6 = Autocracy, -5 to 0 = Closed Anocracy, 1 to 5 = Open Anocracy, 6 to 10 = Democracy, 10 = Full Democracy

Minimum: -10.0
Maximum: 10.0
Average (median): 8.0

Frequency table for polity score
(-11, -6] 23
(-6, 0] 29
(0, 5] 19
(5, 9] 56
(9, 10] 32
NaN 32
dtype: int64

Percentage table for polity score
(-11, -6] 12.041885
(-6, 0] 15.183246
(0, 5] 9.947644
(5, 9] 29.319372
(9, 10] 16.753927
NaN 16.753927
dtype: float64

The polity score is based on the 2009 Polity IV Project, which is sponsored by the Political Instability Task Force (PITF). PITF defines countries as democracies, anocracies, or autocracies depending on this score (see http://www.systemicpeace.org/polity/polity4x.htm). I thought it useful to bin the polity scores to correspond with the PITF categories.

We can see that 32 countries, or 16.75% of the total, score as Full Democracies. The most common category is Democracy, with 56 countries or 29.32% of the total. 19 countries (9.95%) scored as Open Anocracies, 29 (15.18%) as Closed Autocracies, and 23 (12.04%) as Autocracies. No data was available for 32 countries (16.75%).

I next created a frequency table for region:

Geographical regions
————————-
Asia=1, Europe=2, Africa=3, Middle East=4, North and Central America=5,South America=6, Oceania=7

Frequency table for region
1 31
2 46
3 49
4 16
5 22
6 12
7 15
dtype: int64

Percentage table for region
1 16.230366
2 24.083770
3 25.654450
4 8.376963
5 11.518325
6 6.282723
7 7.853403
dtype: float64

Because these are “dummy” values, there’s no need to calculate medians or to put values into bins.

The “region” variable did not exist in the original Gapminder data set, but I created it as a potentially useful category for grouping countries. As an example of this, here is a frequency distribution of suicide rates for only Asian countries:

Suicide rate per 100,000 population, age adjusted (Asia only)
————————-
Minimum: 1.3809646368
Maximum: 28.1040458679
Average (median): 11.3961114883

Frequency table for suicide rate (Asia only)
(0, 5] 6
(5, 10] 8
(10, 15] 8
(15, 20] 4
(20, 25] 2
(25, 30] 3
(30, 35] 0
(35, 40] 0
dtype: int64

Percentage table for suicide rate (Asia only)
(0, 5] 19.354839
(5, 10] 25.806452
(10, 15] 25.806452
(15, 20] 12.903226
(20, 25] 6.451613
(25, 30] 9.677419
(30, 35] 0.000000
(35, 40] 0.000000
dtype: float64

We can see some differences between the Asian and the worldwide data. Asian countries have a median of 11.4  suicides per 100000 of the population, much higher than the worldwide median of 8.26. The frequency distribution is also different; a lower percentage of Asian countries have a suicide rate under 10 per 100000, while a much higher percentage of Asian countries are in the 15 to 30 categories. However, no Asian country has a suicide rate of over 30 per 100000.

A similar breakdown could be done for other regions or other variables.

Advertisements

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s