*This is part of my coursework for Data Management and Visualization.*

I am using Python to analyse the data available from Gapminder. Following on from last week’s assignment, I had to decide how I wanted to manage the variables **suicideper100th**, **alcconsumption**, **polityscore**, and **region**.

Here is the Python program:

#import data analysis package

import pandas

import numpy

# bug fix for display formats to avoid run time errors

pandas.set_option(‘display.float_format’, lambda x:’%f’%x)

#import the entire data set to memory

data = pandas.read_csv(‘mynewgapminder.csv’, low_memory=False)

#ensure that variables are numeric

data[‘suicideper100th’] = data[‘suicideper100th’].convert_objects(convert_numeric=True)

data[‘alcconsumption’] = data[‘alcconsumption’].convert_objects(convert_numeric=True)

data[‘polityscore’] = data[‘polityscore’].convert_objects(convert_numeric=True)

data[‘region’] = data[‘region’].convert_objects(convert_numeric=True)

#I only want to look at countries where stats exist for suicide rate

#get subset where suicide rates exist

sub1=data[data[‘suicideper100th’]>0]

sub2=sub1.copy()

#Number of observations (rows)

print (“Number of countries:”)

print(len(sub2))

#add spacing

print ()

#first look at suicide rates

print (“Suicide rate per 100,000 population, age adjusted”)

print (‘————————-‘)

#find lowest value

minSuicide = min(sub2.suicideper100th)

print (‘Minimum:’,minSuicide )

#find highest value

maxSuicide = max(sub2.suicideper100th)

print (‘Maximum:’,maxSuicide )

#find median

medianSuicide = numpy.median(sub2.suicideper100th)

print (‘Average (median):’,medianSuicide)

print ()

# categorize suicide rate based on customized splits using cut function

# put the suicide rate into “bins” for values between 0 and 40

print (‘Frequency table for suicide rate’)

sub2[‘suicideSplit’] = pandas.cut(sub2.suicideper100th, [0, 5, 10, 15, 20, 25, 30, 35, 40 ])

c1 = sub2[‘suicideSplit’].value_counts(sort=False, dropna=True)

print(c1)

print ()

#percentage distribution for suicide rate

print (“Percentage table for suicide rate”)

p1 = sub2[‘suicideSplit’].value_counts(sort=False)*100/len(sub2)

print(p1)

print ()

#next look at alcohol consumption

print (“Alcohol consumption per adult (age 15+), in litres”)

print (‘————————-‘)

#find lowest value

minAlcohol = min(sub2.alcconsumption)

print (‘Minimum:’,minAlcohol )

#find highest value

maxAlcohol = max(sub2.alcconsumption)

print (‘Maximum:’,maxAlcohol )

#find median

medianAlcohol = numpy.median(sub2.alcconsumption)

print (‘Average (median):’,medianAlcohol)

print ()

# categorize alcohol consumption rate based on customized splits using cut function

#put the alcohol consumption rate into “bins” for values between 0 and 25

#also show blank (NaN) values

print (‘Frequency table for alcohol consumption’)

sub2[‘alcoholSplit’] = pandas.cut(sub2.alcconsumption, [0, 2.5, 5, 7.5, 10, 12.5, 15, 17.5, 20, 22.5, 25])

c2 = sub2[‘alcoholSplit’].value_counts(sort=False, dropna=False)

print(c2)

print ()

#percentage distribution for alcohol consumption rate

print (“Percentage table for alcohol consumption”)

p2 = sub2[‘alcoholSplit’].value_counts(sort=False, dropna=False)*100/len(sub2)

print(p2)

print ()

#next look at polity score

print (“Polity (democracy) score”)

print (‘————————-‘)

#describe categories from the Polity IV Project

print(“-10 to -6 = Autocracy, -5 to 0 = Closed Anocracy, 1 to 5 = Open Anocracy, 6 to 10 = Democracy, 10 = Full Democracy”)

print ()

#find lowest value

minPolity = min(sub2.polityscore)

print (‘Minimum:’,minPolity )

#find highest value

maxPolity = max(sub2.polityscore)

print (‘Maximum:’,maxPolity )

#find median

medianPolity = numpy.median(sub2.polityscore)

print (‘Average (median):’,medianPolity)

print ()

##display counts and percentages for polity score categories

#also show blank (NaN) values

print (‘Frequency table for polity score’)

sub2[‘politySplit’] = pandas.cut(sub2.polityscore, [-11, -6, 0, 5, 9, 10])

c3 = sub2[‘politySplit’].value_counts(sort=False, dropna=False)

print(c3)

print ()

#percentage distribution for polity score

print (“Percentage table for polity score”)

p3 = sub2[‘politySplit’].value_counts(sort=False, dropna=False)*100/len(sub2)

print(p3)

print ()

#next look at regions

print (“Geographical regions”)

print (‘————————-‘)

#describe categories

print (“Asia=1, Europe=2, Africa=3, Middle East=4, North and Central America=5,South America=6, Oceania=7”)

print()

print(‘Frequency table for region’)

c4 = sub2[‘region’].value_counts(sort=False, dropna=False)

print (c4)

print ()

print (“Percentage table for region”)

p4 = sub2[‘region’].value_counts(sort=False)*100/len(sub2)

print (p4)

print ()

#Finally, look at suicide rates only in Asia

print (“Suicide rate per 100,000 population, age adjusted (Asia only)”)

print (‘————————-‘)

# get subset of values for Asia

sub3=sub2[sub2[‘region’]==1]

AsiaStats=sub3.copy()

#find lowest value

minSuicide = min(AsiaStats.suicideper100th)

print (‘Minimum:’,minSuicide )

#find highest value

maxSuicide = max(AsiaStats.suicideper100th)

print (‘Maximum:’,maxSuicide )

#find median

medianSuicide = numpy.median(AsiaStats.suicideper100th)

print (‘Average (median):’,medianSuicide)

print ()

# categorize suicide rate based on customized splits using cut function

# put the suicide rate into “bins” for values between 0 and 40

print (‘Frequency table for suicide rate (Asia only)’)

AsiaStats[‘suicideSplit’] = pandas.cut(AsiaStats.suicideper100th, [0, 5, 10, 15, 20, 25, 30, 35, 40 ])

c5 = AsiaStats[‘suicideSplit’].value_counts(sort=False, dropna=True)

print(c5)

print ()

#percentage distribution for suicide rate

print (“Percentage table for suicide rate (Asia only)”)

p5 = AsiaStats[‘suicideSplit’].value_counts(sort=False)*100/len(AsiaStats)

print(p5)

When this program is run, it prints content to the IPython console, as described in the following paragraphs.

First, I got a subset of the Gapminder data set, including only those countries that had a value for the suicide rate:

Number of countries:

191

The first variable I looked at was the suicide rate:

Suicide rate per 100,000 population, age adjusted

————————-

Minimum: 0.2014487237

Maximum: 35.752872467

Average (median): 8.2628927231

Frequency table for suicide rate

(0, 5] 49

(5, 10] 65

(10, 15] 51

(15, 20] 12

(20, 25] 6

(25, 30] 6

(30, 35] 1

(35, 40] 1

dtype: int64

Percentage table for suicide rate

(0, 5] 25.654450

(5, 10] 34.031414

(10, 15] 26.701571

(15, 20] 6.282723

(20, 25] 3.141361

(25, 30] 3.141361

(30, 35] 0.523560

(35, 40] 0.523560

dtype: float64

The lowest reported suicide rate is 0.2 (to 2 decimal places) and the highest is 35.75. The median value is 8.26.

Because the suicide rate is a continuous variable (a typical value is something like “4.2170763016”), it is useful to put these values into bins with a range of 5 before running a frequency distribution. As we can see, the most common range is between 5 and 10 reported suicides per 100000 of the population. 65 countries, or 34.03% of the total, have between 5 and 10 suicides per 100000 of the population. Only two countries have more than 30 reported suicides per 100000 of the population. There are no “not a number”(NaN) values, because I have already stripped out any countries without a recorded suicide rate.

I next looked at the alcohol consumption variable:

Alcohol consumption per adult (age 15+), in litres

————————-

Minimum: 0.03

Maximum: 23.01

Average (median): 6.16

Frequency table for alcohol consumption

(0, 2.5] 45

(2.5, 5] 35

(5, 7.5] 31

(7.5, 10] 30

(10, 12.5] 20

(12.5, 15] 13

(15, 17.5] 8

(17.5, 20] 2

(20, 22.5] 0

(22.5, 25] 1

NaN 6

dtype: int64

Percentage table for alcohol consumption

(0, 2.5] 23.560209

(2.5, 5] 18.324607

(5, 7.5] 16.230366

(7.5, 10] 15.706806

(10, 12.5] 10.471204

(12.5, 15] 6.806283

(15, 17.5] 4.188482

(17.5, 20] 1.047120

(20, 22.5] 0.000000

(22.5, 25] 0.523560

NaN 3.141361

dtype: float64

The lowest alcohol consumption rate is 0.03 litres per adult, and the highest is 23.01. The median value is 6.16 litres.

Again, the alcohol consumption figure is a continuous variable, so I put these values into bins with a range of 2.5 litres. As we can see, the most common range is between 0 and 2.5 litres. In 45 countries, or 23.56% of the total, each person over the age of 15 drank an average of between 0 and 2.5 litres of pure alcohol. Only one country has an alcohol consumption rate of over 20 litres. No data is available for 6 countries (2.81%).

I next created a frequency table for the polity (democracy) score:

Polity (democracy) score

————————-

-10 to -6 = Autocracy, -5 to 0 = Closed Anocracy, 1 to 5 = Open Anocracy, 6 to 10 = Democracy, 10 = Full Democracy

Minimum: -10.0

Maximum: 10.0

Average (median): 8.0

Frequency table for polity score

(-11, -6] 23

(-6, 0] 29

(0, 5] 19

(5, 9] 56

(9, 10] 32

NaN 32

dtype: int64

Percentage table for polity score

(-11, -6] 12.041885

(-6, 0] 15.183246

(0, 5] 9.947644

(5, 9] 29.319372

(9, 10] 16.753927

NaN 16.753927

dtype: float64

The polity score is based on the 2009 Polity IV Project, which is sponsored by the Political Instability Task Force (PITF). PITF defines countries as democracies, anocracies, or autocracies depending on this score (see http://www.systemicpeace.org/polity/polity4x.htm). I thought it useful to bin the polity scores to correspond with the PITF categories.

We can see that 32 countries, or 16.75% of the total, score as Full Democracies. The most common category is Democracy, with 56 countries or 29.32% of the total. 19 countries (9.95%) scored as Open Anocracies, 29 (15.18%) as Closed Autocracies, and 23 (12.04%) as Autocracies. No data was available for 32 countries (16.75%).

I next created a frequency table for region:

Geographical regions

————————-

Asia=1, Europe=2, Africa=3, Middle East=4, North and Central America=5,South America=6, Oceania=7

Frequency table for region

1 31

2 46

3 49

4 16

5 22

6 12

7 15

dtype: int64

Percentage table for region

1 16.230366

2 24.083770

3 25.654450

4 8.376963

5 11.518325

6 6.282723

7 7.853403

dtype: float64

Because these are “dummy” values, there’s no need to calculate medians or to put values into bins.

The “region” variable did not exist in the original Gapminder data set, but I created it as a potentially useful category for grouping countries. As an example of this, here is a frequency distribution of suicide rates for only Asian countries:

Suicide rate per 100,000 population, age adjusted (Asia only)

————————-

Minimum: 1.3809646368

Maximum: 28.1040458679

Average (median): 11.3961114883

Frequency table for suicide rate (Asia only)

(0, 5] 6

(5, 10] 8

(10, 15] 8

(15, 20] 4

(20, 25] 2

(25, 30] 3

(30, 35] 0

(35, 40] 0

dtype: int64

Percentage table for suicide rate (Asia only)

(0, 5] 19.354839

(5, 10] 25.806452

(10, 15] 25.806452

(15, 20] 12.903226

(20, 25] 6.451613

(25, 30] 9.677419

(30, 35] 0.000000

(35, 40] 0.000000

dtype: float64

We can see some differences between the Asian and the worldwide data. Asian countries have a median of 11.4 suicides per 100000 of the population, much higher than the worldwide median of 8.26. The frequency distribution is also different; a lower percentage of Asian countries have a suicide rate under 10 per 100000, while a much higher percentage of Asian countries are in the 15 to 30 categories. However, no Asian country has a suicide rate of over 30 per 100000.

A similar breakdown could be done for other regions or other variables.