Running my first Python program

This is part of my coursework for Data Management and Visualization.

I am using Python to analyse the data available from Gapminder.

Here is my Python program:

# import the data analysis package
import pandas

# load the entire data set into memory
data = pandas.read_csv('mynewgapminder.csv', low_memory=False)

print("Number of countries:")
print(len(data))  # number of observations (rows)
print("Number of variables:")
print(len(data.columns))  # number of variables (columns)

# ensure that the variables are numeric; values that cannot be parsed become NaN
# (pandas.to_numeric replaces convert_objects, which newer pandas versions no longer provide)
data['suicideper100th'] = pandas.to_numeric(data['suicideper100th'], errors='coerce')
data['polityscore'] = pandas.to_numeric(data['polityscore'], errors='coerce')
data['region'] = pandas.to_numeric(data['region'], errors='coerce')

# display counts and percentages for suicide rates
print("counts for suicide rate per 100,000 population, age adjusted")
ct1 = data.groupby('suicideper100th').size()
print(ct1)
print("percentages for suicide rate per 100,000 population, age adjusted")
pt1 = data.groupby('suicideper100th').size() * 100 / len(data)
print(pt1)

# I only want to look at countries where stats exist for suicide rate
# get the subset where suicide rates exist (comparisons with NaN are False)
sub1 = data[data['suicideper100th'] > 0]
sub2 = sub1.copy()

# total number of countries where suicide stats exist
print("Total number of countries where suicide stats exist")
print(len(sub2))

# display counts and percentages for polity score
# note: the percentages are taken over all rows in data, not only the subset
print("counts for polity score")
ct2 = sub2.groupby('polityscore').size()
print(ct2)
print("percentages for polity score")
pt2 = sub2.groupby('polityscore').size() * 100 / len(data)
print(pt2)

# display counts and percentages for regions
# note: the percentages are taken over all rows in data, not only the subset
print("counts for region")
print("Asia=1, Europe=2, Africa=3, Middle East=4, North and Central America=5, South America=6, Oceania=7")
ct3 = sub2.groupby('region').size()
print(ct3)
print("percentages for region")
pt3 = sub2.groupby('region').size() * 100 / len(data)
print(pt3)

When this program is run, it prints content to the IPython console, as described in the following paragraphs.

I first looked at the number of countries and variables:

Number of countries:
213
Number of variables:
17

The rows are the countries in the Gapminder data set, one row per country. The columns are the 16 variables from the Gapminder data set, plus an extra variable called “region”.
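As a small illustration of how rows and columns are counted, here is a minimal sketch using a made-up three-country table. The table and its values are hypothetical and are not taken from the Gapminder file; only `shape`, `len()` and `columns` are real pandas features.

```python
import pandas

# Hypothetical miniature data set: 3 countries (rows) and 2 variables (columns).
toy = pandas.DataFrame({
    "country": ["A", "B", "C"],
    "suicideper100th": [4.2, 11.7, None],
})

# shape is a tuple: (number of rows, number of columns)
n_rows, n_columns = toy.shape
print("Number of countries:", n_rows)
print("Number of variables:", n_columns)
```

The two numbers are the same ones that `len(toy)` and `len(toy.columns)` would give.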

I then produced frequency tables for three variables. Because my research project is looking at factors that influence suicide, the first variable is suicideper100th:

counts for suicide rate per 100,000 population, age adjusted
suicideper100th
0.201449 1
0.523528 1
1.370002 1
1.380965 1
1.392951 1
1.498057 1
1.519248 1
1.574350 1
1.658908 1
1.799904 1
1.922485 1
2.034178 1
2.109414 1
2.161843 1
2.206169 1
2.234896 1
2.515721 1
2.648981 1
2.816705 1
3.108603 1
3.146814 1
3.374416 1
3.563325 1
3.576478 1
3.716739 1
3.741588 1
3.940259 1
4.079525 1
4.119620 1
4.217076 1
..
14.554677 1
14.680936 1
14.713020 1
14.776250 1
15.538490 1
15.542603 1
15.714571 1
15.953850 1
16.234370 1
16.913248 1
16.959240 1
17.032646 1
18.583826 1
18.946930 1
18.954570 1
19.422610 1
20.162010 1
20.317930 1
20.369590 1
20.747431 1
22.353479 1
22.404560 1
25.404600 1
26.219198 1
26.874690 1
27.874160 1
28.104046 1
29.864164 1
33.341860 1
35.752872 1
dtype: int64
percentages for suicide rate per 100,000 population, age adjusted
suicideper100th
0.201449 0.469484
0.523528 0.469484
1.370002 0.469484
1.380965 0.469484
1.392951 0.469484
1.498057 0.469484
1.519248 0.469484
1.574350 0.469484
1.658908 0.469484
1.799904 0.469484
1.922485 0.469484
2.034178 0.469484
2.109414 0.469484
2.161843 0.469484
2.206169 0.469484
2.234896 0.469484
2.515721 0.469484
2.648981 0.469484
2.816705 0.469484
3.108603 0.469484
3.146814 0.469484
3.374416 0.469484
3.563325 0.469484
3.576478 0.469484
3.716739 0.469484
3.741588 0.469484
3.940259 0.469484
4.079525 0.469484
4.119620 0.469484
4.217076 0.469484

14.554677 0.469484
14.680936 0.469484
14.713020 0.469484
14.776250 0.469484
15.538490 0.469484
15.542603 0.469484
15.714571 0.469484
15.953850 0.469484
16.234370 0.469484
16.913248 0.469484
16.959240 0.469484
17.032646 0.469484
18.583826 0.469484
18.946930 0.469484
18.954570 0.469484
19.422610 0.469484
20.162010 0.469484
20.317930 0.469484
20.369590 0.469484
20.747431 0.469484
22.353479 0.469484
22.404560 0.469484
25.404600 0.469484
26.219198 0.469484
26.874690 0.469484
27.874160 0.469484
28.104046 0.469484
29.864164 0.469484
33.341860 0.469484
35.752872 0.469484
dtype: float64

The suicide rate is the mortality due to self-inflicted injury, per 100 000 standard population, adjusted for age. It is taken from a combination of time series from WHO Violence and Injury Prevention (VIP) and data from WHO Global Burden of Disease 2002 and 2004. [Further information about worldwide suicide rates can be found at the WHO web site.]

As you can see, the suicide rate is a continuous variable: every value shown occurs only once, so each count is 1 and each percentage is the same. That makes it hard to see patterns at this stage. I may get a better picture at a later stage, when I learn to put continuous variables into ranges.
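One possible way to put a continuous variable into ranges is `pandas.cut`. The sketch below is only an illustration: the six rates are invented and the bin edges are an arbitrary choice, not something prescribed by the course or by Gapminder.

```python
import pandas

# Hypothetical suicide rates (per 100,000) for six countries; not real Gapminder values.
rates = pandas.Series([0.5, 3.1, 7.8, 12.4, 18.9, 33.3], name="suicideper100th")

# Arbitrary bin edges for illustration: (0, 5], (5, 10], (10, 20], (20, 40].
edges = [0, 5, 10, 20, 40]
rate_group = pandas.cut(rates, bins=edges)

# Count how many countries fall into each range, listed from lowest to highest range.
counts = rate_group.value_counts().sort_index()
print(counts)
```

With ranges instead of individual values, the frequency table has a handful of rows rather than one row per country.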

Suicide rates are not available for all countries in the Gapminder data set, so I decided to look only at those countries where the suicide rate was available:

Total number of countries where suicide stats exist
191
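The program selects these countries with a `> 0` comparison, which works because any comparison with a missing value (NaN) is False. A slightly more explicit alternative is `notna()`. The following sketch compares the two on a made-up table; the country labels and rates are hypothetical.

```python
import pandas

# Hypothetical rows; None stands for a country with no suicide statistic.
toy = pandas.DataFrame({
    "country": ["A", "B", "C", "D"],
    "suicideper100th": [4.2, None, 11.7, None],
})

# Comparisons with NaN are False, so "> 0" drops the missing values
# (it would also drop a genuine rate of exactly 0, if one existed).
by_comparison = toy[toy["suicideper100th"] > 0].copy()

# notna() states the intent directly: keep rows where a value is present.
by_notna = toy[toy["suicideper100th"].notna()].copy()

print(len(by_comparison), len(by_notna))
```

On this toy table both approaches keep the same two countries.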

I next created a frequency table for polityscore, using the subset of countries:

counts for polity score
polityscore
-10 2
-9 4
-8 2
-7 12
-6 3
-5 2
-4 6
-3 6
-2 5
-1 4
0 6
1 3
2 3
3 2
4 4
5 7
6 10
7 13
8 19
9 14
10 32
dtype: int64
percentages for polity score
polityscore
-10 0.938967
-9 1.877934
-8 0.938967
-7 5.633803
-6 1.408451
-5 0.938967
-4 2.816901
-3 2.816901
-2 2.347418
-1 1.877934
0 2.816901
1 1.408451
2 1.408451
3 0.938967
4 1.877934
5 3.286385
6 4.694836
7 6.103286
8 8.920188
9 6.572770
10 15.023474
dtype: float64

The polity score, or democracy score, is based on the 2009 Polity IV Project. It is calculated by subtracting an autocracy score from a democracy score, and gives a summary measure of a country’s democratic and free nature. -10 is the lowest value, and 10 the highest. [Further information about the Polity Project can be found at the Center for Systemic Peace web site.]

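We can see that most countries scored highly on this measure: more than half of the countries with a polity score (88 of 159) scored 6 or higher. A total of 32 countries, or 15% of all 213 countries in the data set, received the highest possible score of 10, which means that they have open elections, checks on executive authority, and other measures of democracy. 23 countries (about 11% of all 213 countries, or 12% of the 191 countries in the subset) have a score below -5, which would make them autocracies.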
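The percentage depends on which total is used as the denominator. As a rough check, the sketch below recomputes the share of countries with a score of 10 against three possible totals. All counts are read off the output above (159 is the sum of the polity score counts); the helper function `percent` is just a name made up for this example.

```python
# Counts taken from the program output above.
total_countries = 213    # rows in the full data set
subset_countries = 191   # countries where a suicide rate exists
scored_countries = 159   # sum of the polity score counts
top_score = 32           # countries with a polity score of 10


def percent(part, whole):
    """Return part as a percentage of whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


print(percent(top_score, total_countries))   # share of all countries
print(percent(top_score, subset_countries))  # share of the subset
print(percent(top_score, scored_countries))  # share of countries with a score
```

The program divides by `len(data)`, so the percentages it prints correspond to the first of these three figures.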

I next created a frequency table for region, again using the subset of countries:

counts for region
Asia=1, Europe=2, Africa=3, Middle East=4, North and Central America=5,South America=6, Oceania=7
region
1 31
2 46
3 49
4 16
5 22
6 12
7 15
dtype: int64
percentages for region
region
1 14.553991
2 21.596244
3 23.004695
4 7.511737
5 10.328638
6 5.633803
7 7.042254
dtype: float64

The region variable did not exist in the original Gapminder data set, but I thought it would be useful to see patterns across the geographical regions of the world. I assigned a number to each region:

  1. Asia
  2. Europe
  3. Africa
  4. Middle East
  5. North and Central America
  6. South America
  7. Oceania

As we can see, Africa has the largest number of countries in this subset (49 countries, or 23.0% of all 213 countries in the data set), followed closely by Europe (46 countries, or 21.6%). The smallest number of countries is found in South America (12, or 5.6%).
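The numeric region codes can be swapped for readable names when printing the table. This is a minimal sketch, assuming the counts are typed in by hand from the output above rather than recomputed from the CSV file; the variable names are my own.

```python
import pandas

# Region codes as listed above.
region_names = {
    1: "Asia",
    2: "Europe",
    3: "Africa",
    4: "Middle East",
    5: "North and Central America",
    6: "South America",
    7: "Oceania",
}

# Counts per region code, copied from the frequency table above.
region_counts = pandas.Series([31, 46, 49, 16, 22, 12, 15], index=[1, 2, 3, 4, 5, 6, 7])

# Replace the numeric codes with names and list the largest regions first.
labelled = region_counts.rename(index=region_names).sort_values(ascending=False)
print(labelled)
```

Sorting by count puts Africa first and South America last, which matches the reading of the table above.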
