How To Access the Raw Data

All population data used to calculate the Birth Gap is freely available from Eurostat, the statistical office of the European Union. Eurostat collects population data from all European Union members, as well as seven additional states (Iceland, Liechtenstein, Macedonia, Montenegro, Norway, Switzerland and Turkey). The data is published at national level and at a series of regional levels known as “NUTS”. More information can be found on the Eurostat website here.

The file with the number of people by region and age is ‘demo_r_d2jan’, and anyone can access it here. It is the first file in the ‘regional’ section. A version with national data only is named ‘demo_jan’ and is available through the same link; it is the first file in the main ‘Population’ section.

The downloaded files are in ‘gzip’ format, so you will need a tool to unzip them. A number of free software tools that do this can be downloaded for Windows and Mac OS X. You will then have to work with a slightly cumbersome file format that separates columns in two different ways: the first column packs gender, age and region together, separated by commas, while the yearly figures are separated by tabs. One alternative is to use a coding language such as Python, with a script like the one provided below.
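As a rough illustration of the unzip step and the two separators, here is a minimal sketch that uses only the Python 3 standard library. The sample row is made up and is not real Eurostat data, and the function names ‘split_row’ and ‘read_gzip_rows’ are hypothetical helpers invented for this example.

```python
import gzip
import os
import tempfile


def split_row(line):
    """Split one raw row: tabs separate the columns, and commas separate
    the keys packed into the first column."""
    cells = [cell.strip() for cell in line.rstrip("\n").split("\t")]
    keys = cells[0].split(",")
    return keys, cells[1:]


def read_gzip_rows(path):
    """Decompress a gzip text file and return its rows, split as above."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [split_row(line) for line in handle if line.strip()]


# Made-up sample in the same layout as the downloaded file (not real data).
sample = "sex,age,geo\\time\t2015 \t2014 \nT,Y51,AT11\t4000 \t: \n"
sample_path = os.path.join(tempfile.mkdtemp(), "sample.tsv.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
    handle.write(sample)

rows = read_gzip_rows(sample_path)
print(rows[0])  # header row: key names and year labels
print(rows[1])  # data row: key values and yearly cells
```

To try it on a real download, pass the path of the downloaded file to ‘read_gzip_rows’ in place of the sample path.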

What you will have access to is the age of everyone in the EU, plus the seven other countries, down to NUTS2 regional level, for every year since 1990. The ‘age’ column has rows labeled with a ‘Y’ followed by age next birthday, so ‘Y51’ means people due to turn 51 at their next birthday (the group used in the Birth Gap measure). Similarly, ‘Y1’ refers to those currently under one year old. The data is also broken out by gender, but for this particular exercise you probably want the total of males and females, which is listed simply as a gender of ‘T’. Finally, a lookup file is available here to translate the NUTS codes into meaningful region names.
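The sketch below shows one way to pick out the rows described above once the file has been split into columns. It assumes each row is a Python dictionary with ‘gender’, ‘age’ and ‘geo’ keys; the figures are made up, and the small ‘region_names’ dictionary is only a stand-in for the real lookup file, whose layout is not described here.

```python
def age_from_label(label):
    """Turn an age label such as 'Y51' into the integer 51.
    Returns None for any label that is not 'Y' followed by digits."""
    if label.startswith("Y") and label[1:].isdigit():
        return int(label[1:])
    return None


def select_rows(rows, age_label, gender="T"):
    """Keep rows for one age label and one gender ('T' is the total)."""
    return [r for r in rows if r["age"] == age_label and r["gender"] == gender]


# Made-up rows for illustration only (not real Eurostat figures).
rows = [
    {"gender": "T", "age": "Y51", "geo": "AT11", "2015": 4000},
    {"gender": "F", "age": "Y51", "geo": "AT11", "2015": 2050},
    {"gender": "T", "age": "Y1", "geo": "AT11", "2015": 2500},
]

# Hypothetical stand-in for the NUTS lookup file.
region_names = {"AT11": "Burgenland"}

for row in select_rows(rows, "Y51"):
    name = region_names.get(row["geo"], row["geo"])
    print(name, age_from_label(row["age"]), row["2015"])
```

Calling ‘select_rows(rows, "Y1")’ in the same way returns the rows for the youngest group.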

With that, you should be able to create your own Birth Gap metric for any NUTS region from 1990 onwards to match the data on this site.

Note that the data used to calculate the Birth Gap for this site was reviewed and confirmed accurate on June 14th 2016. Data for all countries and regions matched exactly, with the sole exception of Austria, which had changed by a very small margin (a hundred people or so) due to a revision of the data for that country.


Using Python to read the file and arrange its columns

For those of you able to use a coding language like Python, or who know someone who can, the following code snippet may help. The ‘f_in’ variable is the path and filename of the downloaded file, and ‘f_out’ is the name of the desired ‘clean’ CSV file to output. The code snippet works on Python 3.x and may work on 2.x. (Note: indentation may not display properly on this site, so the lines that need indenting are flagged in a comment.)


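# import the raw Eurostat file and convert it to a clean CSV file

import re
import pandas

# placeholder names: replace with your own paths
f_in = 'demo_r_d2jan.tsv.gz'       # path and filename of the downloaded file
f_out = 'demo_r_d2jan_clean.csv'   # name of the clean CSV file to write
delim = '\t'                       # assumed: the yearly columns are tab-separated

# read every cell as text so the clean-up below sees the values as published
df = pandas.read_csv(f_in, compression='gzip', header=0, delimiter=delim, na_values=[':', ': '], dtype=str)
df = df.rename(columns=lambda x: x.strip())

# the first column packs gender, age and region together, separated by commas
df['gender'] = df['sex,age,geo\\time'].apply(lambda x: x.split(',')[0])
df['age'] = df['sex,age,geo\\time'].apply(lambda x: x.split(',')[1])
df['geo'] = df['sex,age,geo\\time'].apply(lambda x: x.split(',')[2])

# missing values become zero
df = df.fillna('0')

# remove spurious characters (non-numeric notes etc.)
for i in range(1990, 2016):
    # the three lines below must be indented inside the loop
    colname = str(i)
    df[colname] = df[colname].astype('str')
    df[colname] = df[colname].apply(lambda x: re.sub('[^0-9]', '', x))

# write the CSV file
df.to_csv(f_out, index=False)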
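To see what the clean-up loop does to an individual cell, here is a small sketch that applies the same regular expression to a few made-up values. The ‘clean_cell’ helper and the trailing letters ‘p’ and ‘b’ are illustrative assumptions standing in for whatever notes appear in the real file.

```python
import re


def clean_cell(value):
    """Keep only the digits of a cell, using the same pattern as the loop."""
    return re.sub("[^0-9]", "", str(value))


# Made-up cells: a plain number, two numbers with a trailing note, and a zero.
examples = ["4000", "4000 p", "4000 b", 0]
cleaned = [clean_cell(value) for value in examples]
print(cleaned)

# Caveat: a number stored as a decimal loses its point but keeps the digit
# after it, so 4000.0 turns into '40000'.
print(clean_cell(4000.0))
```

Because of that caveat, it is worth checking that the yearly columns hold text or whole numbers, rather than decimals, before the pattern is applied.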