Introduction to Python II. Data processing

In this notebook, we learn the basics of data processing. Data are usually stored in two-dimensional arrays, such as an Excel spreadsheet (.xlsx). We often use the comma-separated values (CSV) format, in which the entries of each row are simply separated by commas.

The Python library to load data is pandas. If the module is not installed on your computer, you may install it by typing in a terminal: pip3 install pandas

A useful cheat-sheet can be found here.

1. Series

The first important class in pandas is Series. A series is composed of values and of labels attached to these values (the index).

Compare for instance the following code:
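A minimal sketch of such a comparison, with made-up values: a plain list only knows the position of each entry, while a Series carries its own labels, which pandas uses for the x-axis when plotting.

```python
import pandas as pd

values = [0.3, 0.5, 0.2]  # made-up numbers, for illustration only

# A plain list is indexed by position: 0, 1, 2.
print(values[1])

# A Series attaches a label to each value.
s = pd.Series(values, index=["a", "b", "c"])
print(s["b"])         # access by label
print(list(s.index))  # the labels themselves

# s.plot() would put these labels on the x-axis (this requires matplotlib).
```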

As one can see, the labels on the x-axis have changed, and are much more meaningful when handling series. The following operations are possible:
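A few such operations, sketched on a small made-up series. The selection is illustrative and not exhaustive.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])  # made-up values

doubled = 2 * s                           # arithmetic is element-wise and keeps the labels
subset = s[["a", "c"]]                    # select several entries by label
large = s[s > 1.5]                        # select with a boolean mask
total = s.sum()                           # reductions: sum, mean, max, ...
ordered = s.sort_values(ascending=False)  # sort by value, labels follow

print(doubled)
print(large)
print(ordered)
```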

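WARNING: adding two series matches the values by label, not by position. A label that appears in only one of the two series gives NaN, so the sum is only meaningful when the series have the same labels.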
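To see what happens with mismatched labels, here is a small sketch with made-up values. The use of Series.add with fill_value is one possible workaround, not necessarily the one intended by the notebook.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0], index=["a", "b"])
s2 = pd.Series([10.0, 20.0], index=["b", "c"])

# Values are matched by label; "a" and "c" exist in only one series.
total = s1 + s2
print(total)

# Treat missing entries as 0 instead of producing NaN.
filled = s1.add(s2, fill_value=0)
print(filled)
```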

Exercise: frequency of English letters

Let us try what we learned with a simple example. The file ./USconstitution.txt contains the American constitution.

Plot the frequency of each (lowercase) letter from a to z.

with open("namefile.txt", 'r') as file:  # 'r' is for read
    text = file.read()

It would be a good idea to sort the results before displaying them... Remember, the internet is your friend here!
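One possible way to approach the counting and sorting steps, sketched on a short in-memory string rather than on the file. The helper name letter_frequencies and the sample sentence are hypothetical.

```python
import string
import pandas as pd

def letter_frequencies(text):
    """Return a Series with the relative frequency of each letter a to z."""
    letters = list(string.ascii_lowercase)
    counts = pd.Series([text.count(c) for c in letters], index=letters)
    return counts / counts.sum()

# Stand-in for the content returned by file.read().
sample = "we the people of the united states"
freq = letter_frequencies(sample)

# Sort before displaying, most frequent letters first.
top = freq.sort_values(ascending=False).head(3)
print(top)
```

Calling freq.plot() (or freq.sort_values().plot()) would then display the result with the letters on the x-axis.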

1.2 DataFrames

The second important type in pandas is DataFrame. It is a list of series, or rather a dictionary of series sharing the same index, each series being one labelled column.

For instance, one could compare the statistics of the letters for different languages.

Exercise: The file FRconstitution.txt contains the French constitution. Record the US series in freqUS, and create the corresponding French series in freqFR.

We can now create our DataFrame. First, we create a dictionary (to record the column labels), and we create our table as follows.
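A minimal sketch of this construction. The frequencies are made up and restricted to three letters; in the notebook, freqUS and freqFR would come from the two constitution files.

```python
import pandas as pd

letters = ["a", "b", "c"]
freqUS = pd.Series([0.5, 0.2, 0.3], index=letters)  # made-up values
freqFR = pd.Series([0.6, 0.1, 0.3], index=letters)  # made-up values

# The dictionary keys become the column labels.
columns = {"US": freqUS, "FR": freqFR}
df = pd.DataFrame(columns)
print(df)
```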

One can display the table by putting the variable name at the end of a cell (better than print here).

One can plot the information column by column with plot. Note how the labels are nicely taken into account.

Exercise: plot the previous statistics with "bars" (see the documentation of plot).

Now, we would like to add the frequencies for Spain (in the file ESconstitution.txt). We can add a column as follows (the syntax is similar to that of dictionaries).

The row labels are called the index (as for Series), while the column labels are called columns.
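For instance, with a made-up table and a made-up series freqES standing in for the Spanish frequencies:

```python
import pandas as pd

letters = ["a", "b", "c"]
df = pd.DataFrame({"US": [0.5, 0.2, 0.3], "FR": [0.6, 0.1, 0.3]}, index=letters)
freqES = pd.Series([0.7, 0.1, 0.2], index=letters)  # made-up values

# Same syntax as adding a key to a dictionary; rows are matched by label.
df["ES"] = freqES
print(df)
```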

Exercise: The file unknownText.txt is written in an unknown language. Using the previous frequencies, guess the language (Hint: compute for instance the norm of | freqTXT - freqXX |...)

A concrete example

Let us find some real data to practice. We can for instance get Covid data. The up-to-date data can be found on GitHub here (they are from Johns Hopkins University).

Exercise: Find the .csv link for the raw data. We want the worldwide confirmed cases.
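The hint can be sketched as follows, assuming the Euclidean norm and using made-up frequencies for three letters only. In the notebook, freqTXT would be computed from unknownText.txt.

```python
import pandas as pd

letters = ["a", "b", "c"]
df = pd.DataFrame(
    {"US": [0.5, 0.2, 0.3], "FR": [0.6, 0.1, 0.3], "ES": [0.7, 0.1, 0.2]},
    index=letters,
)  # made-up reference frequencies
freqTXT = pd.Series([0.58, 0.12, 0.30], index=letters)  # made-up unknown text

# Euclidean norm of (column - freqTXT), computed for every column at once.
distances = df.sub(freqTXT, axis=0).pow(2).sum().pow(0.5)
print(distances)

# The closest reference language is the best guess.
guess = distances.idxmin()
print(guess)
```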

We then load the data with read_csv.

As expected we obtain a DataFrame. Let us have a look at it (we only print the first 10 rows).

We can extract Series with simple commands. For instance, the list of countries:
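A sketch of the loading step on a tiny made-up excerpt, so that it runs without network access. The column layout is assumed to follow the Johns Hopkins time-series file (Province/State, Country/Region, Lat, Long, then one column per date); the numbers are invented.

```python
import io
import pandas as pd

# Made-up excerpt; the real file has many more rows and date columns.
csv_text = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20
,France,46.2,2.2,0,0,2
Martinique,France,14.6,-61.0,0,0,0
,Spain,40.4,-3.7,0,0,0
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO.
df = pd.read_csv(io.StringIO(csv_text))

print(type(df))
print(df.head(10))  # at most the first 10 rows
```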

Let us clean this DataFrame (and learn throughout this process). For instance, assume we are only interested in the data for France. We can extract a sub-DataFrame as follows:
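A minimal sketch of the column extraction and of the row selection, on a made-up table whose column names are assumed to match the Johns Hopkins file:

```python
import pandas as pd

# Made-up data with the assumed column names.
df = pd.DataFrame({
    "Province/State": [None, "Martinique", None],
    "Country/Region": ["France", "France", "Spain"],
    "1/22/20": [0, 0, 0],
    "1/23/20": [2, 0, 1],
})

countries = df["Country/Region"]   # one column, returned as a Series
is_france = countries == "France"  # boolean Series, one entry per row
df_france = df[is_france]          # keep only the rows where the mask is True
print(df_france)
```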

Let us do some more cleaning.
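The exact cleaning steps depend on what one wants to keep. One plausible sequence, sketched on made-up data, is to keep the row without province, drop the non-date columns, and turn the date strings into real dates. The column names and the date format are assumptions about the file layout.

```python
import pandas as pd

# Made-up rows for France, with the assumed column names.
df_france = pd.DataFrame({
    "Province/State": [None, "Martinique"],
    "Country/Region": ["France", "France"],
    "Lat": [46.2, 14.6],
    "Long": [2.2, -61.0],
    "1/22/20": [0, 0],
    "1/23/20": [2, 0],
})

# Keep the row with no province (the mainland entry).
mainland = df_france[df_france["Province/State"].isna()]

# Drop the columns that are not dates.
cleaned = mainland.drop(columns=["Province/State", "Country/Region", "Lat", "Long"])

# A single row becomes a Series indexed by the former column names.
series = cleaned.iloc[0].copy()
series.index = pd.to_datetime(series.index, format="%m/%d/%y")
print(series)
```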

This is much better. Finally, we plot the results.

Exercise: Write a function get_series(nameCountry) which returns the time series for a given country name. The function should sum over the provinces (if any).
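One possible sketch of such a function. Unlike the signature in the exercise, this version takes the DataFrame as an explicit first argument so that it is self-contained; the column names and the test data are assumptions.

```python
import pandas as pd

def get_series(df, nameCountry):
    """Return the time series of a country, summed over its provinces."""
    rows = df[df["Country/Region"] == nameCountry]
    dates = rows.drop(columns=["Province/State", "Country/Region", "Lat", "Long"])
    series = dates.sum()  # column-wise sum, i.e. over the provinces
    series.index = pd.to_datetime(series.index, format="%m/%d/%y")
    return series

# Made-up data with the assumed column names.
df = pd.DataFrame({
    "Province/State": [None, "Martinique", None],
    "Country/Region": ["France", "France", "Spain"],
    "Lat": [46.2, 14.6, 40.4],
    "Long": [2.2, -61.0, -3.7],
    "1/22/20": [0, 0, 0],
    "1/23/20": [2, 1, 4],
})

france = get_series(df, "France")
print(france)
```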

And we plot the results

Exercise: Display the daily cases for France. What do you think of this curve?

Exercise: Using the polyfit method, find the best polynomial of degree d which approximates the cumulative cases for France. Take $d \in \{1, 2, 3, 4, 5, 6\}$ and display the results with subplots.
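Since the file records cumulative counts, the daily cases are the differences between consecutive days. A sketch with made-up numbers, using Series.diff:

```python
import pandas as pd

dates = pd.date_range("2020-03-01", periods=5, freq="D")
cumulative = pd.Series([10, 12, 15, 15, 21], index=dates)  # made-up counts

# Difference with the previous day; the first entry has no predecessor.
daily = cumulative.diff()
print(daily)
```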
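The fitting step can be sketched with numpy.polyfit and numpy.polyval on made-up data (an exactly quadratic curve, so the degree-2 fit should be essentially perfect). Plotting with subplots is left out here.

```python
import numpy as np

x = np.arange(10)                 # day number
y = 3.0 * x**2 + 2.0 * x + 1.0    # made-up cumulative counts

errors = {}
for d in range(1, 7):
    coeffs = np.polyfit(x, y, d)      # coefficients, highest degree first
    fitted = np.polyval(coeffs, x)    # evaluate the fitted polynomial
    errors[d] = np.max(np.abs(fitted - y))

for d, err in errors.items():
    print(d, err)
```

With real data, x could be the number of days since the first date and y the values of the series for France.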