Pandas dataframes
Reading tabular data into dataframes
First, let’s download the data, if you have not done this already in the previous chapter. Open a terminal inside your Jupyter dashboard. Inside the terminal, type:
wget http://bit.ly/pythfiles -O pfiles.zip
unzip pfiles.zip && rm pfiles.zip # this should unpack into the directory data-python/
You can now close the terminal panel. Let’s switch back to our Python notebook and check our location:
%pwd # run `pwd` bash command
%ls # make sure you see data-python/
Pandas is a widely-used Python library for working with tabular data, borrows heavily from R’s dataframes, built on top of numpy. We will be reading the data we downloaded a minute ago into a pandas dataframe:
import pandas as pd
data = pd.read_csv('data-python/gapminder_gdp_oceania.csv')
print(data)
data # this prints out the table nicely in Jupyter Notebook!
data.shape # shape is a *member variable inside data*
data.info() # info is a *member method inside data*
Use dir(data) to list all member variables and methods. Then call that name without (), and if it’s a method it’ll tell you, so you’ll need to use ().
Rows are observations, and columns are the observed variables. You can add new observations at any time.
Currently the rows are indexed by number. Let’s index by country:
data = pd.read_csv('data-python/gapminder_gdp_oceania.csv', index_col='country')
data
data.shape # now 12 columns
data.info() # it's a dataframe! show row/column names, precision, memory usage
print(data.columns) # will list all the columns
print(data.T) # this will transpose the dataframe; curously this is a variable
data.describe() # will print some statistics of numerical columns (very useful for 1000s of rows!)
Question 12a
Quick question: how would you list all country names?
Hint: try data.T.columns
Question 12b
Read the data ingapminder_gdp_americas.csv
(which should be in the same directory as gapminder_gdp_oceania.csv
)
into a variable called americas
and display its summary statistics.
Question 13
Write a command to display the first three rows of theamericas
data frame. What about the last three columns of this
data frame?
Question 14
The data for your current project is stored in a file called microbes.csv
, which is located in a folder called
field_data
. You are doing analysis in a notebook called analysis.ipynb
in a sibling folder called thesis
:
your_home_directory/
+-- fieldData/
+-- microbes.csv
+-- thesis/
+-- analysis.ipynb
What value(s) should you pass to read_csv()
to read microbes.csv
in analysis.ipynb
?
Question 15
As well as theread_csv()
function for reading data from a file, Pandas provides a to_csv()
function to write data
frames to files. Applying what you’ve learned about reading from files, write one of your data frames to a file called
processed.csv
. You can use help to get information on how to use to_csv()
.
Subsetting
data = pd.read_csv('data-python/gapminder_gdp_europe.csv', index_col='country')
data.head()
Let’s rename the first column:
data.rename(columns={'gdpPercap_1952': 'y1952'}) # this renames only one but does not change `data`
Note: we could also name the column ‘1952’, but some Pandas operations don’t work with purely numerical column names.
Let’s go through all columns and assign the new names:
for col in data.columns:
print(col, col[-4:])
data = data.rename(columns={col: 'y'+col[-4:]})
data
Pandas lets you subset elements using either their numerical indices or their row/column names. Long time ago Pandas
used to have a single function to do both. Now there are two separate functions, iloc()
and loc()
. Let’s print one
element:
data.iloc[0,0] # the very first element by position
data.loc['Albania','y1952'] # exactly the same; the very first element by label
Printing a row:
data.loc['Albania',:] # usual Python's slicing notation - show all columns in that row
data.loc['Albania'] # exactly the same
data.loc['Albania',] # exactly the same
Printing a column:
data.loc[:,'y1952'] # show all rows in that column
data['y1952'] # exactly the same; single index refers to columns
data.y1952 # most compact notation; does not work with numerical-only names
Printing a range:
data.loc['Italy':'Poland','y1952':'y1967'] # select multiple rows/columns
data.iloc[0:2,0:3]
Result of slicing can be used in further operations:
data.loc['Italy':'Poland','y1952':'y1967'].max() # max for each column
data.loc['Italy':'Poland','y1952':'y1967'].min() # min for each column
Use comparisons to select data based on value:
subset = data.loc['Italy':'Poland', 'y1962':'y1972']
print(subset)
print(subset > 1e4)
Use a Boolean mask to print values (meeting the condition) or NaN (not meeting the condition):
mask = (subset > 1e4)
print(mask)
print(subset[mask]) # will print numerical values only if the corresponding elements in mask are True
NaN’s are ignored by statistical operations which is handy:
subset[mask].describe()
subset[mask].max()
Question 16
Assume Pandas has been imported into your notebook and the Gapminder GDP data for Europe has been loaded:
df = pd.read_csv('data-python/gapminder_gdp_europe.csv', index_col='country')
Write an expression to find the per capita GDP of Serbia in 2007.
Question 17
Explain what each line in the following short program does, e.g. what is in the variables first
, second
, …:
first = pd.read_csv('data-python/gapminder_all.csv', index_col='country')
second = first[first['continent'] == 'Americas']
third = second.drop('Puerto Rico')
fourth = third.drop('continent', axis = 1)
fourth.to_csv('result.csv')
Question 18
Explain in simple terms what idxmin()
and idxmax()
do in the short program below. When would you use these methods?
data = pd.read_csv('data-python/gapminder_gdp_europe.csv', index_col='country')
print(data.idxmin())
print(data.idxmax())
How do you create a dataframe from scratch? Many ways; the easiest by defining columns:
col1 = [1,2,3]
col2 = [4,5,6]
pd.DataFrame({'a': col1, 'b': col2}) # dataframe from a dictionary
Let’s index the rows by hand:
pd.DataFrame({'a': col1, 'b': col2}, index=['a1','a2','a3'])
Looping over data sets
Let’s say we want to read several files in data-python/. We can use for to loop through their list:
for filename in ['data-python/gapminder_gdp_africa.csv', 'data-python/gapminder_gdp_asia.csv']:
data = pd.read_csv(filename, index_col='country')
print(filename, data.min()) # print min for each column
If we have many (10s or 100s) files, we want to specify them with a pattern:
from glob import glob
print('all csv files in data-python:', glob('data-python/*.csv')) # returns a list
print('all text files in data-python:', glob('data-python/*.txt')) # empty list
list = glob('data-python/*.csv')
len(list)
for filename in glob('data-python/gapminder*.csv'):
data = pd.read_csv(filename)
print(filename, data.gdpPercap_1952.min())
Question 19
Which of these files is not matched by the expression glob('data/*as*.csv')
?
A. data/gapminder_gdp_africa.csv
B. data/gapminder_gdp_americas.csv
C. data/gapminder_gdp_asia.csv
D. 1 and 2 are not matched
Question 20
Modify this program so that it prints the number of records in the file that has the fewest records.
fewest = ____
for filename in glob('data/*.csv'):
fewest = ____
print('smallest file has', fewest, 'records')