Reading arguments¶
Watch it
See the accompanied youtube video at the link here.
Arguments¶
When we load in our data we use different arguments to make sure it’s organized how we want it.
delimiter
is an argument we have already discussed that instructs on
how to separate each value in the data.
This is only the tip of the iceberg.
There are many others that are helpful when reading in our data, such as
index_col
, header
, nrows
, and usecols
.
Here, we are going to introduce different arguments for pd.read_csv()
and pd.read_excel()
:
index_col
header
nrows
usecols
If you wish to know more, you can find the documentation at the following links:
index_col¶
index_col
is an argument that indicates which column will be
acting as the index label.
In most of the cases we have encountered, we did not use this argument and instead relied on the pandas default, which is to use ascending integers for the index.
We can, however, specify a column in the data to become the index.
It’s in our best interest that the column we choose have unique values.
For our cereal.csv
let’s specify the name
column as our index.
df = pd.read_csv('cereal.csv', index_col="name")
df.head()
mfr | type | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | shelf | weight | cups | rating | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
name | |||||||||||||||
100% Bran | N | Cold | 70 | 4 | 1 | 130 | 10.0 | 5.0 | 6 | 280 | 25 | 3 | 1.0 | 0.33 | 68.402973 |
100% Natural Bran | Q | Cold | 120 | 3 | 5 | 15 | 2.0 | 8.0 | 8 | 135 | 0 | 3 | 1.0 | 1.00 | 33.983679 |
All-Bran | K | Cold | 70 | 4 | 1 | 260 | 9.0 | 7.0 | 5 | 320 | 25 | 3 | 1.0 | 0.33 | 59.425505 |
All-Bran with Extra Fiber | K | Cold | 50 | 4 | 0 | 140 | 14.0 | 8.0 | 0 | 330 | 25 | 3 | 1.0 | 0.50 | 93.704912 |
Almond Delight | R | Cold | 110 | 2 | 2 | 200 | 1.0 | 14.0 | 8 | 1 | 25 | 3 | 1.0 | 0.75 | 34.384843 |
The index_col
argument also take in positions.
The name
column in our data is in the 0th position so we can also
specify the index like we show here with index_col=0
.
df = pd.read_csv('cereal.csv', index_col=0)
df.head()
mfr | type | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | shelf | weight | cups | rating | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
name | |||||||||||||||
100% Bran | N | Cold | 70 | 4 | 1 | 130 | 10.0 | 5.0 | 6 | 280 | 25 | 3 | 1.0 | 0.33 | 68.402973 |
100% Natural Bran | Q | Cold | 120 | 3 | 5 | 15 | 2.0 | 8.0 | 8 | 135 | 0 | 3 | 1.0 | 1.00 | 33.983679 |
All-Bran | K | Cold | 70 | 4 | 1 | 260 | 9.0 | 7.0 | 5 | 320 | 25 | 3 | 1.0 | 0.33 | 59.425505 |
All-Bran with Extra Fiber | K | Cold | 50 | 4 | 0 | 140 | 14.0 | 8.0 | 0 | 330 | 25 | 3 | 1.0 | 0.50 | 93.704912 |
Almond Delight | R | Cold | 110 | 2 | 2 | 200 | 1.0 | 14.0 | 8 | 1 | 25 | 3 | 1.0 | 0.75 | 34.384843 |
header¶
We have been lucky up until now that all the data we have loaded in has been particularly straightforward.
Sometimes with data, there are a few lines of text explaining important points about the file.
We do not want to include this in our dataframe and therefore we need to specify exactly when our dataframe begins.
This is where header
comes in.
Take a look at
candybars-h.csv
below as an example.

If we look at the data with a regular text editor, the data doesn’t start until the 3rd line which would be the equivalent of position 2 (since we begin counting from 0).
If we load this dataset without any arguments, we get this as the output.
candybars = pd.read_csv('candybars-h.csv')
candybars
This dataset was created by Hayley Boyce in February 2020. | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | Unnamed: 4 | Unnamed: 5 | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Unnamed: 9 | Unnamed: 10 | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Note this is not a complete dataset and there ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | name | weight | chocolate | peanuts | caramel | nougat | cookie_wafer_rice | coconut | white_chocolate | multi | available_canada_america |
2 | Coffee Crisp | 50 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | Canada |
3 | Butterfinger | 184 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | America |
4 | Skor | 39 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | Both |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
22 | Take 5 | 43 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | America |
23 | Whatchamacallits | 45 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | America |
24 | Almond Joy | 46 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | America |
25 | Oh Henry | 51 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | Both |
26 | Cookies and Cream | 43 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | Both |
27 rows × 11 columns
We see that there are no clear column names and things are quite a mess!
If we use header=2
to indicate that the data actually begins at
position 2, then things start to look much better.
candybars = pd.read_csv('candybars-h.csv', header=2)
candybars
name | weight | chocolate | peanuts | caramel | nougat | cookie_wafer_rice | coconut | white_chocolate | multi | available_canada_america | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Coffee Crisp | 50 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | Canada |
1 | Butterfinger | 184 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | America |
2 | Skor | 39 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | Both |
3 | Smarties | 45 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | Canada |
4 | Twix | 58 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | Both |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
20 | Take 5 | 43 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | America |
21 | Whatchamacallits | 45 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | America |
22 | Almond Joy | 46 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | America |
23 | Oh Henry | 51 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | Both |
24 | Cookies and Cream | 43 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | Both |
25 rows × 11 columns
nrows¶
nrows
is an argument in pd.read_csv()
that is useful when you only
want to load in part of the dataframe.
Perhaps the file you have is large and you only want a sample of it.
nrows
will limit the number of rows that you read in.
candybars = pd.read_csv('candybars.csv', nrows=7)
candybars
name | weight | chocolate | peanuts | caramel | nougat | cookie_wafer_rice | coconut | white_chocolate | multi | available_canada_america | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Coffee Crisp | 50 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | Canada |
1 | Butterfinger | 184 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | America |
2 | Skor | 39 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | Both |
3 | Smarties | 45 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | Canada |
4 | Twix | 58 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | Both |
5 | Reeses Peanutbutter Cups | 43 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | Both |
6 | 3 Musketeers | 54 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | America |
This code loads in only the first 7 rows of our candybar dataset.
usecols¶
Similarly to how nrows
specifies how many rows to read in, usecols
selects which columns to load from the data.
Perhaps the only columns relevant to our analysis are the columns
name
, weight
and available_canada_america
.
We can forgo the other columns when reading the data in.
In a similar way to selecting columns using .iloc[]
, we specify the
desired column indices in square brackets.
candybars = pd.read_csv('candybars.csv', usecols=[0, 1, 10])
candybars
name | weight | available_canada_america | |
---|---|---|---|
0 | Coffee Crisp | 50 | Canada |
1 | Butterfinger | 184 | America |
2 | Skor | 39 | Both |
3 | Smarties | 45 | Canada |
4 | Twix | 58 | Both |
... | ... | ... | ... |
20 | Take 5 | 43 | America |
21 | Whatchamacallits | 45 | America |
22 | Almond Joy | 46 | America |
23 | Oh Henry | 51 | Both |
24 | Cookies and Cream | 43 | Both |
25 rows × 3 columns
The usecols
argument accepts either index positions or labels so we
could also use the column names in square brackets as shown here.
candybars = pd.read_csv('candybars.csv', usecols=['name', 'weight', 'available_canada_america'])
candybars
name | weight | available_canada_america | |
---|---|---|---|
0 | Coffee Crisp | 50 | Canada |
1 | Butterfinger | 184 | America |
2 | Skor | 39 | Both |
3 | Smarties | 45 | Canada |
4 | Twix | 58 | Both |
... | ... | ... | ... |
20 | Take 5 | 43 | America |
21 | Whatchamacallits | 45 | America |
22 | Almond Joy | 46 | America |
23 | Oh Henry | 51 | Both |
24 | Cookies and Cream | 43 | Both |
25 rows × 3 columns
Let’s apply what we learned!
1.Which argument will assign the index when reading in your data with pd.read_excel()
?
a) header
b) ncols
c) index_col
2. Which argument will select only specific columns of the data file with pd.read_csv()
?
a) header
b) nrows
c) usecols
Solutions!
c)
index_col
c)
usecols