Reading arguments

Watch it

See the accompanied youtube video at the link here.

Arguments

When we load in our data we use different arguments to make sure it’s organized how we want it.

delimiter is an argument we have already discussed that instructs on how to separate each value in the data.

This is only the tip of the iceberg.

There are many others that are helpful when reading in our data, such as index_col, header, nrows, and usecols.

Here, we are going to introduce different arguments for pd.read_csv() and pd.read_excel():

  • index_col

  • header

  • nrows

  • usecols

If you wish to know more, you can find the documentation at the following links:

index_col

index_col is an argument that indicates which column will be acting as the index label.

In most of the cases we have encountered, we did not use this argument and instead relied on the pandas default, which is to use ascending integers for the index.

We can, however, specify a column in the data to become the index.

It’s in our best interest that the column we choose have unique values.

For our cereal.csv let’s specify the name column as our index.

df = pd.read_csv('cereal.csv', index_col="name")
df.head()
mfr type calories protein fat sodium fiber carbo sugars potass vitamins shelf weight cups rating
name
100% Bran N Cold 70 4 1 130 10.0 5.0 6 280 25 3 1.0 0.33 68.402973
100% Natural Bran Q Cold 120 3 5 15 2.0 8.0 8 135 0 3 1.0 1.00 33.983679
All-Bran K Cold 70 4 1 260 9.0 7.0 5 320 25 3 1.0 0.33 59.425505
All-Bran with Extra Fiber K Cold 50 4 0 140 14.0 8.0 0 330 25 3 1.0 0.50 93.704912
Almond Delight R Cold 110 2 2 200 1.0 14.0 8 1 25 3 1.0 0.75 34.384843

The index_col argument also take in positions.

The name column in our data is in the 0th position so we can also specify the index like we show here with index_col=0.

df = pd.read_csv('cereal.csv', index_col=0)
df.head()
mfr type calories protein fat sodium fiber carbo sugars potass vitamins shelf weight cups rating
name
100% Bran N Cold 70 4 1 130 10.0 5.0 6 280 25 3 1.0 0.33 68.402973
100% Natural Bran Q Cold 120 3 5 15 2.0 8.0 8 135 0 3 1.0 1.00 33.983679
All-Bran K Cold 70 4 1 260 9.0 7.0 5 320 25 3 1.0 0.33 59.425505
All-Bran with Extra Fiber K Cold 50 4 0 140 14.0 8.0 0 330 25 3 1.0 0.50 93.704912
Almond Delight R Cold 110 2 2 200 1.0 14.0 8 1 25 3 1.0 0.75 34.384843

nrows

nrows is an argument in pd.read_csv() that is useful when you only want to load in part of the dataframe.

Perhaps the file you have is large and you only want a sample of it.

nrows will limit the number of rows that you read in.

candybars = pd.read_csv('candybars.csv', nrows=7)
candybars
name weight chocolate peanuts caramel nougat cookie_wafer_rice coconut white_chocolate multi available_canada_america
0 Coffee Crisp 50 1 0 0 0 1 0 0 0 Canada
1 Butterfinger 184 1 1 1 0 0 0 0 0 America
2 Skor 39 1 0 1 0 0 0 0 0 Both
3 Smarties 45 1 0 0 0 0 0 0 1 Canada
4 Twix 58 1 0 1 0 1 0 0 1 Both
5 Reeses Peanutbutter Cups 43 1 1 0 0 0 0 0 1 Both
6 3 Musketeers 54 1 0 0 1 0 0 0 0 America

This code loads in only the first 7 rows of our candybar dataset.

usecols

Similarly to how nrows specifies how many rows to read in, usecols selects which columns to load from the data.

Perhaps the only columns relevant to our analysis are the columns name, weight and available_canada_america.

We can forgo the other columns when reading the data in.

In a similar way to selecting columns using .iloc[], we specify the desired column indices in square brackets.

candybars = pd.read_csv('candybars.csv', usecols=[0, 1, 10])
candybars
name weight available_canada_america
0 Coffee Crisp 50 Canada
1 Butterfinger 184 America
2 Skor 39 Both
3 Smarties 45 Canada
4 Twix 58 Both
... ... ... ...
20 Take 5 43 America
21 Whatchamacallits 45 America
22 Almond Joy 46 America
23 Oh Henry 51 Both
24 Cookies and Cream 43 Both

25 rows × 3 columns

The usecols argument accepts either index positions or labels so we could also use the column names in square brackets as shown here.

candybars = pd.read_csv('candybars.csv', usecols=['name', 'weight', 'available_canada_america'])
candybars
name weight available_canada_america
0 Coffee Crisp 50 Canada
1 Butterfinger 184 America
2 Skor 39 Both
3 Smarties 45 Canada
4 Twix 58 Both
... ... ... ...
20 Take 5 43 America
21 Whatchamacallits 45 America
22 Almond Joy 46 America
23 Oh Henry 51 Both
24 Cookies and Cream 43 Both

25 rows × 3 columns

Let’s apply what we learned!

1.Which argument will assign the index when reading in your data with pd.read_excel()?
a) header
b) ncols
c) index_col

2. Which argument will select only specific columns of the data file with pd.read_csv()?
a) header
b) nrows
c) usecols