Reading arguments¶

Watch it

See the accompanied youtube video at the link here.

Arguments¶

When we load in our data we use different arguments to make sure it’s organized how we want it.

delimiter is an argument we have already discussed that instructs on how to separate each value in the data.

This is only the tip of the iceberg.

There are many others that are helpful when reading in our data, such as index_col, header, nrows, and usecols.

Here, we are going to introduce different arguments for pd.read_csv() and pd.read_excel():

index_col
header
nrows
usecols

If you wish to know more, you can find the documentation at the following links:

index_col¶

index_col is an argument that indicates which column will be acting as the index label.

In most of the cases we have encountered, we did not use this argument and instead relied on the pandas default, which is to use ascending integers for the index.

We can, however, specify a column in the data to become the index.

It’s in our best interest that the column we choose have unique values.

For our cereal.csv let’s specify the name column as our index.

df = pd.read_csv('cereal.csv', index_col="name")
df.head()

	mfr	type	calories	protein	fat	sodium	fiber	carbo	sugars	potass	vitamins	shelf	weight	cups	rating
name
100% Bran	N	Cold	70	4	1	130	10.0	5.0	6	280	25	3	1.0	0.33	68.402973
100% Natural Bran	Q	Cold	120	3	5	15	2.0	8.0	8	135	0	3	1.0	1.00	33.983679
All-Bran	K	Cold	70	4	1	260	9.0	7.0	5	320	25	3	1.0	0.33	59.425505
All-Bran with Extra Fiber	K	Cold	50	4	0	140	14.0	8.0	0	330	25	3	1.0	0.50	93.704912
Almond Delight	R	Cold	110	2	2	200	1.0	14.0	8	1	25	3	1.0	0.75	34.384843

The index_col argument also take in positions.

The name column in our data is in the 0th position so we can also specify the index like we show here with index_col=0.

df = pd.read_csv('cereal.csv', index_col=0)
df.head()

	mfr	type	calories	protein	fat	sodium	fiber	carbo	sugars	potass	vitamins	shelf	weight	cups	rating
name
100% Bran	N	Cold	70	4	1	130	10.0	5.0	6	280	25	3	1.0	0.33	68.402973
100% Natural Bran	Q	Cold	120	3	5	15	2.0	8.0	8	135	0	3	1.0	1.00	33.983679
All-Bran	K	Cold	70	4	1	260	9.0	7.0	5	320	25	3	1.0	0.33	59.425505
All-Bran with Extra Fiber	K	Cold	50	4	0	140	14.0	8.0	0	330	25	3	1.0	0.50	93.704912
Almond Delight	R	Cold	110	2	2	200	1.0	14.0	8	1	25	3	1.0	0.75	34.384843

header¶

We have been lucky up until now that all the data we have loaded in has been particularly straightforward.

Sometimes with data, there are a few lines of text explaining important points about the file.

We do not want to include this in our dataframe and therefore we need to specify exactly when our dataframe begins.

This is where header comes in.

Take a look at candybars-h.csv below as an example.

If we look at the data with a regular text editor, the data doesn’t start until the 3rd line which would be the equivalent of position 2 (since we begin counting from 0).

If we load this dataset without any arguments, we get this as the output.

candybars = pd.read_csv('candybars-h.csv')
candybars

	This dataset was created by Hayley Boyce in February 2020.	Unnamed: 1	Unnamed: 2	Unnamed: 3	Unnamed: 4	Unnamed: 5	Unnamed: 6	Unnamed: 7	Unnamed: 8	Unnamed: 9	Unnamed: 10
0	Note this is not a complete dataset and there ...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	name	weight	chocolate	peanuts	caramel	nougat	cookie_wafer_rice	coconut	white_chocolate	multi	available_canada_america
2	Coffee Crisp	50	1	0	0	0	1	0	0	0	Canada
3	Butterfinger	184	1	1	1	0	0	0	0	0	America
4	Skor	39	1	0	1	0	0	0	0	0	Both
...	...	...	...	...	...	...	...	...	...	...	...
22	Take 5	43	1	1	1	0	1	0	0	0	America
23	Whatchamacallits	45	1	1	0	0	1	0	0	0	America
24	Almond Joy	46	1	0	0	0	0	1	0	0	America
25	Oh Henry	51	1	1	1	0	0	0	0	0	Both
26	Cookies and Cream	43	0	0	0	0	1	0	1	0	Both

27 rows × 11 columns

We see that there are no clear column names and things are quite a mess!

If we use header=2 to indicate that the data actually begins at position 2, then things start to look much better.

candybars = pd.read_csv('candybars-h.csv', header=2)
candybars

	name	weight	chocolate	peanuts	caramel	nougat	cookie_wafer_rice	coconut	white_chocolate	multi	available_canada_america
0	Coffee Crisp	50	1	0	0	0	1	0	0	0	Canada
1	Butterfinger	184	1	1	1	0	0	0	0	0	America
2	Skor	39	1	0	1	0	0	0	0	0	Both
3	Smarties	45	1	0	0	0	0	0	0	1	Canada
4	Twix	58	1	0	1	0	1	0	0	1	Both
...	...	...	...	...	...	...	...	...	...	...	...
20	Take 5	43	1	1	1	0	1	0	0	0	America
21	Whatchamacallits	45	1	1	0	0	1	0	0	0	America
22	Almond Joy	46	1	0	0	0	0	1	0	0	America
23	Oh Henry	51	1	1	1	0	0	0	0	0	Both
24	Cookies and Cream	43	0	0	0	0	1	0	1	0	Both

25 rows × 11 columns

nrows¶

nrows is an argument in pd.read_csv() that is useful when you only want to load in part of the dataframe.

Perhaps the file you have is large and you only want a sample of it.

nrows will limit the number of rows that you read in.

candybars = pd.read_csv('candybars.csv', nrows=7)
candybars

	name	weight	chocolate	peanuts	caramel	nougat	cookie_wafer_rice	multi	available_canada_america
0	Coffee Crisp	50	1	0	0	0	1	0	Canada
1	Butterfinger	184	1	1	1	0	0	0	America
2	Skor	39	1	0	1	0	0	0	Both
3	Smarties	45	1	0	0	0	0	1	Canada
4	Twix	58	1	0	1	0	1	1	Both
5	Reeses Peanutbutter Cups	43	1	1	0	0	0	1	Both
6	3 Musketeers	54	1	0	0	1	0	0	America

This code loads in only the first 7 rows of our candybar dataset.

usecols¶

Similarly to how nrows specifies how many rows to read in, usecols selects which columns to load from the data.

Perhaps the only columns relevant to our analysis are the columns name, weight and available_canada_america.

We can forgo the other columns when reading the data in.

In a similar way to selecting columns using .iloc[], we specify the desired column indices in square brackets.

candybars = pd.read_csv('candybars.csv', usecols=[0, 1, 10])
candybars

	name	weight	available_canada_america
0	Coffee Crisp	50	Canada
1	Butterfinger	184	America
2	Skor	39	Both
3	Smarties	45	Canada
4	Twix	58	Both
...	...	...	...
20	Take 5	43	America
21	Whatchamacallits	45	America
22	Almond Joy	46	America
23	Oh Henry	51	Both
24	Cookies and Cream	43	Both

25 rows × 3 columns

The usecols argument accepts either index positions or labels so we could also use the column names in square brackets as shown here.

candybars = pd.read_csv('candybars.csv', usecols=['name', 'weight', 'available_canada_america'])
candybars

	name	weight	available_canada_america
0	Coffee Crisp	50	Canada
1	Butterfinger	184	America
2	Skor	39	Both
3	Smarties	45	Canada
4	Twix	58	Both
...	...	...	...
20	Take 5	43	America
21	Whatchamacallits	45	America
22	Almond Joy	46	America
23	Oh Henry	51	Both
24	Cookies and Cream	43	Both

25 rows × 3 columns

Let’s apply what we learned!

1.Which argument will assign the index when reading in your data with pd.read_excel()?
a) header
b) ncols
c) index_col

2. Which argument will select only specific columns of the data file with pd.read_csv()?
a) header
b) nrows
c) usecols

Solutions!

c) index_col
c) usecols

Programming in Python for Data Science