Slicing Rows using .loc[]¶

Watch it

See the accompanied youtube video at the link here.

Congratulations on writing your first code!

We have read in our data, and we know the dimensions. Well, now what?

Let’s go over how we would index, slice, and select certain columns or rows of our data.

Let’s start by importing pandas and loading in a dataset named cereal.csv, and we will save it as cereal.

import pandas as pd
  
cereal = pd.read_csv('cereal.csv')
cereal.head()

	name	mfr	type	calories	protein	fat	sodium	fiber	carbo	sugars	potass	vitamins	shelf	weight	cups	rating
0	100% Bran	N	Cold	70	4	1	130	10.0	5.0	6	280	25	3	1.0	0.33	68.402973
1	100% Natural Bran	Q	Cold	120	3	5	15	2.0	8.0	8	135	0	3	1.0	1.00	33.983679
2	All-Bran	K	Cold	70	4	1	260	9.0	7.0	5	320	25	3	1.0	0.33	59.425505
3	All-Bran with Extra Fiber	K	Cold	50	4	0	140	14.0	8.0	0	330	25	3	1.0	0.50	93.704912
4	Almond Delight	R	Cold	110	2	2	200	1.0	14.0	8	1	25	3	1.0	0.75	34.384843

Attribution:
“80 Cereals” (c) by Chris Crawford is licensed under Creative Commons Attribution-ShareAlike 3.0 Unported.

We can see all the columns and the first 5 rows of the dataframe using .head()

However, let’s say we only want certain rows of the dataframe or certain columns.

We talked about how .head() will generate the first few rows of a dataframe (5 as default), but what if we wanted the rows from 5-10?

The first column of this dataframe is called the index.

Each row has a label (the index) as well as a position. In this case, the index label of an observation is the same as it’s position.

This doesn’t always have to be the case. We can assign another column as the index; however, we will wait to discuss this in the next module.

Here are the first 15 rows of the dataframe.

cereal.head(15)

	name	mfr	type	calories	protein	fat	sodium	fiber	carbo	sugars	potass	vitamins	shelf	weight	cups	rating
0	100% Bran	N	Cold	70	4	1	130	10.0	5.0	6	280	25	3	1.00	0.33	68.402973
1	100% Natural Bran	Q	Cold	120	3	5	15	2.0	8.0	8	135	0	3	1.00	1.00	33.983679
2	All-Bran	K	Cold	70	4	1	260	9.0	7.0	5	320	25	3	1.00	0.33	59.425505
3	All-Bran with Extra Fiber	K	Cold	50	4	0	140	14.0	8.0	0	330	25	3	1.00	0.50	93.704912
4	Almond Delight	R	Cold	110	2	2	200	1.0	14.0	8	1	25	3	1.00	0.75	34.384843
5	Apple Cinnamon Cheerios	G	Cold	110	2	2	180	1.5	10.5	10	70	25	1	1.00	0.75	29.509541
6	Apple Jacks	K	Cold	110	2	0	125	1.0	11.0	14	30	25	2	1.00	1.00	33.174094
7	Basic 4	G	Cold	130	3	2	210	2.0	18.0	8	100	25	3	1.33	0.75	37.038562
8	Bran Chex	R	Cold	90	2	1	200	4.0	15.0	6	125	25	1	1.00	0.67	49.120253
9	Bran Flakes	P	Cold	90	3	0	210	5.0	13.0	5	190	25	3	1.00	0.67	53.313813
10	Cap'n'Crunch	Q	Cold	120	1	2	220	0.0	12.0	12	35	25	2	1.00	0.75	18.042851
11	Cheerios	G	Cold	110	6	2	290	2.0	17.0	1	105	25	1	1.00	1.25	50.764999
12	Cinnamon Toast Crunch	G	Cold	120	1	3	210	0.0	13.0	9	45	25	2	1.00	0.75	19.823573
13	Clusters	G	Cold	110	3	2	140	2.0	13.0	7	105	25	3	1.00	0.50	40.400208
14	Cocoa Puffs	G	Cold	110	1	1	180	0.0	12.0	13	55	25	2	1.00	1.00	22.736446

Let’s talk about observation 4 which is named Almond Delight. Its index label is 4 as well as it’s index position.

If you just went and counted those again and started screaming “5! It’s the fifth position”, that’s OK. In the Python language, we start counting at position 0 (then 1, 2, 3, and 4 for Almond Delight).

Now let’s say we want all 5 rows past Almond Delight. That means we want the rows from Apple Cinnamon Cheerios to Cap'n'Crunch.

We will use .loc[] with square brackets to cut the dataframe from “Apple Cinnamon Cheerios” to “Cap’n’Crunch,” keeping the columns and everything between.

This code is written as cereal.loc[5:10], and it can be interpreted as “Obtain the rows in the dataframe located from 5 to 10.”

cereal.loc[5:10]

	name	mfr	type	calories	protein	fat	sodium	fiber	carbo	sugars	potass	vitamins	shelf	weight	cups	rating
5	Apple Cinnamon Cheerios	G	Cold	110	2	2	180	1.5	10.5	10	70	25	1	1.00	0.75	29.509541
6	Apple Jacks	K	Cold	110	2	0	125	1.0	11.0	14	30	25	2	1.00	1.00	33.174094
7	Basic 4	G	Cold	130	3	2	210	2.0	18.0	8	100	25	3	1.33	0.75	37.038562
8	Bran Chex	R	Cold	90	2	1	200	4.0	15.0	6	125	25	1	1.00	0.67	49.120253
9	Bran Flakes	P	Cold	90	3	0	210	5.0	13.0	5	190	25	3	1.00	0.67	53.313813
10	Cap'n'Crunch	Q	Cold	120	1	2	220	0.0	12.0	12	35	25	2	1.00	0.75	18.042851

What if we only wanted certain columns now?

Perhaps we were only interested in the calories to fiber columns of the “Apple Cinnamon Cheerios” to “Cap’n’Crunch” rows?

We put in the interval of our desired rows first, and then the columns, and we separate these values with a comma.

.loc[] is used to slice columns and rows by label, and within an interval.

cereal.loc[5:10, 'calories':'fiber']

	calories	protein	fat	sodium	fiber
5	110	2	2	180	1.5
6	110	2	0	125	1.0
7	130	3	2	210	2.0
8	90	2	1	200	4.0
9	90	3	0	210	5.0
10	120	1	2	220	0.0

The general format to slice both rows and columns together looks like this:

cereal.loc['row name start':'row name end', 'column name start':'column name end']

There is a handy shortcut for slices that include the beginning of a dataframe to a specified row or column label or a specified row or column label to the end of a dataframe.

For example if we want all the rows up to “Apple Jacks” which has a label equal to 6, we could omit the first label in the code all together. So we can write cereal.loc[:6].

cereal.loc[:6]

	name	mfr	type	calories	protein	fat	sodium	fiber	carbo	sugars	potass	vitamins	shelf	weight	cups	rating
0	100% Bran	N	Cold	70	4	1	130	10.0	5.0	6	280	25	3	1.0	0.33	68.402973
1	100% Natural Bran	Q	Cold	120	3	5	15	2.0	8.0	8	135	0	3	1.0	1.00	33.983679
2	All-Bran	K	Cold	70	4	1	260	9.0	7.0	5	320	25	3	1.0	0.33	59.425505
3	All-Bran with Extra Fiber	K	Cold	50	4	0	140	14.0	8.0	0	330	25	3	1.0	0.50	93.704912
4	Almond Delight	R	Cold	110	2	2	200	1.0	14.0	8	1	25	3	1.0	0.75	34.384843
5	Apple Cinnamon Cheerios	G	Cold	110	2	2	180	1.5	10.5	10	70	25	1	1.0	0.75	29.509541
6	Apple Jacks	K	Cold	110	2	0	125	1.0	11.0	14	30	25	2	1.0	1.00	33.174094

Or we can do something similar for the end of a dataframe. Let’s say now we want all the rows up to Apple Jacks and only the columns from sugars onward to the end of the dataframe. What we would write in this case is cereal.loc[:6, 'sugars':].

We would omit the ending label this time after the : (colon) .

cereal.loc[:6, 'sugars':]

	sugars	potass	vitamins	shelf	weight	cups	rating
0	6	280	25	3	1.0	0.33	68.402973
1	8	135	0	3	1.0	1.00	33.983679
2	5	320	25	3	1.0	0.33	59.425505
3	0	330	25	3	1.0	0.50	93.704912
4	8	1	25	3	1.0	0.75	34.384843
5	10	70	25	1	1.0	0.75	29.509541
6	14	30	25	2	1.0	1.00	33.174094

Let’s apply what we learned!

My dataframe object name is fruit_salad.

           name    colour    location    seed   shape  sweetness   water-content  weight
       apple       red     canada    True   round     True          84         100
      banana    yellow     mexico   False    long     True          75         120
  cantaloupe    orange      spain    True   round     True          90        1360
dragon-fruit   magenta      china    True   round    False          96         600
  elderberry    purple    austria   False   round     True          80           5
         fig    purple     turkey   False    oval    False          78          40
       guava     green     mexico    True    oval     True          83         450
 huckleberry      blue     canada    True   round     True          73           5
        kiwi     brown      china    True   round     True          80          76
       lemon    yellow     mexico   False    oval    False          83          65

1. If you wanted only the rows from cantaloupe to kiwi, what would your code look like using index labels?
a) fruit_salad.loc[2, 8]
b) fruit_salad[2, 8]
c) fruit_salad[2:8]
d) fruit_salad.loc[2:8]

2. If you wanted all the rows between cantaloupe and fig and only columns name to seed, what would your code look like using index labels?
a) fruit_salad.loc[2:5, "colour":"seed"]
b) fruit_salad.loc[2:5, "name":"seed"]
c) fruit_salad.loc["name":"seed", 2:5]
d) fruit_salad[2:8, "name":"seed"]

Solutions!

d) fruit_salad.loc[2:8]
b) fruit_salad.loc[2:5, "name":"seed"]

	calories	protein	fat	sodium	fiber
5	110	2	2	180	1.5
6	110	2	0	125	1.0
7	130	3	2	210	2.0
8	90	2	1	200	4.0
9	90	3	0	210	5.0
10	120	1	2	220	0.0

	calories	protein	fat	sodium	fiber
5	110	2	2	180	1.5
6	110	2	0	125	1.0
7	130	3	2	210	2.0
8	90	2	1	200	4.0
9	90	3	0	210	5.0
10	120	1	2	220	0.0

Programming in Python for Data Science

Slicing Rows using .loc[]¶

	calories	protein	fat	sodium	fiber
5	110	2	2	180	1.5
6	110	2	0	125	1.0
7	130	3	2	210	2.0
8	90	2	1	200	4.0
9	90	3	0	210	5.0
10	120	1	2	220	0.0