Slicing Rows using .loc[]

Watch it

See the accompanied youtube video at the link here.

Congratulations on writing your first code!

We have read in our data, and we know the dimensions. Well, now what?

Let’s go over how we would index, slice, and select certain columns or rows of our data.

Let’s start by importing pandas and loading in a dataset named cereal.csv, and we will save it as cereal.

import pandas as pd
  
cereal = pd.read_csv('cereal.csv')
cereal.head()
name mfr type calories protein fat sodium fiber carbo sugars potass vitamins shelf weight cups rating
0 100% Bran N Cold 70 4 1 130 10.0 5.0 6 280 25 3 1.0 0.33 68.402973
1 100% Natural Bran Q Cold 120 3 5 15 2.0 8.0 8 135 0 3 1.0 1.00 33.983679
2 All-Bran K Cold 70 4 1 260 9.0 7.0 5 320 25 3 1.0 0.33 59.425505
3 All-Bran with Extra Fiber K Cold 50 4 0 140 14.0 8.0 0 330 25 3 1.0 0.50 93.704912
4 Almond Delight R Cold 110 2 2 200 1.0 14.0 8 1 25 3 1.0 0.75 34.384843

Attribution:
80 Cereals” (c) by Chris Crawford is licensed under Creative Commons Attribution-ShareAlike 3.0 Unported.

We can see all the columns and the first 5 rows of the dataframe using .head()

However, let’s say we only want certain rows of the dataframe or certain columns.

We talked about how .head() will generate the first few rows of a dataframe (5 as default), but what if we wanted the rows from 5-10?

The first column of this dataframe is called the index.

Each row has a label (the index) as well as a position. In this case, the index label of an observation is the same as it’s position.

This doesn’t always have to be the case. We can assign another column as the index; however, we will wait to discuss this in the next module.

Here are the first 15 rows of the dataframe.

cereal.head(15)
name mfr type calories protein fat sodium fiber carbo sugars potass vitamins shelf weight cups rating
0 100% Bran N Cold 70 4 1 130 10.0 5.0 6 280 25 3 1.00 0.33 68.402973
1 100% Natural Bran Q Cold 120 3 5 15 2.0 8.0 8 135 0 3 1.00 1.00 33.983679
2 All-Bran K Cold 70 4 1 260 9.0 7.0 5 320 25 3 1.00 0.33 59.425505
3 All-Bran with Extra Fiber K Cold 50 4 0 140 14.0 8.0 0 330 25 3 1.00 0.50 93.704912
4 Almond Delight R Cold 110 2 2 200 1.0 14.0 8 1 25 3 1.00 0.75 34.384843
5 Apple Cinnamon Cheerios G Cold 110 2 2 180 1.5 10.5 10 70 25 1 1.00 0.75 29.509541
6 Apple Jacks K Cold 110 2 0 125 1.0 11.0 14 30 25 2 1.00 1.00 33.174094
7 Basic 4 G Cold 130 3 2 210 2.0 18.0 8 100 25 3 1.33 0.75 37.038562
8 Bran Chex R Cold 90 2 1 200 4.0 15.0 6 125 25 1 1.00 0.67 49.120253
9 Bran Flakes P Cold 90 3 0 210 5.0 13.0 5 190 25 3 1.00 0.67 53.313813
10 Cap'n'Crunch Q Cold 120 1 2 220 0.0 12.0 12 35 25 2 1.00 0.75 18.042851
11 Cheerios G Cold 110 6 2 290 2.0 17.0 1 105 25 1 1.00 1.25 50.764999
12 Cinnamon Toast Crunch G Cold 120 1 3 210 0.0 13.0 9 45 25 2 1.00 0.75 19.823573
13 Clusters G Cold 110 3 2 140 2.0 13.0 7 105 25 3 1.00 0.50 40.400208
14 Cocoa Puffs G Cold 110 1 1 180 0.0 12.0 13 55 25 2 1.00 1.00 22.736446

Let’s talk about observation 4 which is named Almond Delight. Its index label is 4 as well as it’s index position.

If you just went and counted those again and started screaming “5! It’s the fifth position”, that’s OK. In the Python language, we start counting at position 0 (then 1, 2, 3, and 4 for Almond Delight).

Now let’s say we want all 5 rows past Almond Delight. That means we want the rows from Apple Cinnamon Cheerios to Cap'n'Crunch.

We will use .loc[] with square brackets to cut the dataframe from “Apple Cinnamon Cheerios” to “Cap’n’Crunch,” keeping the columns and everything between.

This code is written as cereal.loc[5:10], and it can be interpreted as “Obtain the rows in the dataframe located from 5 to 10.”

cereal.loc[5:10]
name mfr type calories protein fat sodium fiber carbo sugars potass vitamins shelf weight cups rating
5 Apple Cinnamon Cheerios G Cold 110 2 2 180 1.5 10.5 10 70 25 1 1.00 0.75 29.509541
6 Apple Jacks K Cold 110 2 0 125 1.0 11.0 14 30 25 2 1.00 1.00 33.174094
7 Basic 4 G Cold 130 3 2 210 2.0 18.0 8 100 25 3 1.33 0.75 37.038562
8 Bran Chex R Cold 90 2 1 200 4.0 15.0 6 125 25 1 1.00 0.67 49.120253
9 Bran Flakes P Cold 90 3 0 210 5.0 13.0 5 190 25 3 1.00 0.67 53.313813
10 Cap'n'Crunch Q Cold 120 1 2 220 0.0 12.0 12 35 25 2 1.00 0.75 18.042851

What if we only wanted certain columns now?

Perhaps we were only interested in the calories to fiber columns of the “Apple Cinnamon Cheerios” to “Cap’n’Crunch” rows?

We put in the interval of our desired rows first, and then the columns, and we separate these values with a comma.

.loc[] is used to slice columns and rows by label, and within an interval.

cereal.loc[5:10, 'calories':'fiber']
calories protein fat sodium fiber
5 110 2 2 180 1.5
6 110 2 0 125 1.0
7 130 3 2 210 2.0
8 90 2 1 200 4.0
9 90 3 0 210 5.0
10 120 1 2 220 0.0

The general format to slice both rows and columns together looks like this:

cereal.loc['row name start':'row name end', 'column name start':'column name end']

There is a handy shortcut for slices that include the beginning of a dataframe to a specified row or column label or a specified row or column label to the end of a dataframe.

For example if we want all the rows up to “Apple Jacks” which has a label equal to 6, we could omit the first label in the code all together. So we can write cereal.loc[:6].

cereal.loc[:6]
name mfr type calories protein fat sodium fiber carbo sugars potass vitamins shelf weight cups rating
0 100% Bran N Cold 70 4 1 130 10.0 5.0 6 280 25 3 1.0 0.33 68.402973
1 100% Natural Bran Q Cold 120 3 5 15 2.0 8.0 8 135 0 3 1.0 1.00 33.983679
2 All-Bran K Cold 70 4 1 260 9.0 7.0 5 320 25 3 1.0 0.33 59.425505
3 All-Bran with Extra Fiber K Cold 50 4 0 140 14.0 8.0 0 330 25 3 1.0 0.50 93.704912
4 Almond Delight R Cold 110 2 2 200 1.0 14.0 8 1 25 3 1.0 0.75 34.384843
5 Apple Cinnamon Cheerios G Cold 110 2 2 180 1.5 10.5 10 70 25 1 1.0 0.75 29.509541
6 Apple Jacks K Cold 110 2 0 125 1.0 11.0 14 30 25 2 1.0 1.00 33.174094

Or we can do something similar for the end of a dataframe. Let’s say now we want all the rows up to Apple Jacks and only the columns from sugars onward to the end of the dataframe. What we would write in this case is cereal.loc[:6, 'sugars':].

We would omit the ending label this time after the : (colon) .

cereal.loc[:6, 'sugars':]
sugars potass vitamins shelf weight cups rating
0 6 280 25 3 1.0 0.33 68.402973
1 8 135 0 3 1.0 1.00 33.983679
2 5 320 25 3 1.0 0.33 59.425505
3 0 330 25 3 1.0 0.50 93.704912
4 8 1 25 3 1.0 0.75 34.384843
5 10 70 25 1 1.0 0.75 29.509541
6 14 30 25 2 1.0 1.00 33.174094

Let’s apply what we learned!

My dataframe object name is fruit_salad.

           name    colour    location    seed   shape  sweetness   water-content  weight
0         apple       red     canada    True   round     True          84         100
1        banana    yellow     mexico   False    long     True          75         120
2    cantaloupe    orange      spain    True   round     True          90        1360
3  dragon-fruit   magenta      china    True   round    False          96         600
4    elderberry    purple    austria   False   round     True          80           5
5           fig    purple     turkey   False    oval    False          78          40
6         guava     green     mexico    True    oval     True          83         450
7   huckleberry      blue     canada    True   round     True          73           5
8          kiwi     brown      china    True   round     True          80          76
9         lemon    yellow     mexico   False    oval    False          83          65

1. If you wanted only the rows from cantaloupe to kiwi, what would your code look like using index labels?
a) fruit_salad.loc[2, 8]
b) fruit_salad[2, 8]
c) fruit_salad[2:8]
d) fruit_salad.loc[2:8]

2. If you wanted all the rows between cantaloupe and fig and only columns name to seed, what would your code look like using index labels?
a) fruit_salad.loc[2:5, "colour":"seed"]
b) fruit_salad.loc[2:5, "name":"seed"]
c) fruit_salad.loc["name":"seed", 2:5]
d) fruit_salad[2:8, "name":"seed"]