Slicing and selecting using .iloc[]¶
Watch it
See the accompanied youtube video at the link here.
Up until point, we have been manipulating our dataframe with column and
row labels using .loc[]
.
Slicing can also be done by the location position of each row with
.iloc[]
.
.iloc[]
is very similar to .loc[]
, however, the “i” in iloc
refers
to the index integer position.
Slicing Dataframe¶
We are going to return to our cereal dataset and take a look at the first 10 rows.
cereal = pd.read_csv('cereal.csv')
cereal.head(10)
name | mfr | type | calories | protein | fat | sodium | ... | sugars | potass | vitamins | shelf | weight | cups | rating | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100% Bran | N | Cold | 70 | 4 | 1 | 130 | ... | 6 | 280 | 25 | 3 | 1.00 | 0.33 | 68.402973 |
1 | 100% Natural Bran | Q | Cold | 120 | 3 | 5 | 15 | ... | 8 | 135 | 0 | 3 | 1.00 | 1.00 | 33.983679 |
2 | All-Bran | K | Cold | 70 | 4 | 1 | 260 | ... | 5 | 320 | 25 | 3 | 1.00 | 0.33 | 59.425505 |
3 | All-Bran with Extra Fiber | K | Cold | 50 | 4 | 0 | 140 | ... | 0 | 330 | 25 | 3 | 1.00 | 0.50 | 93.704912 |
4 | Almond Delight | R | Cold | 110 | 2 | 2 | 200 | ... | 8 | 1 | 25 | 3 | 1.00 | 0.75 | 34.384843 |
5 | Apple Cinnamon Cheerios | G | Cold | 110 | 2 | 2 | 180 | ... | 10 | 70 | 25 | 1 | 1.00 | 0.75 | 29.509541 |
6 | Apple Jacks | K | Cold | 110 | 2 | 0 | 125 | ... | 14 | 30 | 25 | 2 | 1.00 | 1.00 | 33.174094 |
7 | Basic 4 | G | Cold | 130 | 3 | 2 | 210 | ... | 8 | 100 | 25 | 3 | 1.33 | 0.75 | 37.038562 |
8 | Bran Chex | R | Cold | 90 | 2 | 1 | 200 | ... | 6 | 125 | 25 | 1 | 1.00 | 0.67 | 49.120253 |
9 | Bran Flakes | P | Cold | 90 | 3 | 0 | 210 | ... | 5 | 190 | 25 | 3 | 1.00 | 0.67 | 53.313813 |
10 rows × 16 columns
Let’s say we want the rows from All-Bran
to Apple Cinnamon Cheerios
,
but we want to slice based on their position instead of their label.
Using Python’s counting method of starting at zero, we conclude that
All-Bran
to be at position to 2.
We get Apple Cinnamon Cheerios
position to be 5 in the same way.
We are lucky with this dataframe because our index labels match the position of the rows, and this makes things a little bit simpler.
We can use the same coding structure we learned with .loc[]
but this
time using row positions instead of labels with .iloc[]
.
cereal.loc[2:5]
name | mfr | type | calories | protein | fat | sodium | ... | sugars | potass | vitamins | shelf | weight | cups | rating | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | All-Bran | K | Cold | 70 | 4 | 1 | 260 | ... | 5 | 320 | 25 | 3 | 1.0 | 0.33 | 59.425505 |
3 | All-Bran with Extra Fiber | K | Cold | 50 | 4 | 0 | 140 | ... | 0 | 330 | 25 | 3 | 1.0 | 0.50 | 93.704912 |
4 | Almond Delight | R | Cold | 110 | 2 | 2 | 200 | ... | 8 | 1 | 25 | 3 | 1.0 | 0.75 | 34.384843 |
5 | Apple Cinnamon Cheerios | G | Cold | 110 | 2 | 2 | 180 | ... | 10 | 70 | 25 | 1 | 1.0 | 0.75 | 29.509541 |
4 rows × 16 columns
cereal.iloc[2:5]
name | mfr | type | calories | protein | fat | sodium | ... | sugars | potass | vitamins | shelf | weight | cups | rating | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | All-Bran | K | Cold | 70 | 4 | 1 | 260 | ... | 5 | 320 | 25 | 3 | 1.0 | 0.33 | 59.425505 |
3 | All-Bran with Extra Fiber | K | Cold | 50 | 4 | 0 | 140 | ... | 0 | 330 | 25 | 3 | 1.0 | 0.50 | 93.704912 |
4 | Almond Delight | R | Cold | 110 | 2 | 2 | 200 | ... | 8 | 1 | 25 | 3 | 1.0 | 0.75 | 34.384843 |
3 rows × 16 columns
But wait! Something is missing here!
Why doesn’t Apple Cinnamon Cheerios
appear in the dataframe?
The reason for this is that when we use slicing with indices, it will take all the indices including the lower bound but EXCLUDING the upper bound.
If we want to include Apple Cinnamon Cheerios
we would have to go 1
index position further.
cereal.iloc[2:6]
name | mfr | type | calories | protein | fat | sodium | ... | sugars | potass | vitamins | shelf | weight | cups | rating | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2 | All-Bran | K | Cold | 70 | 4 | 1 | 260 | ... | 5 | 320 | 25 | 3 | 1.0 | 0.33 | 59.425505 |
3 | All-Bran with Extra Fiber | K | Cold | 50 | 4 | 0 | 140 | ... | 0 | 330 | 25 | 3 | 1.0 | 0.50 | 93.704912 |
4 | Almond Delight | R | Cold | 110 | 2 | 2 | 200 | ... | 8 | 1 | 25 | 3 | 1.0 | 0.75 | 34.384843 |
5 | Apple Cinnamon Cheerios | G | Cold | 110 | 2 | 2 | 180 | ... | 10 | 70 | 25 | 1 | 1.0 | 0.75 | 29.509541 |
4 rows × 16 columns
If we think about this a bit it actually make some sense. Think about
the calculation 6 - 2 = 4
. We get 4 items remaining which is the
amount of cereals we want in our in new dataframe.
The same concept can be applied to the columns of the dataframe.
Let’s say we want all the rows but we only want the columns starting at
name
and ending (and including) at column fat
.
Using the logic we learned in the last section, we would use the
following code.
We would need to specify all rows using :
as we did when we used
.loc[]
.
The column name
is at index position 0 (we do not include the index
label as a column) and fat
is at index position 5.
Since we want to include the 5th column we need to use the 6th position to make sure we get all the columns BEFORE the upper bound.
cereal.iloc[:, 0:6]
name | mfr | type | calories | protein | fat | |
---|---|---|---|---|---|---|
0 | 100% Bran | N | Cold | 70 | 4 | 1 |
1 | 100% Natural Bran | Q | Cold | 120 | 3 | 5 |
2 | All-Bran | K | Cold | 70 | 4 | 1 |
3 | All-Bran with Extra Fiber | K | Cold | 50 | 4 | 0 |
4 | Almond Delight | R | Cold | 110 | 2 | 2 |
... | ... | ... | ... | ... | ... | ... |
72 | Triples | G | Cold | 110 | 2 | 1 |
73 | Trix | G | Cold | 110 | 1 | 1 |
74 | Wheat Chex | R | Cold | 100 | 3 | 1 |
75 | Wheaties | G | Cold | 100 | 3 | 1 |
76 | Wheaties Honey Gold | G | Cold | 110 | 2 | 1 |
77 rows × 6 columns
Let’s say we want the rows All-Bran
to Apple Cinnamon Cheerios
and
name
to fat
.
For rows, the lower bound All-Bran
is located at position 2 and the
upper bound Apple Cinnamon Cheerios
is located at position 5.
Now the column’s lower bound name
is located at position 0 and the
upper bound fat
is located at position 5.
The same would apply if we only wanted certain rows with certain columns.
Both of our upper bound have been compensated with an added 1 to make sure they are included in the new dataframe.
So the code we have to use to do this is the following:
cereal.iloc[2:6, 0:6]
name | mfr | type | calories | protein | fat | |
---|---|---|---|---|---|---|
2 | All-Bran | K | Cold | 70 | 4 | 1 |
3 | All-Bran with Extra Fiber | K | Cold | 50 | 4 | 0 |
4 | Almond Delight | R | Cold | 110 | 2 | 2 |
5 | Apple Cinnamon Cheerios | G | Cold | 110 | 2 | 2 |
There are multiple different way of writing code when you are selecting items from the beginning or end of your data.
Perhaps you only want the first 3 rows of your data.
We can use .head(3)
or we can use .iloc[]
.
cereal.head(3)
name | mfr | type | calories | protein | fat | sodium | ... | sugars | potass | vitamins | shelf | weight | cups | rating | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100% Bran | N | Cold | 70 | 4 | 1 | 130 | ... | 6 | 280 | 25 | 3 | 1.0 | 0.33 | 68.402973 |
1 | 100% Natural Bran | Q | Cold | 120 | 3 | 5 | 15 | ... | 8 | 135 | 0 | 3 | 1.0 | 1.00 | 33.983679 |
2 | All-Bran | K | Cold | 70 | 4 | 1 | 260 | ... | 5 | 320 | 25 | 3 | 1.0 | 0.33 | 59.425505 |
3 rows × 16 columns
cereal.iloc[0:3]
name | mfr | type | calories | protein | fat | sodium | ... | sugars | potass | vitamins | shelf | weight | cups | rating | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100% Bran | N | Cold | 70 | 4 | 1 | 130 | ... | 6 | 280 | 25 | 3 | 1.0 | 0.33 | 68.402973 |
1 | 100% Natural Bran | Q | Cold | 120 | 3 | 5 | 15 | ... | 8 | 135 | 0 | 3 | 1.0 | 1.00 | 33.983679 |
2 | All-Bran | K | Cold | 70 | 4 | 1 | 260 | ... | 5 | 320 | 25 | 3 | 1.0 | 0.33 | 59.425505 |
3 rows × 16 columns
Since we are indicating the beginning of the dataframe, we can omit the
upper bound 0
just like we did when we learned slicing with .loc[]
.
cereal.iloc[:3]
name | mfr | type | calories | protein | fat | sodium | ... | sugars | potass | vitamins | shelf | weight | cups | rating | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100% Bran | N | Cold | 70 | 4 | 1 | 130 | ... | 6 | 280 | 25 | 3 | 1.0 | 0.33 | 68.402973 |
1 | 100% Natural Bran | Q | Cold | 120 | 3 | 5 | 15 | ... | 8 | 135 | 0 | 3 | 1.0 | 1.00 | 33.983679 |
2 | All-Bran | K | Cold | 70 | 4 | 1 | 260 | ... | 5 | 320 | 25 | 3 | 1.0 | 0.33 | 59.425505 |
3 rows × 16 columns
The same logic can be applied for end of a dataframe. This time we want the last 3 rows.
Since this dataframe has 76 rows we could use our lower bound as 74 and upper bound as 77 (76 +1).
cereal.iloc[74:77]
name | mfr | type | calories | protein | fat | sodium | ... | sugars | potass | vitamins | shelf | weight | cups | rating | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
74 | Wheat Chex | R | Cold | 100 | 3 | 1 | 230 | ... | 3 | 115 | 25 | 1 | 1.0 | 0.67 | 49.787445 |
75 | Wheaties | G | Cold | 100 | 3 | 1 | 200 | ... | 3 | 110 | 25 | 1 | 1.0 | 1.00 | 51.592193 |
76 | Wheaties Honey Gold | G | Cold | 110 | 2 | 1 | 200 | ... | 8 | 60 | 25 | 1 | 1.0 | 0.75 | 36.187559 |
3 rows × 16 columns
A better and easier way to write this, where you don’t even need to know the number of rows in the dataframe would be to specify you are counting your rows from the bottom with a negative in front of the number of rows you want.
This example takes the last 3 rows of the bottom of the dataframe.
Since we are collecting data to the end of the dataframe, we do not need to include the ending row index number.
cereal.iloc[-3:]
name | mfr | type | calories | protein | fat | sodium | ... | sugars | potass | vitamins | shelf | weight | cups | rating | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
74 | Wheat Chex | R | Cold | 100 | 3 | 1 | 230 | ... | 3 | 115 | 25 | 1 | 1.0 | 0.67 | 49.787445 |
75 | Wheaties | G | Cold | 100 | 3 | 1 | 200 | ... | 3 | 110 | 25 | 1 | 1.0 | 1.00 | 51.592193 |
76 | Wheaties Honey Gold | G | Cold | 110 | 2 | 1 | 200 | ... | 8 | 60 | 25 | 1 | 1.0 | 0.75 | 36.187559 |
3 rows × 16 columns
Selecting with .iloc[]¶
Selecting using .iloc[]
is done identically to .loc[]
, however, the
items within each set of square brackets MUST be integers, and not
in quotation marks.
cereal.head(10)
name | mfr | type | calories | protein | fat | sodium | ... | sugars | potass | vitamins | shelf | weight | cups | rating | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100% Bran | N | Cold | 70 | 4 | 1 | 130 | ... | 6 | 280 | 25 | 3 | 1.00 | 0.33 | 68.402973 |
1 | 100% Natural Bran | Q | Cold | 120 | 3 | 5 | 15 | ... | 8 | 135 | 0 | 3 | 1.00 | 1.00 | 33.983679 |
2 | All-Bran | K | Cold | 70 | 4 | 1 | 260 | ... | 5 | 320 | 25 | 3 | 1.00 | 0.33 | 59.425505 |
3 | All-Bran with Extra Fiber | K | Cold | 50 | 4 | 0 | 140 | ... | 0 | 330 | 25 | 3 | 1.00 | 0.50 | 93.704912 |
4 | Almond Delight | R | Cold | 110 | 2 | 2 | 200 | ... | 8 | 1 | 25 | 3 | 1.00 | 0.75 | 34.384843 |
5 | Apple Cinnamon Cheerios | G | Cold | 110 | 2 | 2 | 180 | ... | 10 | 70 | 25 | 1 | 1.00 | 0.75 | 29.509541 |
6 | Apple Jacks | K | Cold | 110 | 2 | 0 | 125 | ... | 14 | 30 | 25 | 2 | 1.00 | 1.00 | 33.174094 |
7 | Basic 4 | G | Cold | 130 | 3 | 2 | 210 | ... | 8 | 100 | 25 | 3 | 1.33 | 0.75 | 37.038562 |
8 | Bran Chex | R | Cold | 90 | 2 | 1 | 200 | ... | 6 | 125 | 25 | 1 | 1.00 | 0.67 | 49.120253 |
9 | Bran Flakes | P | Cold | 90 | 3 | 0 | 210 | ... | 5 | 190 | 25 | 3 | 1.00 | 0.67 | 53.313813 |
10 rows × 16 columns
Let’s say we want the rows Almond Delight
, Basic 4
and Apple Jacks
with the columns name
, calories
, fat
and type
and in that
specific order.
|
|
Almond Delight
takes row position 4, Basic 4
takes row position 7
and Apple Jacks
is located at position 6. The desired columns name
,
calories
, fat
, and type
take column index positions 0, 5, 3, and 2
respectively.
Now let’s put those positions into square backing within df.iloc[]
.
cereal.iloc[[4, 7, 6], [0, 3, 5, 2]]
name | calories | fat | type | |
---|---|---|---|---|
4 | Almond Delight | 110 | 2 | Cold |
7 | Basic 4 | 130 | 2 | Cold |
6 | Apple Jacks | 110 | 0 | Cold |
Let’s apply what we learned!
Here is our fruit_salad
data again:
name colour location seed shape sweetness water-content weight
0 apple red canada True round True 84 100
1 banana yellow mexico False long True 75 120
2 cantaloupe orange spain True round True 90 1360
3 dragon-fruit magenta china True round False 96 600
4 elderberry purple austria False round True 80 5
5 fig purple turkey False oval False 78 40
6 guava green mexico True oval True 83 450
7 huckleberry blue canada True round True 73 5
8 kiwi brown china True round True 80 76
9 lemon yellow mexico False oval False 83 65
1. If I wanted the rows elderberry
to kiwi
and only columns seeds
, shape
, sweetness
and water-content
, what would my code look like if I was using index positions?
a) fruit_salad.iloc[4:9, 3:]
b) fruit_salad.iloc[4:8, 3:7]
c) fruit_salad.iloc[4:9, 3:7]
d) fruit_salad.iloc[5:10, 4:8]
2. If I wanted the rows lemon
and cantaloupe
but only the columns colour
, weight
and seeds
in that order, what would my code look like if I was using index position?
a) fruit_salad.iloc[[lemon, cantaloupe], [colour, weight, seeds]]
b) fruit_salad.iloc[[10, 3], [2, 8, 4]]
c) fruit_salad.iloc[[9, 2], [1, 7, 3]]
d) fruit_salad[[9, 2], [1, 7, 3]]
Solutions!
c)
fruit_salad.iloc[4:9, 3:7]
c)
fruit_salad.iloc[[9, 2], [1, 7, 3]]