Data Filtering¶

Exercise

See the accompanied youtube video at the link here.

import pandas as pd
import numpy as np

pd.set_option('display.width', 350)
np.set_printoptions(linewidth=400)
pd.set_option('display.max_columns', 20)
pd.set_option('display.max_rows', 20)

Filtering is probably one of the most frequent data manipulations you will do in data analysis.

Filtering is often used when we are either trying to rid the dataframe of unwanted rows or trying to analyze rows with a particular column value.

Let’s try to filter the cereal.csv dataset.

cereal = pd.read_csv('cereal.csv')
cereal.head()

	name	mfr	type	calories	protein	fat	sodium	fiber	carbo	sugars	potass	vitamins	shelf	weight	cups	rating
0	100% Bran	N	Cold	70	4	1	130	10.0	5.0	6	280	25	3	1.0	0.33	68.402973
1	100% Natural Bran	Q	Cold	120	3	5	15	2.0	8.0	8	135	0	3	1.0	1.00	33.983679
2	All-Bran	K	Cold	70	4	1	260	9.0	7.0	5	320	25	3	1.0	0.33	59.425505
3	All-Bran with Extra Fiber	K	Cold	50	4	0	140	14.0	8.0	0	330	25	3	1.0	0.50	93.704912
4	Almond Delight	R	Cold	110	2	2	200	1.0	14.0	8	1	25	3	1.0	0.75	34.384843

Conditions¶

Suppose you are trying to find the information for cereals with a protein content greater than 4g per serving.

Our first instinct would be to write code that looks somewhat like this.

cereal['protein'] > 4

   False
   False
   False
   False
   False
      ...  
  False
  False
  False
  False
  False
Name: protein, Length: 77, dtype: bool

This can be translated as

“From the protein column in the dataframe cereal, which have values greater than 4?”

The output shows all the index labels and a column with True or False values depending on if the row meets the condition. Cereals with True have a protein content greater than 4 and False if they do not.

But we want a dataframe with all the information that only contains the rows with protein above 4.

How can this be achieved?

We can do this by indexing into our cereal dataframe using this column of True/False values. The result is a smaller dataframe that only contains the rows corresponding to the True values.

cereal[cereal['protein'] > 4]

	name	mfr	type	calories	protein	fat	sodium	fiber	carbo	sugars	potass	vitamins	shelf	weight	cups	rating
11	Cheerios	G	Cold	110	6	2	290	2.0	17.0	1	105	25	1	1.0	1.25	50.764999
57	Quaker Oatmeal	Q	Hot	100	5	2	0	2.7	1.0	1	110	0	1	1.0	0.67	50.828392
67	Special K	K	Cold	110	6	0	230	1.0	16.0	3	55	25	1	1.0	1.00	53.131324

This code can be translated to:

Select the rows from the cereal dataframe that, according to the cereal dataframe, have a protein values greater than 4.

We can see from the output that only the rows meeting the condition are displayed.

By the way, it is a common pattern that we’re using the same dataframe twice, namely cereal, but it’s not strictly required by pandas.

We can do this with equalities as well.

cereal[cereal['protein'] == 4]

	name	mfr	type	calories	protein	fat	sodium	fiber	carbo	sugars	potass	vitamins	shelf	weight	cups	rating
0	100% Bran	N	Cold	70	4	1	130	10.0	5.0	6	280	25	3	1.0	0.33	68.402973
2	All-Bran	K	Cold	70	4	1	260	9.0	7.0	5	320	25	3	1.0	0.33	59.425505
3	All-Bran with Extra Fiber	K	Cold	50	4	0	140	14.0	8.0	0	330	25	3	1.0	0.50	93.704912
41	Life	Q	Cold	100	4	2	150	2.0	12.0	6	95	25	2	1.0	0.67	45.328074
43	Maypo	A	Hot	100	4	1	0	0.0	16.0	3	95	25	2	1.0	1.00	54.850917
44	Muesli Raisins; Dates; & Almonds	R	Cold	150	4	3	95	3.0	16.0	11	170	25	3	1.0	1.00	37.136863
45	Muesli Raisins; Peaches; & Pecans	R	Cold	150	4	3	150	3.0	16.0	11	170	25	3	1.0	1.00	34.139765
56	Quaker Oat Squares	Q	Cold	100	4	1	135	2.0	14.0	6	110	25	3	1.0	0.50	49.511874

Now we get all the cereals with a protein content of exactly 4g per serving.

The key point to remember here is that we use two equal signs.

In Python, a single = is used as an assignment operator. We are setting something to equal something else.

The double equal sign operator is used for comparison. We check if certain values are equivalent to one another.

By the way, these conventions were set a long time ago when people made the early programming languages. In hindsight, maybe something like =? would have been more clear, but the double equal sign for comparison is now a standard.

We can filter categorical columns too. In this example, we only want cereals from the manufacturer “Q” (For Quaker):

Here, we are using the double equal sign operator that we saw in the last slide.

cereal[cereal['mfr'] == 'Q']

	name	mfr	type	calories	protein	fat	sodium	fiber	carbo	sugars	potass	vitamins	shelf	weight	cups	rating
1	100% Natural Bran	Q	Cold	120	3	5	15	2.0	8.0	8	135	0	3	1.0	1.00	33.983679
10	Cap'n'Crunch	Q	Cold	120	1	2	220	0.0	12.0	12	35	25	2	1.0	0.75	18.042851
35	Honey Graham Ohs	Q	Cold	120	1	2	220	1.0	12.0	11	45	25	2	1.0	1.00	21.871292
41	Life	Q	Cold	100	4	2	150	2.0	12.0	6	95	25	2	1.0	0.67	45.328074
54	Puffed Rice	Q	Cold	50	1	0	0	0.0	13.0	0	15	0	3	0.5	1.00	60.756112
55	Puffed Wheat	Q	Cold	50	2	0	0	1.0	10.0	0	50	0	3	0.5	1.00	63.005645
56	Quaker Oat Squares	Q	Cold	100	4	1	135	2.0	14.0	6	110	25	3	1.0	0.50	49.511874
57	Quaker Oatmeal	Q	Hot	100	5	2	0	2.7	1.0	1	110	0	1	1.0	0.67	50.828392

Multiple Condition Filtering - “and”¶

We now know how to filter on one condition but how do we filter if we have many?

Perhaps we only want cereals with protein content between 4 to 5 grams?

To find the cereals that meet protein contents greater or equal to 4, we use the code shown here.

cereal[cereal['protein'] >= 4]

	name	mfr	type	calories	protein	fat	sodium	fiber	carbo	sugars	potass	vitamins	shelf	weight	cups	rating
0	100% Bran	N	Cold	70	4	1	130	10.0	5.0	6	280	25	3	1.0	0.33	68.402973
2	All-Bran	K	Cold	70	4	1	260	9.0	7.0	5	320	25	3	1.0	0.33	59.425505
3	All-Bran with Extra Fiber	K	Cold	50	4	0	140	14.0	8.0	0	330	25	3	1.0	0.50	93.704912
11	Cheerios	G	Cold	110	6	2	290	2.0	17.0	1	105	25	1	1.0	1.25	50.764999
41	Life	Q	Cold	100	4	2	150	2.0	12.0	6	95	25	2	1.0	0.67	45.328074
43	Maypo	A	Hot	100	4	1	0	0.0	16.0	3	95	25	2	1.0	1.00	54.850917
44	Muesli Raisins; Dates; & Almonds	R	Cold	150	4	3	95	3.0	16.0	11	170	25	3	1.0	1.00	37.136863
45	Muesli Raisins; Peaches; & Pecans	R	Cold	150	4	3	150	3.0	16.0	11	170	25	3	1.0	1.00	34.139765
56	Quaker Oat Squares	Q	Cold	100	4	1	135	2.0	14.0	6	110	25	3	1.0	0.50	49.511874
57	Quaker Oatmeal	Q	Hot	100	5	2	0	2.7	1.0	1	110	0	1	1.0	0.67	50.828392
67	Special K	K	Cold	110	6	0	230	1.0	16.0	3	55	25	1	1.0	1.00	53.131324

And the cereals that meet the condition of protein content below or equal to 5 would be obtained as shown here.

cereal[cereal['protein'] <= 5]

	name	mfr	type	calories	protein	fat	sodium	fiber	carbo	sugars	potass	vitamins	shelf	weight	cups	rating
0	100% Bran	N	Cold	70	4	1	130	10.0	5.0	6	280	25	3	1.0	0.33	68.402973
1	100% Natural Bran	Q	Cold	120	3	5	15	2.0	8.0	8	135	0	3	1.0	1.00	33.983679
2	All-Bran	K	Cold	70	4	1	260	9.0	7.0	5	320	25	3	1.0	0.33	59.425505
3	All-Bran with Extra Fiber	K	Cold	50	4	0	140	14.0	8.0	0	330	25	3	1.0	0.50	93.704912
4	Almond Delight	R	Cold	110	2	2	200	1.0	14.0	8	1	25	3	1.0	0.75	34.384843
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
72	Triples	G	Cold	110	2	1	250	0.0	21.0	3	60	25	3	1.0	0.75	39.106174
73	Trix	G	Cold	110	1	1	140	0.0	13.0	12	25	25	2	1.0	1.00	27.753301
74	Wheat Chex	R	Cold	100	3	1	230	3.0	17.0	3	115	25	1	1.0	0.67	49.787445
75	Wheaties	G	Cold	100	3	1	200	3.0	17.0	3	110	25	1	1.0	1.00	51.592193
76	Wheaties Honey Gold	G	Cold	110	2	1	200	1.0	16.0	8	60	25	1	1.0	0.75	36.187559

75 rows × 16 columns

We can combine the two conditions using the & operator. This allows us to obtain cereals that meet both conditions.

cereal[cereal['protein'] >= 4]

	name	mfr	type	calories	protein	fat	sodium	fiber	carbo	sugars	potass	vitamins	shelf	weight	cups	rating
0	100% Bran	N	Cold	70	4	1	130	10.0	5.0	6	280	25	3	1.0	0.33	68.402973
2	All-Bran	K	Cold	70	4	1	260	9.0	7.0	5	320	25	3	1.0	0.33	59.425505
3	All-Bran with Extra Fiber	K	Cold	50	4	0	140	14.0	8.0	0	330	25	3	1.0	0.50	93.704912
11	Cheerios	G	Cold	110	6	2	290	2.0	17.0	1	105	25	1	1.0	1.25	50.764999
41	Life	Q	Cold	100	4	2	150	2.0	12.0	6	95	25	2	1.0	0.67	45.328074
43	Maypo	A	Hot	100	4	1	0	0.0	16.0	3	95	25	2	1.0	1.00	54.850917
44	Muesli Raisins; Dates; & Almonds	R	Cold	150	4	3	95	3.0	16.0	11	170	25	3	1.0	1.00	37.136863
45	Muesli Raisins; Peaches; & Pecans	R	Cold	150	4	3	150	3.0	16.0	11	170	25	3	1.0	1.00	34.139765
56	Quaker Oat Squares	Q	Cold	100	4	1	135	2.0	14.0	6	110	25	3	1.0	0.50	49.511874
57	Quaker Oatmeal	Q	Hot	100	5	2	0	2.7	1.0	1	110	0	1	1.0	0.67	50.828392
67	Special K	K	Cold	110	6	0	230	1.0	16.0	3	55	25	1	1.0	1.00	53.131324

cereal[cereal['protein'] <= 5]

	name	mfr	type	calories	protein	fat	sodium	fiber	carbo	sugars	potass	vitamins	shelf	weight	cups	rating
0	100% Bran	N	Cold	70	4	1	130	10.0	5.0	6	280	25	3	1.0	0.33	68.402973
1	100% Natural Bran	Q	Cold	120	3	5	15	2.0	8.0	8	135	0	3	1.0	1.00	33.983679
2	All-Bran	K	Cold	70	4	1	260	9.0	7.0	5	320	25	3	1.0	0.33	59.425505
3	All-Bran with Extra Fiber	K	Cold	50	4	0	140	14.0	8.0	0	330	25	3	1.0	0.50	93.704912
4	Almond Delight	R	Cold	110	2	2	200	1.0	14.0	8	1	25	3	1.0	0.75	34.384843
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
72	Triples	G	Cold	110	2	1	250	0.0	21.0	3	60	25	3	1.0	0.75	39.106174
73	Trix	G	Cold	110	1	1	140	0.0	13.0	12	25	25	2	1.0	1.00	27.753301
74	Wheat Chex	R	Cold	100	3	1	230	3.0	17.0	3	115	25	1	1.0	0.67	49.787445
75	Wheaties	G	Cold	100	3	1	200	3.0	17.0	3	110	25	1	1.0	1.00	51.592193
76	Wheaties Honey Gold	G	Cold	110	2	1	200	1.0	16.0	8	60	25	1	1.0	0.75	36.187559

75 rows × 16 columns

cereal[(cereal['protein'] >= 4) & (cereal['protein'] <= 5)]

	name	mfr	type	calories	protein	fat	sodium	fiber	carbo	sugars	potass	vitamins	shelf	weight	cups	rating
0	100% Bran	N	Cold	70	4	1	130	10.0	5.0	6	280	25	3	1.0	0.33	68.402973
2	All-Bran	K	Cold	70	4	1	260	9.0	7.0	5	320	25	3	1.0	0.33	59.425505
3	All-Bran with Extra Fiber	K	Cold	50	4	0	140	14.0	8.0	0	330	25	3	1.0	0.50	93.704912
41	Life	Q	Cold	100	4	2	150	2.0	12.0	6	95	25	2	1.0	0.67	45.328074
43	Maypo	A	Hot	100	4	1	0	0.0	16.0	3	95	25	2	1.0	1.00	54.850917
44	Muesli Raisins; Dates; & Almonds	R	Cold	150	4	3	95	3.0	16.0	11	170	25	3	1.0	1.00	37.136863
45	Muesli Raisins; Peaches; & Pecans	R	Cold	150	4	3	150	3.0	16.0	11	170	25	3	1.0	1.00	34.139765
56	Quaker Oat Squares	Q	Cold	100	4	1	135	2.0	14.0	6	110	25	3	1.0	0.50	49.511874
57	Quaker Oatmeal	Q	Hot	100	5	2	0	2.7	1.0	1	110	0	1	1.0	0.67	50.828392

The & indicates “and”. This means that both conditions must hold for a row to be included in the new dataframe.

Each condition is wrapped with parentheses to keep them clearly separated.

Only rows present in both dataframes will be selected.

Next, we will look at a case where we filter on two different columns.

Let’s say we only want cereals from the Quaker manufacturer, with a protein content greater than 4.

The same coding syntax can be applied to two different column conditions.

cereal[(cereal['mfr'] == 'Q') & (cereal['protein'] > 4)]

	name	mfr	type	calories	protein	fat	sodium	fiber	carbo	sugars	potass	vitamins	shelf	weight	cups	rating
57	Quaker Oatmeal	Q	Hot	100	5	2	0	2.7	1.0	1	110	0	1	1.0	0.67	50.828392

Multiple Condition Filtering - “or”¶

Suppose that we are interested in cereals that either are made from the Quaker manufacturer OR a protein content above 4.

For a row to be included in the output, we only require one or the other condition to hold.

Instead of using the & symbol, we use | which is called the “pipe operator”. This means “or” in the Python programming language (and many other languages).

cereal[(cereal['mfr'] == 'Q') | (cereal['protein'] > 4)]

	name	mfr	type	calories	protein	fat	sodium	fiber	carbo	sugars	potass	vitamins	shelf	weight	cups	rating
1	100% Natural Bran	Q	Cold	120	3	5	15	2.0	8.0	8	135	0	3	1.0	1.00	33.983679
10	Cap'n'Crunch	Q	Cold	120	1	2	220	0.0	12.0	12	35	25	2	1.0	0.75	18.042851
11	Cheerios	G	Cold	110	6	2	290	2.0	17.0	1	105	25	1	1.0	1.25	50.764999
35	Honey Graham Ohs	Q	Cold	120	1	2	220	1.0	12.0	11	45	25	2	1.0	1.00	21.871292
41	Life	Q	Cold	100	4	2	150	2.0	12.0	6	95	25	2	1.0	0.67	45.328074
54	Puffed Rice	Q	Cold	50	1	0	0	0.0	13.0	0	15	0	3	0.5	1.00	60.756112
55	Puffed Wheat	Q	Cold	50	2	0	0	1.0	10.0	0	50	0	3	0.5	1.00	63.005645
56	Quaker Oat Squares	Q	Cold	100	4	1	135	2.0	14.0	6	110	25	3	1.0	0.50	49.511874
57	Quaker Oatmeal	Q	Hot	100	5	2	0	2.7	1.0	1	110	0	1	1.0	0.67	50.828392
67	Special K	K	Cold	110	6	0	230	1.0	16.0	3	55	25	1	1.0	1.00	53.131324

When we filter using “or” this time, it resulted in 10 cereals that met either of the conditions.

When we filtered using “and”, only 1 cereal met both conditions.

Tilde¶

We saw that when we filter the conditions are expressed with an underlying column with True or False values indicating if the condition has been met in each row of the dataframe.

But what if I wanted the rows that were the complement (or opposite) of this?

cereal['protein'] > 4

   False
   False
   False
   False
   False
      ...  
  False
  False
  False
  False
  False
Name: protein, Length: 77, dtype: bool

(cereal['protein'] > 4).head()

The opposite of `cereal['protein'] > 4` is `cereal['protein'] <= 4`, so
that one isn’t too tricky. But sometimes taking the opposite is not so
straightforward. This is where the `~` (“tilde”) operator can be
helpful.


Tilde converts all the `True` values to `False` and all the `False`
values, to `True.`


```{code-cell} ipython3
(~(cereal['protein'] > 4)).head()

  File "/tmp/ipykernel_2133/3894984370.py", line 3
    The opposite of `cereal['protein'] > 4` is `cereal['protein'] <= 4`, so
               ^
SyntaxError: invalid syntax

Tilde (~) gives us the ability to return the complement of the code following it.

We can obtain the complete dataframe by putting the entire condition within our square brackets like we did before.

What we have here, is taking the rows where the protein content is not greater than four.

cereal[~(cereal['protein'] > 4)]

	name	mfr	type	calories	protein	fat	sodium	fiber	carbo	sugars	potass	vitamins	shelf	weight	cups	rating
0	100% Bran	N	Cold	70	4	1	130	10.0	5.0	6	280	25	3	1.0	0.33	68.402973
1	100% Natural Bran	Q	Cold	120	3	5	15	2.0	8.0	8	135	0	3	1.0	1.00	33.983679
2	All-Bran	K	Cold	70	4	1	260	9.0	7.0	5	320	25	3	1.0	0.33	59.425505
3	All-Bran with Extra Fiber	K	Cold	50	4	0	140	14.0	8.0	0	330	25	3	1.0	0.50	93.704912
4	Almond Delight	R	Cold	110	2	2	200	1.0	14.0	8	1	25	3	1.0	0.75	34.384843
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
72	Triples	G	Cold	110	2	1	250	0.0	21.0	3	60	25	3	1.0	0.75	39.106174
73	Trix	G	Cold	110	1	1	140	0.0	13.0	12	25	25	2	1.0	1.00	27.753301
74	Wheat Chex	R	Cold	100	3	1	230	3.0	17.0	3	115	25	1	1.0	0.67	49.787445
75	Wheaties	G	Cold	100	3	1	200	3.0	17.0	3	110	25	1	1.0	1.00	51.592193
76	Wheaties Honey Gold	G	Cold	110	2	1	200	1.0	16.0	8	60	25	1	1.0	0.75	36.187559

74 rows × 16 columns

This gives us more versatility when filtering, especially when we want the inverse of more complicated conditions and verbs (you’ll see this in Module 3).

Let’s apply what we learned!

1. If the output of

df['location'] == 'Canada'

is

[ True, False, False, True]

What would be the output of

 ~(df['location'] == 'Canada')

 
a) `[True, False, False, True]`      
b) `[False, False, False, False]`       
c) `[True, True, True, True]`         
d) `[False, True, True, False]`         

Solutions!

d) [False, True, True, False]

Programming in Python for Data Science