Quick viz with Altair!

Watch it

See the accompanied youtube video at the link here.

If we want to visualize things using different plots, we can do that pretty quickly and with little code!

Take the cereal object we analyzed in the last section.

cereal
name mfr type calories protein fat sodium ... sugars potass vitamins shelf weight cups rating
0 100% Bran N Cold 70 4 1 130 ... 6 280 25 3 1.0 0.33 68.402973
1 100% Natural Bran Q Cold 120 3 5 15 ... 8 135 0 3 1.0 1.00 33.983679
2 All-Bran K Cold 70 4 1 260 ... 5 320 25 3 1.0 0.33 59.425505
3 All-Bran with Extra Fiber K Cold 50 4 0 140 ... 0 330 25 3 1.0 0.50 93.704912
4 Almond Delight R Cold 110 2 2 200 ... 8 1 25 3 1.0 0.75 34.384843
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
72 Triples G Cold 110 2 1 250 ... 3 60 25 3 1.0 0.75 39.106174
73 Trix G Cold 110 1 1 140 ... 12 25 25 2 1.0 1.00 27.753301
74 Wheat Chex R Cold 100 3 1 230 ... 3 115 25 1 1.0 0.67 49.787445
75 Wheaties G Cold 100 3 1 200 ... 3 110 25 1 1.0 1.00 51.592193
76 Wheaties Honey Gold G Cold 110 2 1 200 ... 8 60 25 1 1.0 0.75 36.187559

77 rows × 16 columns

Let say we are interested in the manufacturer column. It would be great to express the frequency of the item in that column as a bar chart.

But how do we do that?

To do this, we are going to use a very nifty package called Altair.

Altair is a data visualization tool that produces plots relatively easily.

Like any other package we have seen so far, Altair needs to be imported before we can use it.

import altair as alt

We can plot the mfr column frequencies using Altair using the following code.

chart0 = alt.Chart(cereal).mark_bar().encode(
    x='mfr',
    y='count()'
)
chart0

See how quick that was? Just five lines!

Now let’s take a moment and go through the steps of what each line means.

To make a bar plot using altair, we follow the steps below:

1. First, we create an altair plot object using alt.chart().

alt.chart(...)...

2. Next, we pass the dataframe we’d like to plot in to altair.chart(). So here, that is the cereeal dataframe.

alt.chart(cereal)

3. But what kind of plot do we want?! As we said before, a bar chart would be suitable for this type of data. So let’s add .mark_bar() to specify that.

alt.chart(cereal).mark_bar()...

4. Next, we need to say what goes on the y-axis and the x-axis. We do this inside of the encode() call. So inside of encode, we say what should be represented on the y-axis and what should be represented on the x-axis. Here on the x-axis, we put the manufacturer, and on the y-axis, we us count: .encode(x='mfr', y='count()').

alt.chart(cereal).mark_bar().encode(
  x='mfr', 
  y='count()')

count() is used here to count the occurrences or the number of rows in the cereal dataframe that contains a specific manufacturer.

In general, we use count() if we are interested in counting the frequency of each of elements in the x variable.

This gives us all the code necessary for our desired plot now.

For this example we are saving our plot as an object named chart0.

The important things to notice here is that we want create a alt.chart() object and then specify that we want a .mark_bar() graph and then specifying which column using .encode().

Here is our plot again.

chart1 = alt.Chart(cereal, width=500, height=300).mark_bar().encode(
    x='mfr',
    y='count()'
)
chart1

It looks a little different this time. The first time we plotted it, it was a little too small. So inside the alt.Chart call, we added a width and height argument so that we can make the plot bigger.

What else can we plot from our original cereal dataframe named cereal?

Maybe we want to see the relationship between sugars and calories in the cereals?

This would require a scatter plot which can be done by specifying mark_circle instead of mark_bar and in the encode function, we need to say what is going to be on the x and the y axis.

chart2 = alt.Chart(cereal, width=500, height=300).mark_circle().encode(
    x='sugars',
    y='calories'
)
chart2

In this case, we are putting sugars on the x-axis and calories on the y-axis.

Something you may have noticed is that there are 77 cereals but there doesn’t seem to be 77 data points!

That’s because some of them are lying on top of each other with the same sugar and calorie values.

One way we can deal with this is by changing the opacity of each of those points. That way, the darker points represent that there is more than one data point at that point in the chart, and the lightest point represent that there is only one data point there.

We set opacity with opacity in the mark_circle() function and it accepts values between 0 and 1, with 1 being full opacity. Here we set it at 0.3.

chart3 = alt.Chart(cereal, width=500, height=300).mark_circle(opacity=0.3).encode(
    x='sugars',
    y='calories'
)
chart3

Look at that! Now we can see there are multiple cereals that have 3.5g of sugar with 110 calories.

What if you don’t fancy the default plot colour blue?

Well that’s okay, we can change the colour easily using the color argument in .mark_circle().

chart4 = alt.Chart(cereal, width=500, height=300).mark_circle(color='red', opacity=0.3).encode(
    x='sugars',
    y='calories'
)
chart4

Here we have changed the colour to red.

What if the data points seem a little too small? That is no problem, we can also increase these. Again in the mark_circle() call. Here we add an argument where we say the size. So we have changed the size from the default to a size of 80, and we can see that the points are now larger.

chart5 = alt.Chart(cereal, width=500, height=300).mark_circle(color='red', size=80, opacity=0.3).encode(
    x='sugars',
    y='calories'
)
chart5

Every good graph should have a title!

A title provides useful information about what the plot is about.

Let’s take this opportunity to finish off our scatter plot and set the argument title to something as well.

chart6 = alt.Chart(cereal, width=500, height=300).mark_circle(color='red', size=80, opacity=0.3).encode(
    x='sugars',
    y='calories'
).properties(title="Scatter plot sugars vs calories for different cereals")
chart6

So here we have called it “Scatter plot sugars vs calories for different cereals”.

We use the .properties() function to do this.