Lesson 6 Visualization and data exploration

Welcome to lesson 6. In this lesson, you will be introduced to basic data exploration and visualization techniques.

Follow along with this tutorial by creating a new ipython notebook named lesson06.ipynb and entering the code snippets presented.

Visualizing data is a useful technique for 1) data exploration and 2) data presentation. These two purposes require different approaches to building data visualizations. Visualizations that we use simply to explore our data (and keep to ourselves) do not require as much attention to aesthetic elements such as well-described data axes, a key takeaway, and the removal of extraneous chart junk (e.g., grid lines, tick marks, and borders).

Outline

  • Chart types
  • Building charts with pandas
  • Building charts with matplotlib
  • Summary
  • Exercise 6.1
  • Exercise 6.2
  • Assignment 6

6.1 Chart types

The type of chart you select to show your data is dictated by the type of data that you are working with. For example, line charts call for time-series data, where values are recorded by a unit of time such as year, month, week, day, or minute. Without the right kind of data, you may still be able to graph the data, but not in a form that is easy to interpret.

We'll work with the following data and chart types described in the table below.

Data          Chart types
Statistical   histogram, box plot, scatter plot
Categorical   vertical or horizontal bar (to show rank)
Time series   line or area (to show proportional change over time)

There are numerous chart types available in Python through the pandas, matplotlib, and seaborn libraries. In this lesson we'll focus on using pandas to create statistical, categorical, and time series charts, and on the basics of the matplotlib library.

6.2 Building charts with pandas

Since we've been working with the pandas library to import and format data, let's continue to use it to plot our data.

There are two ways to build charts from pandas' DataFrame class: 1) DataFrame.plot(kind) and 2) DataFrame.plot.<kind>.

DataFrame.plot is both a callable method and a namespace attribute for specific plotting methods in the form DataFrame.plot.<kind>.
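To make the two call styles concrete, here is a minimal sketch. The DataFrame named demo and its values are made up for illustration and are not part of the MBA data. Under that assumption, both calls should draw the same bar chart.

```python
import pandas as pd

# Hypothetical data for illustration only
demo = pd.DataFrame({"score": [3, 7, 5]}, index=["a", "b", "c"])

# Style 1: call .plot and pass the chart type as the kind parameter
ax1 = demo.plot(kind="bar", title="Called with kind")

# Style 2: use the method of the same name inside the .plot namespace
ax2 = demo.plot.bar(title="Called through the namespace")

# Both calls return a matplotlib Axes holding one bar per row
print(len(ax1.patches), len(ax2.patches))
```

Which style you use is largely a matter of taste; the namespace form tends to work better with tab completion in a notebook.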

1) DataFrame.plot(kind)

In the first way, the .plot method is called with the kind of chart specified as a parameter.

The format for using the plot method is:

DataFrame[].plot(kind,….)

It has several key parameters:

Parameter     Options
kind          bar, barh, pie, scatter, etc.
color         a list of hex codes, applied in order to each data series / column
linestyle     'solid', 'dotted', 'dashed' (applies to line graphs only)
xlim, ylim    a tuple (lower limit, upper limit) giving the range over which the plot will be drawn
legend        a Boolean value to display or hide the legend
labels        a list with one descriptive name per column in the DataFrame, used for the legend
title         the string title of the plot

Below is an example of how to use .plot on a DataFrame column, using the template DataFrame[].plot(kind,….):

mydata["Tuition"].plot(kind="box",title="Box plot of tuition of the top 25 MBA programs")

After the DataFrame named mydata, we use the square brackets to reference the column to be plotted. Then we use .plot to call the method and pass in the two parameters 1) the kind of plot and 2) the title for the box plot.
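The sketch below combines several of the parameters from the table in one call. It uses a small made-up DataFrame (the name demo, the column names, and the values are assumptions for illustration), so treat it as a starting point rather than a recipe for the MBA data.

```python
import pandas as pd

# Hypothetical two-column data for illustration only
demo = pd.DataFrame(
    {"north": [10, 20, 15], "south": [12, 18, 25]},
    index=["2017", "2018", "2019"],
)

ax = demo.plot(
    kind="bar",
    color=["#4cbea3", "#524939"],  # one hex code per column, in column order
    ylim=(0, 30),                  # (lower limit, upper limit) of the y-axis
    legend=False,                  # hide the legend
    title="Demo bar chart",        # string title of the plot
)
```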

2) DataFrame.plot.<kind>

The second way to plot in pandas is to follow .plot directly with the kind of chart to be plotted. Here .plot acts as a namespace and <kind> is a method inside it. A namespace is just a named container for the methods associated with an object; grouping the chart methods under .plot ensures that their names do not conflict with other names that might already be used elsewhere.

The format for using the plot method is:

DataFrame.plot.<kind>() where kind is the type of chart. These include area, bar, etc. See the table below for the most popular chart types.

Method                                          Description
DataFrame.plot.area(self[, x, y])               Draw a stacked area plot.
DataFrame.plot.bar(self[, x, y])                Vertical bar plot.
DataFrame.plot.barh(self[, x, y])               Make a horizontal bar plot.
DataFrame.plot.density(self[, bw_method, ind])  Generate Kernel Density Estimate plot using Gaussian kernels.
DataFrame.plot.hist(self[, by, bins])           Draw one histogram of the DataFrame’s columns.
DataFrame.plot.line(self[, x, y])               Plot Series or DataFrame as lines.
DataFrame.plot.pie(self, **kwargs)              Generate a pie plot.
DataFrame.plot.scatter(self, x, y[, s, c])      Create a scatter plot with varying marker point size and color.
DataFrame.boxplot(self[, column, by, ax, …])    Make a box plot from DataFrame columns.
DataFrame.hist(data[, column, by, grid, …])     Make a histogram of the DataFrame’s column.

mydata.boxplot('Avg_Age_Students')

Let's work through a few examples with the MBA data. We need to import the data and rename the columns. Renaming columns is especially important for graphing. Spaces and long variable names make code messy and error prone. When importing the data, set index_col="School". This will be helpful when we visualize our data.

import pandas as pd
# read the first row as the header and use the School column as the index
mydata = pd.read_csv("http://becomingvisual.com/python4data/mba.csv", header=0, index_col="School")
mydata.columns = ['Rank', 'Country', 'Average_salary', 'Pre_Salary', 'Grad_Jobs', 'PhD', 'Avg_Age_Students',
                  'Avg_Work_Experience', 'Tuition', 'Duration']

6.2.1 Box plot

Let's examine the distribution of our data. Box plots are effective at showing the center and spread of a variable: the median, the quartiles, and any outliers. To create a box plot of one variable, select the column and call .plot with kind="box". The variable must be a numeric data type such as an int or float.

mydata['Tuition'].plot(kind="box", title="Box plot of tuition of the top 25 MBA programs")

The alternative to using the .plot(kind="box") method is to use the pandas .boxplot method.

mydata.boxplot('Tuition')

To create a box plot of multiple variables, you simply pass in a list of several variables, such as the average student age and average work experience, as shown below:

mydata.boxplot(column=['Avg_Age_Students','Avg_Work_Experience'])

Formatting with .boxplot().

The .boxplot method also accepts formatting parameters directly, such as grid=False to suppress the grid or fontsize=14 to change the font size:

mydata.boxplot(column=['Avg_Age_Students','Avg_Work_Experience'], grid=False, fontsize=14)
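To see what a box plot is summarizing, it can help to compute the underlying numbers by hand. The sketch below uses a short made-up Series (the values are assumptions, not the real tuition data) and the 1.5 × IQR outlier rule that matplotlib applies by default. For brevity it checks only the upper side.

```python
import pandas as pd

# Hypothetical tuition values for illustration only
tuition = pd.Series([40, 50, 60, 70, 80, 200], name="Tuition")

q1 = tuition.quantile(0.25)      # bottom edge of the box
median = tuition.quantile(0.50)  # line inside the box
q3 = tuition.quantile(0.75)      # top edge of the box
iqr = q3 - q1                    # height of the box

# Points more than 1.5 * IQR above the box are drawn individually as outliers
upper_fence = q3 + 1.5 * iqr
outliers = tuition[tuition > upper_fence]

print(q1, median, q3, list(outliers))
```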

6.2.2 Histograms

If we want to plot a simple histogram based on a single column, we can call .plot on a column of data and set kind='hist'. Notice the bins=4 parameter. This specifies the number of bars, i.e., bins, that the values should be arranged into.

mydata['Tuition'].plot(kind='hist', title='Tuition', bins=4)
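To see how the bins parameter groups values, the following sketch counts a few made-up numbers into 4 equal-width bins with numpy.histogram, the function matplotlib relies on when it draws a histogram. The values are assumptions chosen only to make the arithmetic easy to follow.

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration only
values = pd.Series([1, 2, 2, 3, 5, 6, 8, 9])

# Split the range from the minimum to the maximum into 4 equal-width bins
# and count how many values fall into each one
counts, edges = np.histogram(values, bins=4)

print(edges)   # the 5 boundaries that delimit the 4 bins
print(counts)  # the height of each bar
```

Each bin includes its left edge but not its right edge, except for the last bin, which includes both.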

6.2.3 Scatter plots

Is there a relationship between rank and the percentage of graduates that have jobs? We can show the answer to this question using a scatter plot.

mydata.plot(kind='scatter', x='Rank', y='Grad_Jobs', title='MBA Program Rank vs Percentage of graduates with jobs', xlim=0)
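A scatter plot answers the question visually. As a numeric companion, one option is to compute the correlation coefficient with the pandas Series.corr method. The sketch below is only an illustration: the five rows are made up, so the resulting number says nothing about the real MBA data.

```python
import pandas as pd

# Hypothetical rank and job-rate values for illustration only
demo = pd.DataFrame({
    "Rank": [1, 2, 3, 4, 5],
    "Grad_Jobs": [98, 95, 96, 90, 88],
})

# Pearson correlation: near -1 or 1 suggests a strong linear relationship,
# near 0 suggests little or none
r = demo["Rank"].corr(demo["Grad_Jobs"])

# Show the coefficient in the chart title next to the points themselves
ax = demo.plot(kind="scatter", x="Rank", y="Grad_Jobs", title=f"r = {r:.2f}")
```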

6.2.4 Bar charts

Let's continue to use the same MBA ranking data set to explore the plotting capabilities of the pandas .plot() method.

We'll start with a basic bar chart. What should we show? Well, upon examination of the data, perhaps we just want to show the tuition for each of the top 5 schools plotted side by side. This allows for easy comparison, and we'll easily be able to see which of the top 5 schools is the least expensive and which is the most expensive (in USD).

# out of the box: bars follow the row order of the DataFrame, with the first row at the bottom
mydata.plot(kind='barh', y='Tuition', title='Total tuition(USD) for the top MBA programs')

Refine the bar chart with a few aesthetic elements.

  • Importing rcParams from matplotlib lets us call its .update method to turn on the figure.autolayout setting. This helps prevent labels from being cut off.
  • The legend parameter is set to False to remove the redundant legend.
  • The color parameter sets the bar color as an RGBA tuple: the amounts of red, green, and blue plus the transparency (alpha), each given as a number between 0 and 1.
  • The edgecolor parameter allows you to set the border color of the bars.

from matplotlib import rcParams
rcParams.update({'figure.autolayout': True}) # keeps labels from being cut off
mydata.head().plot(kind='barh', y='Tuition',
                   title='Total tuition(USD) for the top 5 ranked MBA programs',
                   legend=False, color=(0.2, 0.4, 0.6, 0.6), edgecolor='white')
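If the goal is to see at a glance which school is the least or most expensive, one option is to sort the rows by value before plotting. The sketch below uses made-up school names and tuition figures (assumptions for illustration only) together with the pandas sort_values method.

```python
import pandas as pd

# Hypothetical tuition figures for illustration only
demo = pd.DataFrame(
    {"Tuition": [120, 95, 140]},
    index=["School A", "School B", "School C"],
)

# barh draws the first row at the bottom, so sorting in ascending order
# places the most expensive school at the top of the chart
ranked = demo.sort_values("Tuition")

ax = ranked.plot(kind="barh", y="Tuition", legend=False,
                 title="Tuition, sorted (demo data)")
```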

6.2.5 Line charts

Line charts are the most common way to visualize time series data. Let's say we wanted to see the change in the price of three securities from May 1 to May 31: Amazon, Google, and Facebook.

Download the data from: http://becomingvisual.com/python4data/timeseries_stockprice.csv

Let's begin by importing the data.

stock = pd.read_csv("http://becomingvisual.com/python4data/timeseries_stockprice.csv", header=0) #add header

Next, view the data to understand its structure and variable names.

stock
       Date    Amazon  Google  Facebook
0    5/1/18   927.800  901.94   151.740
1    5/2/18   946.645  909.62   153.340
2    5/3/18   946.000  914.86   153.600
3    5/4/18   944.750  926.07   150.170
4    5/7/18   940.520  933.54   151.450
5    5/8/18   940.950  926.12   150.710
6    5/9/18   952.800  936.95   151.490
7   5/10/18   953.500  931.98   150.230
8   5/11/18   945.110  925.32   150.310
9   5/14/18   954.500  931.53   150.400
10  5/15/18   958.730  932.95   150.170
11  5/16/18   961.000  940.00   150.110
12  5/17/18   954.700  935.67   148.000
13  5/18/18   944.800  921.00   144.720
14  5/21/18   962.840  931.47   148.445
15  5/22/18   964.000  935.00   148.080
16  5/23/18   975.020  947.92   148.520
17  5/24/18   976.000  952.98   148.510
18  5/25/18   984.850  957.33   150.300
19  5/29/18   995.000  969.70   152.230
20  5/30/18   996.510  970.31   151.970
21  5/31/18  1000.000  975.02   152.700

Plotting a single series

To plot a single series, such as the closing stock price for Amazon for each date, you can simply designate the x-axis values as Date and the y-axis values as Amazon and set the kind parameter to line.

stock.plot(kind="line", x="Date", y="Amazon")

The alternative way to plot a line is shown below:

stock.plot.line(x="Date", y="Amazon")

Plotting two series

When plotting two series, set the y-axis values to a list containing the two column names, each being a data series.

stock.plot(kind="line", x="Date", y=["Amazon", "Google"])

Plotting three series

When plotting three series, set the y-axis values to a list containing the three column names, each being a data series.

stock.plot(kind="line", x="Date", y=["Amazon", "Google", "Facebook"])
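The Date column in the printed table looks like text in month/day/two-digit-year form, which is an assumption read off the output above. If so, converting it to real dates lets the x-axis be spaced by time rather than by row position. A minimal sketch, rebuilding the first three rows by hand so that it runs without downloading the file:

```python
import pandas as pd

# First three rows of the stock data shown above, typed in by hand
stock_demo = pd.DataFrame({
    "Date": ["5/1/18", "5/2/18", "5/3/18"],
    "Amazon": [927.800, 946.645, 946.000],
})

# Convert the text dates into datetime values
# (assumed format: month/day/two-digit year)
stock_demo["Date"] = pd.to_datetime(stock_demo["Date"], format="%m/%d/%y")

# Plot as before; the x-axis now holds real dates
ax = stock_demo.plot(kind="line", x="Date", y="Amazon")
```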

6.3 Building charts with matplotlib

Matplotlib is a Python data visualization library. One of its modules is pyplot, so you will often see matplotlib.pyplot in code. The module provides an interface that allows you to implicitly and automatically create figures and axes to achieve the desired plot (Govani 2019).

The way you build charts in matplotlib differs from pandas.

A matplotlib figure can be broken down into several parts:

  • Figure: This is the whole figure, which may contain one or more axes (plots). You can think of a Figure as a canvas that contains plots.

  • Axes: This is what we generally think of as a plot. A Figure can contain many Axes. Each Axes contains two (or, in the case of 3D, three) Axis objects and has a title, an x-label, and a y-label.

  • Axis: These are the number-line-like objects that take care of generating the graph limits.

  • Artist: Everything that one can see on the figure is an Artist, such as Text objects, Line2D objects, and collection objects. Most Artists are tied to an Axes.
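The parts listed above can be inspected directly in code. The following sketch builds a one-plot figure from made-up numbers and prints each part; the variable names are illustrative.

```python
import matplotlib.pyplot as plt
from matplotlib.artist import Artist

fig = plt.figure()                     # Figure: the canvas
ax = fig.add_subplot(111)              # Axes: one plot on the canvas
line, = ax.plot([1, 2, 3], [2, 4, 8])  # Line2D: an Artist drawn on the Axes

print(fig.axes)            # the list of Axes that the Figure contains
print(ax.xaxis, ax.yaxis)  # the two Axis objects that belong to this Axes
print(isinstance(line, Artist), isinstance(ax, Artist))
```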

STEP 1 First you need to import matplotlib and the pyplot module.

STEP 2 Next you can call the figure() method to construct / initialize the figure. You can think of a figure as a canvas to which you can add your plots.

STEP 3 Next, you need to draw (or add) the axes to the figure.

The add_subplot() method poses a small puzzle of its own, because it is called as add_subplot(111). What does 111 mean? Well, 111 is equal to 1,1,1, which means that you actually give three arguments to add_subplot(). The three arguments designate the number of rows (1), the number of columns (1), and the plot number (1). So you actually make one subplot.

STEP 4 Plot the chart. In the example below, we are using the .scatter() method and passing in our x and y values.

STEP 5 Call the .show() method to show the plot.

# 1. Import the necessary packages and modules
import matplotlib.pyplot as plt

# 2. Create a Figure
fig = plt.figure()

# 3. Set up Axes
ax = fig.add_subplot(111)

# 4. Scatter the data
ax.scatter(mydata["Rank"], mydata["Grad_Jobs"])

# 5. Show the plot
plt.show()
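The three-number argument to add_subplot() becomes clearer with more than one plot. As a small sketch, the code below lays out two empty plots side by side in a grid of 1 row and 2 columns; the titles are placeholders.

```python
import matplotlib.pyplot as plt

fig = plt.figure()

# 1 row, 2 columns: the third argument picks the position, counted from 1
left = fig.add_subplot(1, 2, 1)
right = fig.add_subplot(122)  # shorthand for add_subplot(1, 2, 2)

left.set_title("Plot 1")
right.set_title("Plot 2")
```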

Customizing matplotlib charts

In addition to the steps above, you have the ability to customize your charts.

# Import the necessary packages and modules
import matplotlib.pyplot as plt

# Create a Figure
fig = plt.figure()

# Adding a source below the graph
fig.text(x=0, y=0, s='Source: Financial Times (2012)', horizontalalignment='left',color='#524939')

# Set up Axes
ax = fig.add_subplot(111, frameon=False)
#axis_bgcolor='#FFFFFF'

# Set x and y limits
ax.set_xlim(0,30)
ax.set_ylim(80,100)

# Add labels for x and y axes and a title
ax.set_xlabel("Rank")
ax.set_ylabel("Job Rate")
ax.set_title("Relationship between MBA Program Ranking and Job Rate (%)")

#remove ticks
ax.tick_params(axis='both', which='both', length=0)

# Scatter the data
ax.scatter(mydata["Rank"], mydata["Grad_Jobs"], color="#4cbea3")

# Show the plot
plt.show()
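The pandas .plot method returns the matplotlib Axes it drew on, so the same customization calls can be applied to a chart built with pandas. The sketch below uses a small made-up DataFrame (an assumption for illustration) and reuses the labels and color from the example above.

```python
import pandas as pd

# Hypothetical rank and job-rate values for illustration only
demo = pd.DataFrame({"Rank": [1, 2, 3], "Grad_Jobs": [98, 95, 96]})

# pandas draws the chart and returns the matplotlib Axes it used
ax = demo.plot(kind="scatter", x="Rank", y="Grad_Jobs", c="#4cbea3")

# The same Axes methods work on the returned object
ax.set_xlabel("Rank")
ax.set_ylabel("Job Rate")
ax.set_title("Rank vs. job rate (demo data)")
ax.tick_params(axis="both", which="both", length=0)
```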

Matplotlib chart types

The full list of chart types available through the matplotlib graphing library can be found at: https://matplotlib.org/gallery/index.html

6.4 Summary

  • Data can be visualized in python using functions from the pandas, matplotlib, and seaborn libraries.
  • Common chart types for exploratory analysis include: histograms, box plots, and scatter plots.
  • Bar plots are useful for showing quantities by category.

Exercise 6.1

  1. Using the Big Mac data from July 2019, create a barh chart of the top 10 countries based on GDP.

  2. Create another barh with dollar_price and dollar_ppp (requires 2 y-value parameters).

Exercise 6.2

Load the Economist TV show data into a data frame called tv. The data lists TV shows that aired within the last 10 years with their corresponding season number. The data can be found at: http://becomingvisual.com/python4data/tv.csv

  1. Create a scatter plot displaying the relationship between average rating(x) and season number(y).

  2. Remove all shows that have been on for less than 7 seasons and re-plot the chart.

  3. Recreate this same chart using matplotlib.

Assignment 6a

  1. Using the Economist TV data, create a new data frame that pulls in the historical ratings of Law and Order Special Victims Unit.

Download the data from http://becomingvisual.com/python4data/tv.csv

  2. Create a line chart of the data, add a title and change the color of the line.

  3. What happened in 2015? Pull in the ratings of all Crime, Drama, Mystery genres for this year. Where does Law & Order stack in comparison? (hint: find the mean rating)

Assignment 6b

The data project. Identify a data set. Create a report.

References

Govani, Killol. 2019. “Matplotlib Tutorial: Learn Basics of Python’s Powerful Plotting Library.” https://towardsdatascience.com/matplotlib-tutorial-learn-basics-of-pythons-powerful-plotting-library-b5d1b8f67596.