Lesson 6 Visualization and data exploration

Welcome to lesson 6. In this lesson, you will be introduced to basic data visualization and data exploration techniques.

Follow along with this tutorial by creating a new notebook in JupyterHub named lesson06.ipynb and entering the code snippets presented.

Visualizing data is a useful technique for 1) data exploration and 2) data presentation. These two purposes require different approaches to how we build the visualizations. Charts that we use simply to explore our data (and keep to ourselves) do not require as much attention to aesthetic elements, such as well-labeled axes, a stated key takeaway, and the removal of chart junk (e.g., extraneous grid lines, tick marks, and borders). However, even for exploration, the type of chart and the use of color must be appropriate for the type of data.

Outline

  • Chart types
  • Building charts with pandas .plot
  • Building charts with matplotlib

6.1 Chart types

The type of chart you select to show your data is dictated by the type of data that you are working with. For example, line charts require time-series data, i.e., values recorded over units of time such as year, month, week, day, or minute. Without the right data, you may still be able to graph the data, but not in an interpretable or correct form.

We’ll work with the following data and chart types described in the table below.

Data          Chart types
Statistical   histogram, box plot, scatter plot
Categorical   vertical or horizontal bar (to show rank)
Time series   line or area (to show proportional change over time)

There are numerous chart types available in Python through the pandas, matplotlib, and seaborn libraries. In this lesson we’ll focus on using pandas to create statistical, categorical, and time series charts.

6.2 Building charts with pandas

Since we’ve been working with the pandas library, let’s continue to use pandas to plot our data.

There are two ways to build charts from the pandas DataFrame class: 1) DataFrame.plot(kind) and 2) DataFrame.plot.<kind>

DataFrame.plot is both a callable method and a namespace attribute that holds specific plotting methods of the form DataFrame.plot.<kind>. Put simply, there are two ways to plot.
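As a minimal sketch of the two forms, the snippet below draws the same bar chart twice. The toy DataFrame and its sales column are made up for illustration and are not part of the lesson’s data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; can be dropped inside a notebook
import pandas as pd

# Hypothetical toy data, not the lesson's MBA data set
toy = pd.DataFrame({"sales": [3, 5, 4, 6]}, index=["Q1", "Q2", "Q3", "Q4"])

# Form 1: call .plot and pass the chart type through the kind parameter
ax_call = toy.plot(kind="bar", title="Form 1: plot(kind='bar')")

# Form 2: pick the chart type from the .plot namespace
ax_namespace = toy.plot.bar(title="Form 2: plot.bar()")

# Each call draws one bar per row of the DataFrame
print(len(ax_call.patches), len(ax_namespace.patches))
```

Both calls return a matplotlib Axes object, so which form to use is largely a matter of taste.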

1) DataFrame.plot(kind)

In the first way, the .plot method is called and the type (kind) of chart is passed as a parameter, along with any others.

The format for using the plot method is:

DataFrame[<column>].plot(kind, …)

It has several key parameters:

Parameter     Options
kind          bar, barh, pie, scatter, etc.
color         accepts an array of hex codes corresponding sequentially to each data series / column
linestyle     'solid', 'dotted', 'dashed' (applies to line graphs only)
xlim, ylim    a tuple (lower limit, upper limit) giving the axis range over which the plot will be drawn
legend        a boolean value to display or hide the legend
labels        a list of descriptive names for the legend, one per column in the DataFrame
title         the string title of the plot

Below is an example of how to use .plot with a DataFrame using the template: DataFrame[<column>].plot(kind, …)

mydata["Tuition"].plot(kind="box",title="Box plot of tuition of the top 25 MBA programs")

After the DataFrame name mydata, we use square brackets to reference the column to be plotted. Then we use .plot to call the method and pass in two parameters: 1) the kind of plot and 2) the title for the box plot.
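Several of the parameters from the table can be combined in a single call. The sketch below uses a made-up two-column DataFrame (the column names north and south are hypothetical) to show color, linestyle, ylim, legend, and title working together on a line chart.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; can be dropped inside a notebook
import pandas as pd

# Hypothetical toy data: two series, so two colors are needed
toy = pd.DataFrame({"north": [1.0, 2.0, 3.0, 2.5],
                    "south": [2.0, 1.5, 2.5, 3.5]})

ax = toy.plot(
    kind="line",
    color=["#1f77b4", "#ff7f0e"],  # one hex code per column, in column order
    linestyle="dashed",            # applies to every line in the chart
    ylim=(0, 5),                   # (lower limit, upper limit) of the y axis
    legend=True,                   # show the legend
    title="Toy line chart",
)

print(ax.get_ylim())
```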

2) DataFrame.plot.<kind>

In the second way, .plot is followed directly by the kind of chart to be plotted. Here .plot acts as a namespace attribute. A namespace is just a named grouping of methods associated with an object; grouping them this way ensures they do not conflict with names that might already be used elsewhere.

The format for using the plot method is:

DataFrame.plot.<kind>() where kind is the type of chart. These include area, bar, etc. See the table below for the most popular chart types.

Method                                            Description
DataFrame.plot.area(self[, x, y])                 Draw a stacked area plot.
DataFrame.plot.bar(self[, x, y])                  Vertical bar plot.
DataFrame.plot.barh(self[, x, y])                 Make a horizontal bar plot.
DataFrame.plot.density(self[, bw_method, ind])    Generate Kernel Density Estimate plot using Gaussian kernels.
DataFrame.plot.hist(self[, by, bins])             Draw one histogram of the DataFrame’s columns.
DataFrame.plot.line(self[, x, y])                 Plot Series or DataFrame as lines.
DataFrame.plot.pie(self, **kwargs)                Generate a pie plot.
DataFrame.plot.scatter(self, x, y[, s, c])        Create a scatter plot with varying marker point size and color.
DataFrame.boxplot(self[, column, by, ax, …])      Make a box plot from DataFrame columns.
DataFrame.hist(data[, column, by, grid, …])       Make a histogram of the DataFrame’s column.

Let’s work through a few examples with the MBA data. We need to import the data and rename the columns. Renaming columns is especially important for graphing: spaces and long variable names make code messy and error prone. When importing the data, set index_col="School". This will be helpful when we visualize our data, because the school names become the row labels on the charts.

import pandas as pd
mydata = pd.read_csv("mba.csv", header=0, index_col="School")  # first row is the header; School becomes the index
mydata.columns = ['Rank', 'Country', 'Average_salary', 'Pre_Salary', 'Grad_Jobs', 'PhD', 'Avg_Age_Students', 
                   'Avg_Work_Experience', 'Tuition', 'Duration']
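One detail worth noting: the column used as the index is no longer counted among the columns, so the list of new names must leave it out. The sketch below illustrates this with a three-row stand-in for mba.csv. The school names, headers, and numbers are invented for illustration only.

```python
import io
import pandas as pd

# Hypothetical stand-in for mba.csv; every value here is made up
csv_text = """School,Rank,Avg Age of Students,Tuition (USD)
School A,1,28,100000
School B,2,29,90000
School C,3,27,110000
"""

# School becomes the row index, so only three data columns remain
toy = pd.read_csv(io.StringIO(csv_text), header=0, index_col="School")

# Rename the remaining columns: short names, no spaces
toy.columns = ["Rank", "Avg_Age_Students", "Tuition"]

# Rows can now be looked up by school name
print(toy.loc["School B", "Tuition"])
```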

6.2.1 Box plot

Let’s examine how our data is distributed. Box plots are effective at showing measures of central tendency and spread: the median, the quartiles, and any outliers. To create a box plot of one variable, select the column and call the .plot method with kind="box". The variable must be a numeric data type such as an int or float.

mydata['Tuition'].plot(kind="box", title="Box plot of tuition of the top 25 MBA programs")

The alternative to using the .plot(kind="box") method is to use the pandas .boxplot method.

mydata.boxplot('Avg_Age_Students')

To create a box plot of multiple variables, pass in a list of several variables, such as the average student age and average work experience, as shown below:

mydata.boxplot(column=['Avg_Age_Students','Avg_Work_Experience'])

Formatting with .boxplot()

The .boxplot method takes formatting options directly as parameters, such as suppressing the grid (grid=False) or changing the font size (e.g., fontsize=14):

mydata.boxplot(column=['Avg_Age_Students','Avg_Work_Experience'], grid=False, fontsize=14)
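To read a box plot it helps to know which numbers it draws. The sketch below computes them by hand for a made-up list of ages, assuming matplotlib’s default whisker rule of 1.5 times the interquartile range. The values are hypothetical and do not come from mba.csv.

```python
import pandas as pd

# Hypothetical ages; 40 is deliberately far from the rest
ages = pd.Series([26, 27, 28, 28, 29, 30, 31, 40], name="Avg_Age_Students")

# The box spans the first to third quartile, with a line at the median
q1, median, q3 = ages.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points beyond these fences are drawn as individual outlier markers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = ages[(ages < lower_fence) | (ages > upper_fence)]

print(q1, median, q3)
print(outliers.tolist())
```

The whiskers then extend to the most extreme values that still fall inside the fences.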

6.2.2 Histograms

If we want to plot a simple histogram based on a single column, we can call .plot on a column of data and set kind='hist'. Notice the bins=4 parameter. This specifies the number of bars, i.e., the bins that the values should be arranged into.

mydata['Tuition'].plot(kind='hist', title='Tuition', bins=4)
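To see what bins=4 does to the values, the sketch below counts a made-up tuition series into four equal-width bins with NumPy and compares the result with the bar heights of the pandas histogram. The numbers are hypothetical (thousands of USD), not taken from mba.csv.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; can be dropped inside a notebook
import numpy as np
import pandas as pd

# Hypothetical tuition values, in thousands of USD
tuition = pd.Series([40, 55, 60, 62, 70, 85, 90, 120])

# Four equal-width bins between the minimum (40) and maximum (120)
counts, edges = np.histogram(tuition, bins=4)
print(edges.tolist())   # the five bin boundaries
print(counts.tolist())  # how many values fall in each bin

# The histogram drawn by pandas has one bar per bin, with matching heights
ax = tuition.plot(kind="hist", bins=4, title="Tuition")
heights = [int(bar.get_height()) for bar in ax.patches]
print(heights)
```

Changing bins changes the story the chart tells, so it is worth trying a few values while exploring.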

6.2.3 Scatter plots

Is there a relationship between rank and the percentage of graduates that have jobs? We can show the answer to this question using a scatter plot.

mydata.plot(kind='scatter', x='Rank', y='Grad_Jobs', title='MBA Program Rank vs Percentage of graduates with jobs')

You can refine the chart by setting the starting value of the x axis to 0 with xlim=0.

mydata.plot(kind='scatter', x='Rank', y='Grad_Jobs', title='MBA Program Rank vs Percentage of graduates with jobs', xlim=0)
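A scatter plot answers the question visually; a correlation coefficient puts a number on it. The sketch below uses five made-up rows (not the real MBA values) and the pandas Series.corr method, which computes the Pearson correlation by default.

```python
import pandas as pd

# Hypothetical values for five programs
toy = pd.DataFrame({"Rank": [1, 2, 3, 4, 5],
                    "Grad_Jobs": [95, 93, 94, 90, 88]})

# Pearson correlation: -1 is a perfect negative relationship, +1 a perfect positive one
r = toy["Rank"].corr(toy["Grad_Jobs"])
print(round(r, 2))
```

A negative value here would mean that better-ranked programs (lower rank numbers) tend to place a higher share of graduates in jobs.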

6.2.4 Bar charts

Let’s continue to use the same MBA ranking data set to explore the capabilities of pandas plot().

We’ll start with a basic bar chart. What should we show? Upon examining the data, perhaps we just want to show the tuition for each of the top 5 schools plotted side by side. This allows for easy comparison, and we’ll easily be able to see which of the top 5 schools is the least and most expensive (in USD).

# out of the box: one bar per school, in the order the rows appear in the data
mydata.plot(kind='barh', y='Tuition', title='Total tuition(USD) for the top MBA programs')

Refine the bar chart with a few aesthetic elements.

  • Importing rcParams from matplotlib gives access to its .update method, which we use to turn on the autolayout feature for graphics. This helps keep labels from being cut off.
  • The legend parameter is set to False to remove the redundant legend.
  • The color parameter can set the color using RGB values. RGB is a way of making colors: you provide an amount of red, green, and blue, plus the transparency, each as a number between 0 and 1, and the result is a color.
  • The edgecolor parameter allows you to set the color of the border of the bars.

from matplotlib import rcParams
rcParams.update({'figure.autolayout': True}) # keeps labels from being cut off

mydata.head().plot(kind='barh', y='Tuition',title='Total tuition(USD) for the top 5 ranked MBA programs', legend=False, color=(0.2, 0.4, 0.6, 0.6), edgecolor='white')
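Bars are drawn in the order of the rows, so a bar chart meant to show rank usually benefits from sorting first. The sketch below sorts a made-up three-school DataFrame by tuition before plotting. The school names and amounts are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; can be dropped inside a notebook
import pandas as pd

# Hypothetical tuition figures in USD
toy = pd.DataFrame({"Tuition": [90000, 110000, 100000]},
                   index=["School A", "School B", "School C"])

# barh draws the first row at the bottom, so an ascending sort
# places the most expensive school at the top of the chart
ranked = toy.sort_values("Tuition")

ax = ranked.plot(kind="barh", y="Tuition", legend=False,
                 title="Tuition (USD), sorted")

print(ranked.index.tolist())
```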

6.2.5 Line charts

Line charts are the most common way to visualize time series data. Let’s say we wanted to show the change in the price of three securities from May 1 to May 31: Amazon, Google, and Facebook.

Download the data from: http://becomingvisual.com/python4data/timeseries_stockprice.csv

Let’s begin by importing the data.

stock = pd.read_csv("timeseries_stockprice.csv", header=0) #add header

Next, view the data to understand its structure and variable names.

stock
       Date    Amazon  Google  Facebook
0    5/1/18   927.800  901.94   151.740
1    5/2/18   946.645  909.62   153.340
2    5/3/18   946.000  914.86   153.600
3    5/4/18   944.750  926.07   150.170
4    5/7/18   940.520  933.54   151.450
5    5/8/18   940.950  926.12   150.710
6    5/9/18   952.800  936.95   151.490
7   5/10/18   953.500  931.98   150.230
8   5/11/18   945.110  925.32   150.310
9   5/14/18   954.500  931.53   150.400
10  5/15/18   958.730  932.95   150.170
11  5/16/18   961.000  940.00   150.110
12  5/17/18   954.700  935.67   148.000
13  5/18/18   944.800  921.00   144.720
14  5/21/18   962.840  931.47   148.445
15  5/22/18   964.000  935.00   148.080
16  5/23/18   975.020  947.92   148.520
17  5/24/18   976.000  952.98   148.510
18  5/25/18   984.850  957.33   150.300
19  5/29/18   995.000  969.70   152.230
20  5/30/18   996.510  970.31   151.970
21  5/31/18  1000.000  975.02   152.700

Plotting a single series

To plot a single series, such as the closing stock price for Amazon for each date, you can simply designate the x-axis values as Date and the y-axis values as Amazon and set the kind parameter to line.

stock.plot(kind="line", x="Date", y="Amazon")

The alternative way to plot a line is shown below:

stock.plot.line(x="Date", y="Amazon")
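When read from the CSV, the Date column holds plain text such as 5/1/18. A possible refinement, sketched below, is to convert it to real dates with pd.to_datetime so that the x axis reflects actual time, including the weekend gap between 5/4 and 5/7. The sketch inlines the first five rows of the table above so that it runs without the file; the format string assumes month/day/two-digit-year, as the data suggests.

```python
import io
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; can be dropped inside a notebook
import pandas as pd

# First five rows of the stock table shown above
csv_text = """Date,Amazon,Google,Facebook
5/1/18,927.800,901.94,151.740
5/2/18,946.645,909.62,153.340
5/3/18,946.000,914.86,153.600
5/4/18,944.750,926.07,150.170
5/7/18,940.520,933.54,151.450
"""
stock = pd.read_csv(io.StringIO(csv_text), header=0)

# Convert the text dates to datetime values (assumed format: month/day/year)
stock["Date"] = pd.to_datetime(stock["Date"], format="%m/%d/%y")

# Friday 5/4 to Monday 5/7 is a three-day step on a real time axis
gap_days = (stock["Date"].iloc[4] - stock["Date"].iloc[3]).days
print(gap_days)

ax = stock.plot(kind="line", x="Date", y="Amazon")
```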

Plotting two series

When plotting two series, set the y-axis values to a list containing the two column names, each being a data series.

stock.plot(kind="line", x="Date", y=["Amazon", "Google"])

Plotting three series

When plotting three series, set the y-axis values to a list containing the three column names, each being a data series.

stock.plot(kind="line", x="Date", y=["Amazon", "Google", "Facebook"])
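Because Facebook trades near 150 while Amazon and Google trade above 900, its line looks almost flat when all three share one y axis. One common workaround, offered here as an optional sketch rather than part of the lesson’s method, is to rescale every series so that it starts at 100. The three rows below are copied from the table above.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; can be dropped inside a notebook
import pandas as pd

# Three rows taken from the stock table shown above
stock = pd.DataFrame({
    "Date": ["5/1/18", "5/15/18", "5/31/18"],
    "Amazon": [927.800, 958.730, 1000.000],
    "Google": [901.94, 932.95, 975.02],
    "Facebook": [151.740, 150.170, 152.700],
})
prices = stock.set_index("Date")

# Divide each column by its first value, then scale to 100
indexed = prices / prices.iloc[0] * 100

ax = indexed.plot(kind="line", title="Price indexed to 100 on 5/1/18")
print(indexed.round(2))
```

On this scale every line starts at the same point, so relative gains and losses can be compared directly.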

6.3 Building charts with matplotlib

Matplotlib is the whole Python data visualization package. One module in the matplotlib package is pyplot; you will often see matplotlib.pyplot in code. The module provides an interface that allows you to implicitly and automatically create figures and axes to achieve the desired plot.

The way you build charts in matplotlib is different from pandas.

A Matplotlib figure can be categorized into several parts, as below:

  • Figure: The whole figure, which may contain one or more Axes (plots). You can think of a Figure as a canvas which contains plots.

  • Axes: What we generally think of as a plot. A Figure can contain many Axes. Each Axes contains two (or three, in the case of 3D) Axis objects, and each has a title, an x-label and a y-label.

  • Axis: The number-line-like objects, which take care of generating the graph limits.

  • Artist: Everything that one can see on the figure is an Artist, such as Text objects, Line2D objects, and collection objects. Most Artists are tied to an Axes.

Source: Killol Govani, https://towardsdatascience.com/matplotlib-tutorial-learn-basics-of-pythons-powerful-plotting-library-b5d1b8f67596
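These parts also connect the two libraries: pandas draws its charts with matplotlib, so the object returned by a pandas plot call is a matplotlib Axes that can be refined with Axes methods. The sketch below shows this with a made-up three-row DataFrame; the labels are placeholders.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; can be dropped inside a notebook
import pandas as pd
from matplotlib.axes import Axes

# Hypothetical toy data
toy = pd.DataFrame({"value": [3, 1, 2]}, index=["a", "b", "c"])

# pandas returns the Axes it drew on
ax = toy.plot.bar(rot=0)
print(isinstance(ax, Axes))

# From here on, ordinary matplotlib Axes methods apply
ax.set_xlabel("Category")
ax.set_ylabel("Value")
ax.set_title("Sample plot")

# The Figure that contains the Axes is also reachable
fig = ax.get_figure()
```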

First you need to import matplotlib’s pyplot module. Next you can call the figure() function to construct (initialize) the figure. You can think of a figure as a canvas to which you can add your plots. Next, you need to draw (or add) the axes to the figure.

# Import `pyplot`
import matplotlib.pyplot as plt

# Initialize a Figure 
fig = plt.figure()

# Add Axes to the Figure: [left, bottom, width, height] as fractions of the figure
fig.add_axes([0,0,1,1])

# Import the necessary packages and modules
import matplotlib.pyplot as plt
import numpy as np

# Create a Figure
fig = plt.figure()

# Set up Axes
ax = fig.add_subplot(111)

# Scatter the data
ax.scatter(np.linspace(0, 1, 5), np.linspace(0, 5, 5))

# Show the plot
plt.show()
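The Figure, Axes, Axis, and Artist parts described earlier can be inspected directly on a small plot. The sketch below is only an illustration; the data points and label text are arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; can be dropped inside a notebook
import matplotlib.pyplot as plt
from matplotlib.artist import Artist

fig = plt.figure()          # the canvas
ax = fig.add_subplot(111)   # one Axes (plot) on the canvas
ax.plot([0, 1, 2], [0, 1, 4])

# Each Axes has a title, an x-label and a y-label
ax.set_title("Toy plot")
ax.set_xlabel("x")
ax.set_ylabel("y")

print(type(fig).__name__)          # the Figure
print(len(fig.axes))               # number of Axes on the Figure
print(type(ax.xaxis).__name__)     # one of the Axis objects of the Axes
print(type(ax.lines[0]).__name__)  # the plotted line

# The line and the title text are both Artists
print(isinstance(ax.lines[0], Artist), isinstance(ax.title, Artist))
```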

Notice that add_subplot() poses a small puzzle of its own: the code chunk above calls add_subplot(111).

What does 111 mean?

Well, 111 is equal to 1,1,1, which means that you actually give three arguments to add_subplot(). The three arguments designate the number of rows (1), the number of columns (1) and the plot number (1). So you actually make one subplot.
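The three numbers become more meaningful once the grid holds more than one plot. The sketch below places two Axes side by side on a grid of 1 row and 2 columns, once with separate arguments and once with the compact three-digit form. The plotted values are arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; can be dropped inside a notebook
import matplotlib.pyplot as plt
import numpy as np

fig = plt.figure()

# 1 row, 2 columns, plot number 1 (the left slot)
ax_left = fig.add_subplot(1, 2, 1)

# Same grid, plot number 2 (the right slot), written in the compact form
ax_right = fig.add_subplot(122)

x = np.linspace(0, 1, 5)
ax_left.scatter(x, x)
ax_right.plot(x, x ** 2)

print(len(fig.axes))
```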

6.4 Summary

  • Data can be visualized in Python using functions from the pandas, matplotlib, and seaborn libraries.
  • Common chart types for exploratory analysis include: histograms, box plots, and scatter plots.
  • Bar plots are useful for showing quantities by category.

6.5 Quiz

1. Which chart type is best for showing change over time?

  1. Bar chart
  2. Line chart [CORRECT]
  3. Horizontal bar chart
  4. Box plot
  5. Histogram

Feedback: While bar charts can show change over time, line charts best show the variation in the data.

2. What is wrong with the following code to build a scatter plot:

myspecialdataframe.plot(kind='scatter', x='ad_spend', title='The relationship between advertising spend and revenue')

  1. Nothing
  2. kind=scatter is not a valid chart type
  3. The method call should be myspecialdataframe.scatter() not myspecialdataframe.plot()
  4. Missing the y-axis variable [CORRECT]

Feedback: When building scatter plots, an x value and a y value are required.

6.6 Exercises

6.6.1 Exercise 6.1

6.6.2 Exercise 6.2