# Lesson 6 Visualization and data exploration

Welcome to lesson 6. In this lesson, you will be introduced to basic data exploration and visualization techniques.

Follow along with this tutorial by creating a new ipython notebook named lesson06.ipynb and entering the code snippets presented.

Visualizing data is a useful technique for 1) data exploration and 2) data presentation. These two purposes require different approaches to building data visualizations. Visualizations that we use simply to explore our data (and keep to ourselves) do not require as much attention to aesthetic elements, such as adding well-described axes and a key takeaway, or removing extraneous chart junk (e.g., grid lines, tick marks, and borders).

**Outline**

- Chart types
- Building charts with pandas
- Building charts with matplotlib
- Summary
- Exercise 6.1
- Exercise 6.2
- Assignment 6

## 6.1 Chart types

The type of chart you select to show your data is dictated by the type of data that you are working with. For example, line charts require the use of time-series data such as year, month, day, week, or minute. Without the right data, you may still be able to graph the data, but not in a form that is easy to interpret.

We’ll work with the data and chart types described in the table below.

Data | Chart types |
---|---|
Statistical | histogram, box plot, scatter plot |
Categorical | vertical or horizontal bar (to show rank) |
Time series | line or area (to show proportional change over time) |

There are numerous chart types available in Python through the `pandas`, `matplotlib`, and `seaborn` libraries. In this lesson we’ll focus on using `pandas` to create statistical, categorical, and time series charts, and on the basics of the `matplotlib` library.

## 6.2 Building charts with pandas

Since we’ve been working with the `pandas` library to import and format data, let’s continue to use it to plot our data.

There are two ways to build charts from the pandas DataFrame class: 1) `DataFrame.plot(kind)` and 2) `DataFrame.plot.<kind>`. `DataFrame.plot` is both a callable method and a namespace attribute for specific plotting methods of the form `DataFrame.plot.<kind>`.
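As a minimal sketch of the two call styles, the snippet below draws the same bar chart both ways. The `toy` DataFrame and its values are made up for illustration and are not the MBA data used later in this lesson.

```python
import pandas as pd

# Hypothetical toy data, used only to compare the two call styles
toy = pd.DataFrame({"Tuition": [100, 120, 90]}, index=["A", "B", "C"])

# 1) Callable form: the chart type is passed as the `kind` parameter
ax_callable = toy.plot(kind="bar", title="Callable form")

# 2) Namespace form: the chart type is a method under `.plot`
ax_namespace = toy.plot.bar(title="Namespace form")

# Both calls return a matplotlib Axes holding one bar per row
bars_callable = len(ax_callable.patches)
bars_namespace = len(ax_namespace.patches)
```

Which form to use is largely a matter of taste; the namespace form tends to work better with tab completion in a notebook.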

**1) DataFrame.plot(kind)**

The first way is to call the `.plot` method with the `kind` of chart specified as a parameter.

The format for using the plot method is:

`DataFrame[].plot(kind,….)`

It has several key parameters:

Parameter | Options |
---|---|
kind | `bar`, `barh`, `pie`, `scatter`, etc. |
color | an array of hex codes, applied sequentially to each data series / column |
linestyle | `'solid'`, `'dotted'`, `'dashed'` (applies to line graphs only) |
xlim, ylim | a tuple (lower limit, upper limit) giving the axis range over which the plot is drawn |
legend | a Boolean value to display or hide the legend |
labels | a list with one descriptive name per column in the DataFrame, used in the legend |
title | the string title of the plot |
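To see how several of these parameters combine, here is a small sketch. The `toy` DataFrame is hypothetical, and the example covers only `kind`, `color`, `linestyle`, `ylim`, `legend`, and `title`; the remaining parameters are left out.

```python
import pandas as pd

# Hypothetical toy data: two series measured over four periods
toy = pd.DataFrame({"A": [10, 20, 15, 30], "B": [5, 25, 20, 35]})

ax = toy.plot(
    kind="line",
    color=["#4cbea3", "#524939"],  # one hex code per column, in column order
    linestyle="dashed",            # only meaningful for line charts
    ylim=(0, 40),                  # (lower limit, upper limit)
    legend=True,
    title="Two hypothetical series",
)
```

The call returns a matplotlib Axes object, so the result can be adjusted further after plotting.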

Below is an example of how to use `.plot` with a DataFrame using the template: `DataFrame[].plot(kind,x=….)`

`mydata["Tuition"].plot(kind="box",title="Box plot of tuition of the top 25 MBA programs")`

After the DataFrame named `mydata`, we use square brackets to reference the column to be plotted. Then we use `.plot` to call the method and pass in two parameters: 1) the `kind` of plot and 2) the `title` for the box plot.

**2) DataFrame.plot.<kind>**

The second way to plot in pandas is to follow `.plot` directly with the `kind` of chart to be plotted. Here `.plot` acts as a namespace attribute, and `<kind>` is one of the methods grouped under it. A namespace is just a named grouping of methods associated with an object; grouping them this way ensures they do not conflict with other names that might already be in use.

The format for using the plot method is:

`DataFrame.plot.<kind>()`

where `<kind>` is the type of chart, such as `area` or `bar`. See the table below for the most popular chart types.

Method | Description |
---|---|
`DataFrame.plot.area(self[, x, y])` | Draw a stacked area plot. |
`DataFrame.plot.bar(self[, x, y])` | Vertical bar plot. |
`DataFrame.plot.barh(self[, x, y])` | Make a horizontal bar plot. |
`DataFrame.plot.density(self[, bw_method, ind])` | Generate Kernel Density Estimate plot using Gaussian kernels. |
`DataFrame.plot.hist(self[, by, bins])` | Draw one histogram of the DataFrame’s columns. |
`DataFrame.plot.line(self[, x, y])` | Plot Series or DataFrame as lines. |
`DataFrame.plot.pie(self, **kwargs)` | Generate a pie plot. |
`DataFrame.plot.scatter(self, x, y[, s, c])` | Create a scatter plot with varying marker point size and color. |
`DataFrame.boxplot(self[, column, by, ax, …])` | Make a box plot from DataFrame columns. |
`DataFrame.hist(data[, column, by, grid, …])` | Make a histogram of the DataFrame’s column. |

`mydata.boxplot('Avg_Age_Students')`

Let’s work through a few examples with the MBA data. We need to import the data and rename the columns. Renaming columns is especially important for graphing: spaces and long variable names make code messy and error-prone. When importing the data, set `index_col="School"`. This will be helpful when we visualize our data.

```
import pandas as pd
mydata = pd.read_csv("http://becomingvisual.com/python4data/mba.csv", header=0, index_col ="School") #add header and index
mydata.columns = ['Rank', 'Country', 'Average_salary', 'Pre_Salary', 'Grad_Jobs', 'PhD', 'Avg_Age_Students',
'Avg_Work_Experience', 'Tuition', 'Duration']
```

### 6.2.1 Box plot

Let’s examine the distribution of our data. Box plots are effective at showing measures of central tendency and spread, such as the median and quartiles. To create a box plot of one variable, select the column and call `.plot` with `kind="box"`. The variable must be a numeric data type such as an `int` or `float`.

`mydata['Tuition'].plot(kind="box", title="Box plot of tuition of the top 25 MBA programs")`

The alternative to using `.plot(kind="box")` is to use the pandas `.boxplot` method.

`mydata.boxplot('Tuition')`

To create a box plot of multiple variables, simply pass in a list of variables, such as the average student age and average work experience, as shown below:

`mydata.boxplot(column=['Avg_Age_Students','Avg_Work_Experience'])`

**Formatting with .boxplot()**

The `.boxplot` method accepts formatting options directly, such as suppressing the grid with `grid=False` or changing the font size with `fontsize=14`:

`mydata.boxplot(column=['Avg_Age_Students','Avg_Work_Experience'], grid=False, fontsize=14)`
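To connect the picture to the numbers, the sketch below computes the values that a box plot draws as its box. The `tuition` Series is hypothetical and is not taken from the MBA data.

```python
import pandas as pd

# Hypothetical tuition values, not taken from the MBA data
tuition = pd.Series([10, 20, 30, 40, 50], name="Tuition")

q1 = tuition.quantile(0.25)   # lower edge of the box
median = tuition.median()     # line inside the box
q3 = tuition.quantile(0.75)   # upper edge of the box
iqr = q3 - q1                 # height of the box (interquartile range)

# Draw the matching box plot
ax = tuition.plot(kind="box", title="Hypothetical tuition")
```

For these five values the box spans 20 to 40 with the median line at 30.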

### 6.2.2 Histograms

If we want to plot a simple histogram based on a single column, we can call `.plot` on a column of data and set `kind='hist'`. Notice the `bins=4` parameter. This specifies the number of bars (i.e., bins) that the values should be arranged into.

`mydata['Tuition'].plot(kind='hist', title='Tuition', bins=4)`
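The effect of `bins` is easier to see with numbers small enough to check by hand. The sketch below uses hypothetical values and `numpy.histogram` (an addition here, not part of the lesson's code) to show the bin edges and counts behind a four-bin histogram.

```python
import numpy as np
import pandas as pd

# Hypothetical values, chosen so the bins are easy to check by hand
values = pd.Series([10, 20, 30, 40, 50, 60, 70, 80])

# Split the range 10-80 into 4 equal-width bins and count values in each
counts, edges = np.histogram(values, bins=4)

# The pandas histogram draws one bar per bin
ax = values.plot(kind="hist", bins=4, title="Hypothetical values")
bar_heights = [int(bar.get_height()) for bar in ax.patches]
```

Here each bin is 17.5 units wide and holds two values. Changing `bins` changes both the edges and the counts, so it is worth trying a few settings when exploring a new column.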

### 6.2.3 Scatter plots

Is there a relationship between `Rank` and the percentage of graduates that have jobs? We can show the answer to this question using a scatter plot.

`mydata.plot(kind='scatter', x='Rank', y='Grad_Jobs', title='MBA Program Rank vs Percentage of graduates with jobs', xlim=0)`
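A scatter plot shows the relationship visually; a correlation coefficient is one way to put a number on it. The sketch below uses made-up ranks and job percentages, so the result says nothing about the real MBA data.

```python
import pandas as pd

# Hypothetical ranks and job-placement percentages, for illustration only
toy = pd.DataFrame({
    "Rank": [1, 2, 3, 4, 5],
    "Grad_Jobs": [95, 93, 94, 90, 88],
})

ax = toy.plot(kind="scatter", x="Rank", y="Grad_Jobs")

# Pearson correlation: values near -1 or 1 suggest a strong linear relationship
r = toy["Rank"].corr(toy["Grad_Jobs"])
```

In this toy data `r` is negative, meaning the job percentage tends to fall as the rank number increases.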

### 6.2.4 Bar charts

Let’s continue to use the same MBA ranking data set to explore the plotting capabilities of pandas `plot()`.

We’ll start with a basic bar chart. What should we show? Well, upon examination of the data, perhaps we just want to show the tuition for each of the top 5 schools plotted side by side. This allows for easy comparison, and we’ll easily be able to see which of the top 5 schools are the least and most expensive (in USD).

```
# out of the box - sorts alphabetically
mydata.plot(kind='barh', y='Tuition', title='Total tuition(USD) for the top MBA programs')
```

Refine the bar chart with a few aesthetic elements.

- Importing `rcParams` from `matplotlib` lets you call its `.update` method to turn on the autolayout feature for figures. This helps prevent labels from being cut off.
- The `legend` parameter is set to `False` to remove the redundant legend.
- The `color` parameter can set the color using RGBA values: you provide the amounts of red, green, and blue plus the transparency (alpha), each between 0 and 1, and get back a color.
- The `edgecolor` parameter allows you to set the border color of the bars.

```
from matplotlib import rcParams
rcParams.update({'figure.autolayout': True}) # keeps labels from being cut off
mydata.head().plot(kind='barh', y='Tuition',title='Total tuition(USD) for the top 5 ranked MBA programs', legend=False, color=(0.2, 0.4, 0.6, 0.6), edgecolor='white')
```
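If the goal is to spot the least and most expensive schools quickly, sorting before plotting can help. The sketch below assumes a hypothetical `toy` DataFrame with made-up school names and tuition values.

```python
import pandas as pd

# Hypothetical schools and tuition values, not the real MBA data
toy = pd.DataFrame(
    {"Tuition": [120, 95, 140, 110, 100]},
    index=["School A", "School B", "School C", "School D", "School E"],
)

# barh draws the first row at the bottom, so an ascending sort
# puts the most expensive school at the top of the chart
top5_sorted = toy.head().sort_values("Tuition")
ax = top5_sorted.plot(kind="barh", y="Tuition", legend=False)

bar_widths = [bar.get_width() for bar in ax.patches]
```

The same `sort_values` call could be placed before `.plot` on the real data, at the cost of losing the original rank order on the axis.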

### 6.2.5 Line charts

Line charts are the most common way to visualize time series data. Let’s say we wanted to see the change in the price of three securities from May 1 to May 31: Amazon, Google, and Facebook.

Download the data from: http://becomingvisual.com/python4data/timeseries_stockprice.csv

Let’s begin by importing the data.

`stock = pd.read_csv("http://becomingvisual.com/python4data/timeseries_stockprice.csv", header=0) #add header`

Next, view the data to understand its structure and variable names.

`stock`

```
Date Amazon Google Facebook
0 5/1/18 927.800 901.94 151.740
1 5/2/18 946.645 909.62 153.340
2 5/3/18 946.000 914.86 153.600
3 5/4/18 944.750 926.07 150.170
4 5/7/18 940.520 933.54 151.450
5 5/8/18 940.950 926.12 150.710
6 5/9/18 952.800 936.95 151.490
7 5/10/18 953.500 931.98 150.230
8 5/11/18 945.110 925.32 150.310
9 5/14/18 954.500 931.53 150.400
10 5/15/18 958.730 932.95 150.170
11 5/16/18 961.000 940.00 150.110
12 5/17/18 954.700 935.67 148.000
13 5/18/18 944.800 921.00 144.720
14 5/21/18 962.840 931.47 148.445
15 5/22/18 964.000 935.00 148.080
16 5/23/18 975.020 947.92 148.520
17 5/24/18 976.000 952.98 148.510
18 5/25/18 984.850 957.33 150.300
19 5/29/18 995.000 969.70 152.230
20 5/30/18 996.510 970.31 151.970
21 5/31/18 1000.000 975.02 152.700
```

**Plotting a single series**

To plot a single series, such as the closing stock price for Amazon for each date, you can simply designate the x-axis values as `Date` and the y-axis values as `Amazon`, and set the `kind` parameter to `line`.

`stock.plot(kind="line", x="Date", y="Amazon")`

The alternative way to plot a line is shown below:

`stock.plot.line(x="Date", y="Amazon")`

**Plotting two series**

When plotting two series, set the y-axis values to a list containing the two column names, each being a data series.

`stock.plot(kind="line", x="Date", y=["Amazon", "Google"])`

**Plotting three series**

When plotting three series, set the y-axis values to a list containing the three column names, each being a data series.

`stock.plot(kind="line", x="Date", y=["Amazon", "Google", "Facebook"])`
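One optional refinement: `read_csv` loads the `Date` column as plain text unless told otherwise, so the x-axis treats each date as a label rather than a point in time. The sketch below converts the column with `pd.to_datetime` before plotting. It rebuilds the first three rows of the table above by hand so that it runs without downloading the file; the name `stock_sample` is introduced here for illustration.

```python
import pandas as pd

# First three rows of the stock data shown above
stock_sample = pd.DataFrame({
    "Date": ["5/1/18", "5/2/18", "5/3/18"],
    "Amazon": [927.800, 946.645, 946.000],
    "Google": [901.94, 909.62, 914.86],
})

# Convert the text dates (month/day/two-digit year) to datetime values
stock_sample["Date"] = pd.to_datetime(stock_sample["Date"], format="%m/%d/%y")

# Plot two series against the parsed dates
ax = stock_sample.plot(kind="line", x="Date", y=["Amazon", "Google"])
```

With real dates on the x-axis, gaps such as weekends are spaced according to the calendar instead of being drawn as evenly spaced points.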

## 6.3 Building charts with matplotlib

Matplotlib is a Python data visualization library. One of its modules is `pyplot`, so you will often see `matplotlib.pyplot` in code. The module provides an interface that allows you to implicitly and automatically create figures and axes to achieve the desired plot (Govani 2019).

The way you build charts in `matplotlib` differs from `pandas`. A `matplotlib` figure can be broken down into several parts, as described below:

Figure: This is a whole figure which may contain one or more than one axes (plots). You can think of a Figure as a canvas which contains plots.

Axes: This is what we generally think of as a plot. A Figure can contain many Axes. It contains two or three (in the case of 3D) Axis objects. Each Axes has a title, an x-label and a y-label.

Axis: These are the number-line-like objects that take care of generating the graph limits.

Artist: Everything that one can see on the figure is an Artist, such as Text objects, Line2D objects, and collection objects. Most Artists are tied to an Axes.
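These parts can be inspected directly in code. The sketch below builds an empty figure with one plot and checks how the objects relate to each other; the plotted values are arbitrary, and the `figure()` and `add_subplot()` calls are explained in the steps that follow.

```python
import matplotlib.pyplot as plt
from matplotlib.artist import Artist
from matplotlib.axis import Axis

fig = plt.figure()                        # Figure: the canvas
ax = fig.add_subplot(111)                 # Axes: one plot on that canvas
(line,) = ax.plot([1, 2, 3], [2, 4, 8])   # a Line2D Artist drawn on the Axes

axes_in_figure = len(fig.axes)
has_two_axis_objects = isinstance(ax.xaxis, Axis) and isinstance(ax.yaxis, Axis)
line_is_artist = isinstance(line, Artist)
line_belongs_to_axes = line.axes is ax
```

Note the spelling: an Axes is the whole plot, while an Axis is one of its number lines (`ax.xaxis` or `ax.yaxis`).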

**STEP 1** First, import the `pyplot` module from `matplotlib`.

**STEP 2** Next, call the `figure()` function to construct (initialize) the figure. You can think of a figure as a canvas to which you add your plots.

**STEP 3** Next, draw (or add) the axes on the figure.

The `add_subplot()` method can be puzzling at first, because it is called as `add_subplot(111)`. What does 111 mean? Well, 111 is equal to 1,1,1, which means that you actually give three arguments to `add_subplot()`. The three arguments designate the number of rows (1), the number of columns (1), and the plot number (1). So you actually make one subplot.

**STEP 4** Plot the chart. In the example below, we use the `.scatter()` method and pass in our x and y values.

**STEP 5** Call the `.show()` function to show the plot.

```
# 1. Import the necessary packages and modules
import matplotlib.pyplot as plt
# 2. Create a Figure
fig = plt.figure()
# 3. Set up Axes
ax = fig.add_subplot(111)
# 4. Scatter the data
ax.scatter(mydata["Rank"], mydata["Grad_Jobs"])
# 5. Show the plot
plt.show()
```
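The three-number argument from Step 3 becomes more useful once a figure holds more than one plot. As a minimal sketch with made-up values, the snippet below places two scatter plots side by side in a grid of 1 row and 2 columns.

```python
import matplotlib.pyplot as plt

fig = plt.figure()

# 1 row, 2 columns: plot number 1 is the left panel, 2 is the right panel
ax_left = fig.add_subplot(1, 2, 1)
ax_right = fig.add_subplot(122)   # shorthand for (1, 2, 2)

# Hypothetical values, for illustration only
ax_left.scatter([1, 2, 3], [3, 2, 1])
ax_right.scatter([1, 2, 3], [1, 2, 3])

n_axes = len(fig.axes)
```

Both spellings, `add_subplot(1, 2, 1)` and `add_subplot(121)`, refer to the same position; the comma-separated form is easier to read.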

**Customizing matplotlib charts**

In addition to the steps above, you have the ability to customize your charts.

```
# Import the necessary packages and modules
import matplotlib.pyplot as plt
# Create a Figure
fig = plt.figure()
# Add a source note below the graph
fig.text(x=0, y=0, s='Source: Financial Times (2012)', horizontalalignment='left', color='#524939')
# Set up Axes
ax = fig.add_subplot(111, frameon=False)
#axis_bgcolor='#FFFFFF'
# Set x and y limits
ax.set_xlim(0, 30)
ax.set_ylim(80, 100)
# Add labels for x and y axes and a title
ax.set_xlabel("Rank")
ax.set_ylabel("Job Rate")
ax.set_title("Relationship between MBA Program Ranking and Job Rate (%)")
# Remove ticks
ax.tick_params(axis='both', which='both', length=0)
# Scatter the data
ax.scatter(mydata["Rank"], mydata["Grad_Jobs"], color="#4cbea3")
# Show the plot
plt.show()
```

**Matplotlib chart types**

The full list of chart types available through the `matplotlib` graphing library can be found at: https://matplotlib.org/gallery/index.html

## 6.4 Summary

- Data can be visualized in Python using functions from the `pandas`, `matplotlib`, and `seaborn` libraries.
- Common chart types for exploratory analysis include histograms, box plots, and scatter plots.
- Bar plots are useful for showing quantities by category.

## Exercise 6.1

Using the Big Mac data from July 2019, create a barh chart of the top 10 countries based on GDP.

Create another barh with `dollar_price` and `dollar_ppp`.

## Exercise 6.2

Load the Economist TV show data into a data frame called `tv`. The data lists TV shows that aired within the last 10 years with their corresponding season number. The data can be found at: http://becomingvisual.com/python4data/tv.csv

Create a scatter plot displaying the relationship between average rating(x) and season number(y).

Remove all shows that have been on for less than 7 seasons and re-plot the chart.

Recreate this same chart using matplotlib.

## Assignment 6

- Using the Economist TV data, create a new data frame that pulls in the historical ratings of Law and Order Special Victims Unit.

Download the data from http://becomingvisual.com/pythondata/tv.csv

Create a line chart of the data, add a title and change the color of the line.

What happened in 2015? Pull in the ratings of all Crime, Drama, Mystery genres for this year. How does Law & Order stack up in comparison? (hint: find the mean rating)

### References

Govani, Killol. 2019. “Matplotlib Tutorial: Learn Basics of Python’s Powerful Plotting Library.” https://towardsdatascience.com/matplotlib-tutorial-learn-basics-of-pythons-powerful-plotting-library-b5d1b8f67596.