Lesson 4 Objects and Data Structures II

Welcome to lesson 4. In this lesson, you will be introduced to the fundamentals of object-oriented programming. We'll discuss Python's standard library, which comes pre-installed, and a separately installed library called pandas. Next, you'll learn how to import and view external data sets.

Follow along with this tutorial by creating a new ipython notebook named lesson04.ipynb and entering the code snippets presented.

Outline

  • Object-oriented programming
  • Libraries
  • Pandas
  • DataFrames
  • Summary
  • Exercise 4.1
  • Exercise 4.2
  • Assignment 4

4.1 Object oriented programming

Python is an object-oriented (OO) programming language. The Python code we have written up to this point has been procedural, or top-down, programming. Essentially, we've written short programs in one file (notebook) that achieve a task (or two) in a step-by-step sequence.

In writing those programs, we have referenced and created many objects. We've been able to modify the attributes of those objects by using methods or functions. For example, we've created variables of type str. Strings are a type of object (and data type).

For example, myname = "Kristen" creates a str object with the value "Kristen" and assigns it to the variable named myname. You can then call methods on the object, such as myname.upper(). Assigning the string to the variable created, or instantiated, the string object. Once an object is instantiated (from the str class), you can use the methods of that class to work with it. Note that strings cannot be changed in place: myname.upper() returns a new, uppercase string and leaves myname as it was.
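The string example can be tried directly. This is a minimal sketch; the variable name shouted is made up for illustration.

```python
# Assigning a string literal creates (instantiates) a str object.
myname = "Kristen"
print(type(myname))   # <class 'str'>

# upper() is a method of the str class. It returns a new string
# and leaves the original object unchanged.
shouted = myname.upper()   # "shouted" is a hypothetical name

print(shouted)   # KRISTEN
print(myname)    # Kristen
```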

To ground your understanding, review these four terms (TK (2017), para 1). We'll begin to use them more frequently throughout the course.

Terminology

  1. Objects: a representation of the real world objects like books, cars, cats, etc. The objects share two main characteristics: data and behavior.

  2. Attributes: the data or information about the object. For example, cars have data like number of wheels, number of doors, seating capacity. An attribute is a value(characteristic). Think of an attribute as a variable that is stored within an object.

  3. Methods: the behavior of an object. For example, cars accelerate, stop, and show how much fuel is missing. In this course, we've been using the term function. For our purposes, let's assume methods are synonymous with functions.

  4. Class: a blueprint for an object that defines its attributes and behaviors (methods). An object is an instance of a class. In the real world we often find many objects of the same type, like cars of the same make and model (each has an engine, wheels, doors, etc.). Each car was built from the same set of blueprints and has the same components.
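To make the four terms concrete, here is a small sketch of a class written by hand. The Car class, its attribute names, and the values are all invented for illustration; they do not come from any library.

```python
class Car:
    """A blueprint (class) for car objects."""

    def __init__(self, make, doors, fuel_litres):
        # Attributes: data stored inside each object
        self.make = make
        self.doors = doors
        self.fuel_litres = fuel_litres

    def drive(self, litres_used):
        # Method: behavior that updates the object's data
        self.fuel_litres = max(0, self.fuel_litres - litres_used)
        return self.fuel_litres


# Two objects (instances) built from the same blueprint
car_a = Car("Hatchback", 4, 40)
car_b = Car("Coupe", 2, 50)

car_a.drive(15)
print(car_a.fuel_litres)  # 25
print(car_b.fuel_litres)  # 50 (car_b is a separate object)
```

Each object keeps its own copy of the attributes, so driving car_a does not change car_b.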

As you progress in your learning of python, you will build your own classes that define the attributes and behaviors of objects. For now, let's use classes from python libraries that already exist to build objects.

4.2 Libraries

"The Python library contains several different kinds of components. It contains data types that would normally be considered part of the core of a language, such as numbers and lists. For these types, the Python language core defines the form of literals and places some constraints on their semantics, but does not fully define the semantics. (On the other hand, the language core does define syntactic properties like the spelling and priorities of operators.) The library also contains built-in functions and exceptions — objects that can be used by all Python code without the need of an import statement" (Python Software Foundation, 2019, para 1.)

In other words, in python there is a set of built-in functions. You can find the full list here: https://docs.python.org/3/library/functions.html

Check out the python standard library: https://docs.python.org/3/library/.

However, Python is a growing language with many contributors who create libraries of new classes and functions that go beyond the built-in functionality. In addition to the libraries that come pre-installed, there are additional libraries that can be installed and used. To use any library, whether it is part of the standard library or installed separately, you must reference it explicitly using the import statement.
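The difference between built-in functions and library code can be seen in a few lines. This sketch uses the math module from the standard library; the list of numbers is made up.

```python
# Built-in functions are always available; no import is needed.
prices = [3, 8, 5]   # illustrative values
print(len(prices))   # 3
print(max(prices))   # 8

# The math module ships with Python, but it must be imported before use.
import math

print(math.sqrt(16))  # 4.0
```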

4.3 Pandas

A library that we will use extensively in this course is called pandas. It is an open source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. pandas is already installed on Google Colab. To reference it, we need to use the import statement to import the pandas library. We import it as pd, so pd becomes the short name (alias) that we use to refer to the pandas library.

Try it.

import pandas as pd

You can use any name you wish, within the rules of python. For example, you could write:

import pandas as tomatoes

The name you choose refers to the pandas library itself (a module object). You use that name to access the classes and functions that pandas provides. By convention, almost all pandas code uses pd.
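A quick way to check what the alias refers to is sketched below. It assumes pandas is installed, as it is on Google Colab.

```python
import pandas as pd
import pandas as tomatoes   # any valid name works as an alias

# Both names point to the same module object.
print(type(pd).__name__)    # module
print(pd is tomatoes)       # True
```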

Through pandas, you get acquainted with your data by cleaning, transforming, and analyzing it. For example, say you want to explore a data set stored in a .csv file on your computer. CSV stands for comma separated values. Pandas will extract the data from that .csv into a DataFrame — a table, basically — then let you do things like:

  • Calculate statistics and answer questions about the data, such as:
      • What's the average, median, max, or min of each column?
      • Does column A correlate with column B?
      • What does the distribution of data in column C look like?
  • Clean the data by doing things like removing missing values and filtering rows or columns by some criteria
  • Visualize the data with help from Matplotlib. Plot bars, lines, histograms, bubbles, and more.
  • Store the cleaned, transformed data back into a CSV, other file or database (https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/, para 7)
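A few of the tasks in the list above can be sketched on a tiny table built by hand. The DataFrame named toy and all of its values are invented for illustration; they are not from the course data.

```python
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [10.0, None, 30.0, 50.0],   # None becomes a missing value
})

print(toy["a"].mean())            # 2.5
print(toy["a"].median())          # 2.5
print(toy["a"].corr(toy["b"]))    # close to 1, because b is always 2 * a

cleaned = toy.dropna()            # remove rows with missing values
print(cleaned.shape)              # (3, 3)

filtered = cleaned[cleaned["a"] > 1]   # keep rows that meet a condition
print(len(filtered))                   # 2
```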

Check out the pandas API documentation that provides a reference and explanation of all the public pandas objects, functions and methods. https://pandas.pydata.org/pandas-docs/stable/reference/index.html

4.4 DataFrames

A DataFrame is a two dimensional tabular, column-oriented data structure with both row and column labels (McKinney, p.4) that comes from the pandas library.

In Python, we can import and store data as a DataFrame object. We can read a .csv file into Python using the read_csv() function that comes from the pandas library.

Since we imported pandas as pd, we can call read_csv() through pd. The read_csv() function accepts many optional parameters; only one is required, which is the location of the csv file (i.e. the file path). We assign the DataFrame that it returns to the variable mydata.

See the example below.

DO NOT TRY THIS, YET...

import pandas as pd
mydata = pd.read_csv("mba.csv") #basic import
type(mydata)
<class 'pandas.core.frame.DataFrame'>
mydata
    Rank                         School  ... Total Tuition ($)  Duration (Months)
0      1                Chicago (Booth)  ...            106800                 21
1      2               Dartmouth (Tuck)  ...            106980                 21
2      3              Virginia (Darden)  ...            107800                 21
3      4                        Harvard  ...            107000                 18
4      5                       Columbia  ...            111736                 20
5      6  California At Berkeley (Haas)  ...            106792                 21
6      7                    MIT (Sloan)  ...            116400                 22
7      8                       Stanford  ...            114600                 21
8      9                           IESE  ...             95610                 19
9     10                            IMD  ...             67416                 11
10    11               New York (Stern)  ...             96640                 20
11    12                         London  ...             92144                 15
12    13         Pennsylvania (Wharton)  ...            107852                 21
13    14                      HEC Paris  ...             66802                 16
14    15              Cornell (Johnson)  ...            107592                 21
15    16                York (Schulich)  ...             61800                  8
16    17       Carnegie Mellon (Tepper)  ...            108272                 21
17    18                          ESADE  ...             81693                 12
18    19                         INSEAD  ...             80719                 10
19    20         Northwestern (Kellogg)  ...            113100                 22
20    21               Emory (Goizueta)  ...             87200                 22
21    22                             IE  ...             82389                 13
22    23                UCLA (Anderson)  ...            105160                 21
23    24                Michigan (Ross)  ...            105500                 20
24    25                           Bath  ...             36057                 12

[25 rows x 11 columns]

You can read the full read_csv() documentation here: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-read-csv-table

With all data files, it's important to understand how the data is organized. All variables should be columns and all observations should be stored as rows. Usually, csv files have a header: a row that contains the column names. This is typically the first row of the file. By default, read_csv() already treats the first row as the header. To make this explicit, you can specify header=0 (since Python starts counting at 0) to indicate that the first row contains the variable names.

import pandas as pd
mydata = pd.read_csv("mba.csv", header=0) #add header
mydata
    Rank                         School  ... Total Tuition ($)  Duration (Months)
0      1                Chicago (Booth)  ...            106800                 21
1      2               Dartmouth (Tuck)  ...            106980                 21
2      3              Virginia (Darden)  ...            107800                 21
3      4                        Harvard  ...            107000                 18
4      5                       Columbia  ...            111736                 20
5      6  California At Berkeley (Haas)  ...            106792                 21
6      7                    MIT (Sloan)  ...            116400                 22
7      8                       Stanford  ...            114600                 21
8      9                           IESE  ...             95610                 19
9     10                            IMD  ...             67416                 11
10    11               New York (Stern)  ...             96640                 20
11    12                         London  ...             92144                 15
12    13         Pennsylvania (Wharton)  ...            107852                 21
13    14                      HEC Paris  ...             66802                 16
14    15              Cornell (Johnson)  ...            107592                 21
15    16                York (Schulich)  ...             61800                  8
16    17       Carnegie Mellon (Tepper)  ...            108272                 21
17    18                          ESADE  ...             81693                 12
18    19                         INSEAD  ...             80719                 10
19    20         Northwestern (Kellogg)  ...            113100                 22
20    21               Emory (Goizueta)  ...             87200                 22
21    22                             IE  ...             82389                 13
22    23                UCLA (Anderson)  ...            105160                 21
23    24                Michigan (Ross)  ...            105500                 20
24    25                           Bath  ...             36057                 12

[25 rows x 11 columns]
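To see what the header parameter changes without downloading a file, the sketch below reads a tiny CSV from a string. The CSV text is invented for illustration, and io.StringIO is used only so that the string can stand in for a file.

```python
import io
import pandas as pd

# A tiny CSV written inline; the contents are made up.
csv_text = "name,score\nAda,90\nLin,85\n"

# header=0: the first line supplies the column names
with_header = pd.read_csv(io.StringIO(csv_text), header=0)
print(list(with_header.columns))  # ['name', 'score']
print(with_header.shape)          # (2, 2)

# header=None: every line is treated as data and columns are numbered
no_header = pd.read_csv(io.StringIO(csv_text), header=None)
print(list(no_header.columns))    # [0, 1]
print(no_header.shape)            # (3, 2)
```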

Using Google Colab to import data in python

Before attempting to replicate the code above, you will need to download the mba.csv file from (http://becomingvisual.com/python4data/mba.csv) to your computer.

Next, use one of the options below to read in the mba.csv data.

Option 1: Read in file directly from a URL

import pandas as pd
mydata = pd.read_csv("http://becomingvisual.com/python4data/mba.csv", header=0)

Option 2: Upload a file and then read it in

##Code to allow you to directly upload your .csv
from google.colab import files
uploaded = files.upload()

##Modify read_csv to reference the .csv uploaded above
import pandas as pd
import io
mydata = pd.read_csv(io.BytesIO(uploaded['mba.csv']), header=0)

Viewing your data

Once you have imported your data, take a look at it. You can see a preview by typing the name of your DataFrame, mydata.

mydata
    Rank                         School  ... Total Tuition ($)  Duration (Months)
0      1                Chicago (Booth)  ...            106800                 21
1      2               Dartmouth (Tuck)  ...            106980                 21
2      3              Virginia (Darden)  ...            107800                 21
3      4                        Harvard  ...            107000                 18
4      5                       Columbia  ...            111736                 20
5      6  California At Berkeley (Haas)  ...            106792                 21
6      7                    MIT (Sloan)  ...            116400                 22
7      8                       Stanford  ...            114600                 21
8      9                           IESE  ...             95610                 19
9     10                            IMD  ...             67416                 11
10    11               New York (Stern)  ...             96640                 20
11    12                         London  ...             92144                 15
12    13         Pennsylvania (Wharton)  ...            107852                 21
13    14                      HEC Paris  ...             66802                 16
14    15              Cornell (Johnson)  ...            107592                 21
15    16                York (Schulich)  ...             61800                  8
16    17       Carnegie Mellon (Tepper)  ...            108272                 21
17    18                          ESADE  ...             81693                 12
18    19                         INSEAD  ...             80719                 10
19    20         Northwestern (Kellogg)  ...            113100                 22
20    21               Emory (Goizueta)  ...             87200                 22
21    22                             IE  ...             82389                 13
22    23                UCLA (Anderson)  ...            105160                 21
23    24                Michigan (Ross)  ...            105500                 20
24    25                           Bath  ...             36057                 12

[25 rows x 11 columns]

Alternatively, you can use methods such as .head() or .tail() to show the beginning and end of the data, respectively.

The .head() method by default outputs the first five rows of your DataFrame. You call a method by writing the mydata object followed by a period and then the method name. See below.

mydata.head() # outputs the first five rows
   Rank             School  ... Total Tuition ($)  Duration (Months)
0     1    Chicago (Booth)  ...            106800                 21
1     2   Dartmouth (Tuck)  ...            106980                 21
2     3  Virginia (Darden)  ...            107800                 21
3     4            Harvard  ...            107000                 18
4     5           Columbia  ...            111736                 20

[5 rows x 11 columns]

.head() outputs the first five rows of your DataFrame by default, but you can also pass a number: mydata.head(10) outputs the first ten rows, for example.

mydata.head(10) # outputs the first ten rows
   Rank                         School  ... Total Tuition ($)  Duration (Months)
0     1                Chicago (Booth)  ...            106800                 21
1     2               Dartmouth (Tuck)  ...            106980                 21
2     3              Virginia (Darden)  ...            107800                 21
3     4                        Harvard  ...            107000                 18
4     5                       Columbia  ...            111736                 20
5     6  California At Berkeley (Haas)  ...            106792                 21
6     7                    MIT (Sloan)  ...            116400                 22
7     8                       Stanford  ...            114600                 21
8     9                           IESE  ...             95610                 19
9    10                            IMD  ...             67416                 11

[10 rows x 11 columns]

To see the last five rows, use .tail(). The .tail() method also accepts a number; in this case we print the last three rows.

mydata.tail(3) # outputs the last three rows
    Rank           School  ... Total Tuition ($)  Duration (Months)
22    23  UCLA (Anderson)  ...            105160                 21
23    24  Michigan (Ross)  ...            105500                 20
24    25             Bath  ...             36057                 12

[3 rows x 11 columns]

Getting info about your data

The .info() function provides the essential details about your data set, such as the number of rows and columns, the number of non-null values, what type of data is in each column, and how much memory your DataFrame is using.

mydata.info() 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 11 columns):
Rank                                   25 non-null int64
School                                 25 non-null object
Country                                25 non-null object
AvgSalary                              25 non-null int64
PreSalary                              25 non-null int64
GradJobs                               25 non-null int64
PhD                                    25 non-null int64
Avg. Age of Students                   25 non-null int64
Avg. work exp. of students (Months)    25 non-null int64
Total Tuition ($)                      25 non-null int64
Duration (Months)                      25 non-null int64
dtypes: int64(9), object(2)
memory usage: 2.2+ KB

A fast and useful attribute is .shape, which outputs just a tuple that shows the number of rows and columns:

mydata.shape
(25, 11)

Notice that we use .shape without closing parentheses. This is because we are accessing an attribute of our DataFrame object. As a reminder, an attribute is a value(characteristic). Think of an attribute as a variable that is stored within an object.
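The contrast between an attribute and a method can be sketched on a small hand-made DataFrame. The name toy and its values are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})  # made-up values

# Attribute: no parentheses. The tuple can be unpacked into two variables.
rows, cols = toy.shape
print(f"There are {rows} rows and {cols} columns.")

# Method: parentheses, with optional arguments inside them.
first_rows = toy.head(2)
print(len(first_rows))   # 2
```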

Accessing and renaming columns

Many times data sets will have verbose column names with symbols, upper and lowercase words, spaces, and typos. To make selecting data by column name easier we can spend a little time cleaning up their names.

The .columns attribute comes in handy if you want to see the column names, or rename columns by copying and pasting the existing names.

Here's how to print the column names of our data set and then rename the columns by assigning a list of new names to .columns. The list must contain one name for each column, in order.

mydata.columns # show original column names
Index(['Rank', 'School', 'Country', 'AvgSalary', 'PreSalary', 'GradJobs',
       'PhD', 'Avg. Age of Students', 'Avg. work exp. of students (Months)',
       'Total Tuition ($)', 'Duration (Months)'],
      dtype='object')
mydata.columns = ['Rank', 'School', 'Country', 'Average_salary', 'Pre_Salary', 'Grad_Jobs', 'PhD', 'Avg_Age_Students', 
                     'Avg_Work_Experience', 'Tuition', 'Duration']

What if we wanted to make all the column names lowercase? We could repeat what we did above, but there's an easier way. Instead of renaming each column manually, we can use a list comprehension (a more advanced concept, but try the code below anyway):

mydata.columns = [col.lower() for col in mydata]
mydata.columns
Index(['rank', 'school', 'country', 'average_salary', 'pre_salary',
       'grad_jobs', 'phd', 'avg_age_students', 'avg_work_experience',
       'tuition', 'duration'],
      dtype='object')

The code [col.lower() for col in mydata] is quite simple. The variable col takes on each column name, which is a string, in turn. You can call it pineapples; the name doesn't matter. For each name we call the .lower() method from the str class. The for clause iterates over the column names (looping over a DataFrame yields its column names), and the lowercase results are collected into a new list.

Aside: List comprehensions create a new list from another iterable. They consist of square brackets containing the expression to be evaluated for each element, followed by a for clause that iterates over the elements. The square brackets signify that the output is a list.
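A list comprehension is a compact way of writing a for loop that builds a list. The sketch below shows both forms side by side on a plain list of names, so it runs without any data file.

```python
names = ["Rank", "School", "Country"]

# Written as a regular for loop
lowered_loop = []
for col in names:
    lowered_loop.append(col.lower())

# The same logic as a list comprehension
lowered = [col.lower() for col in names]

print(lowered)                   # ['rank', 'school', 'country']
print(lowered == lowered_loop)   # True
```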

Accessing and renaming rows

Now that we looked at how to access and rename columns, let's look at the rows in our data.

To see the names of the rows we look at the index. The index, by default, begins at zero. Therefore the first row index will be 0.

mydata.index.values will return the names of your index (or rows). Let's try it.


mydata.index #returns the range of index values
RangeIndex(start=0, stop=25, step=1)
mydata.index.values # returns the index values
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24])

To rename the index, you can 1) set another column, such as Rank, to be the index when you read in the data, or 2) manually assign a list of equal length to the index using .index.

Setting another column as the index when you read in data


# Set the index to Rank. Note: since we re-imported our data, the column names
# are back in mixed case lettering, not lowercase.
mydata = pd.read_csv("http://becomingvisual.com/python4data/mba.csv", header=0, index_col="Rank")

#change to lowercase
mydata.columns = [col.lower() for col in mydata]
print(mydata.index.values)
[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25]

You can also assign an existing column to the index after reading in the data:

mydata = pd.read_csv("http://becomingvisual.com/python4data/mba.csv", header=0)
#change to lowercase
mydata.columns = [col.lower() for col in mydata]

mydata.index =mydata["rank"]

print(mydata.index.values)
[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25]
print(mydata.loc[1,:]) #prints the first row and all columns
rank                                                 1
school                                 Chicago (Booth)
country                                             US
avgsalary                                       113217
presalary                                           63
gradjobs                                            93
phd                                                 96
avg. age of students                                27
avg. work exp. of students (months)                 60
total tuition ($)                               106800
duration (months)                                   21
Name: 1, dtype: object

Updating the index using .index.

mydata = pd.read_csv("http://becomingvisual.com/python4data/mba.csv", header=0)
#change to lowercase
mydata.columns = [col.lower() for col in mydata]
mydata.index = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25]
mydata.index.values
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25])
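pandas also provides a set_index() method for the same purpose. The sketch below is an alternative to assigning to .index, shown on a small made-up DataFrame named toy. Unlike the assignment above, set_index() moves the column into the index by default, so it no longer appears among the columns.

```python
import pandas as pd

# Invented values for illustration only.
toy = pd.DataFrame({"rank": [1, 2, 3], "school": ["A", "B", "C"]})

# set_index() returns a new DataFrame; toy itself is unchanged.
indexed = toy.set_index("rank")

print(list(indexed.index))      # [1, 2, 3]
print(list(indexed.columns))    # ['school']
print(indexed.index.is_unique)  # True
```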

Note: The index value for each row should be unique.

DataFrame slicing by column

Just as we did with Lists in lesson 3, we can extract subsets of a DataFrame.

You can extract a single column from mydata by using square brackets like this:

school_col = mydata['school'] #use lowercase 

When we extract columns using the square brackets the result is a new data structure called a Series.

type(school_col)
<class 'pandas.core.series.Series'>

"A Series is a one-dimensional array-like object containing an array of data (of NumPy data type) and associated array of data labels, called its index" (McKinney, 2013, p. 112). Series are similar to Lists.

To extract a column as a DataFrame, you need to pass a list of column names. In our case that's just a single column:

school_col = mydata[['school']] 
type(school_col)
<class 'pandas.core.frame.DataFrame'>

To extract more than one column, simply add it to the list of columns to extract.

subset_mydata = mydata[['school', 'rank']]
subset_mydata.head()
              school  rank
1    Chicago (Booth)     1
2   Dartmouth (Tuck)     2
3  Virginia (Darden)     3
4            Harvard     4
5           Columbia     5
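The effect of single versus double square brackets can be checked on a small example. This is a sketch with a made-up DataFrame named toy; the same pattern applies to mydata.

```python
import pandas as pd

# Invented values for illustration only.
toy = pd.DataFrame({"school": ["A", "B", "C"], "rank": [1, 2, 3]})

one_col_series = toy["school"]       # single brackets: a Series
one_col_frame = toy[["school"]]      # a list with one name: a DataFrame
two_cols = toy[["school", "rank"]]   # a list with two names: a DataFrame

print(type(one_col_series).__name__)  # Series
print(type(one_col_frame).__name__)   # DataFrame
print(two_cols.shape)                 # (3, 2)
print(one_col_series.tolist())        # ['A', 'B', 'C']
```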

DataFrame extracting by row

For rows, we have two options:

  • .loc - locates rows by label (the index value, which is the same as the position 0, 1, 2, and so on unless you assign your own index)
  • .iloc - locates rows by integer position, counting from 0

Let's try using .loc. Because we set the index to start at 1, the label 10 refers to the school ranked 10th.

imd = mydata.loc[10]
imd
rank                                       10
school                                    IMD
country                                Switz.
avgsalary                              145264
presalary                                  77
gradjobs                                   95
phd                                       100
avg. age of students                       31
avg. work exp. of students (months)        84
total tuition ($)                       67416
duration (months)                          11
Name: 10, dtype: object

With .iloc, positions start at 0. To extract the 11th row (the school ranked 11th), we use 10 as the position.

nyu = mydata.iloc[10]
nyu
rank                                                 11
school                                 New York (Stern)
country                                              US
avgsalary                                        105798
presalary                                            49
gradjobs                                             93
phd                                                 100
avg. age of students                                 27
avg. work exp. of students (months)                  55
total tuition ($)                                 96640
duration (months)                                    20
Name: 11, dtype: object

To select rows 3 (position 2) through 5 (position 4), use the colon. Note that, just as with lists, the range excludes the last value. Therefore, to extract positions 2 through 4, the slice must be 2:5.

afewrows = mydata.iloc[2:5]
afewrows
   rank             school  ... total tuition ($)  duration (months)
3     3  Virginia (Darden)  ...            107800                 21
4     4            Harvard  ...            107000                 18
5     5           Columbia  ...            111736                 20

[3 rows x 11 columns]

The .loc indexer, in contrast, is inclusive of the last value in the range.

afewrows = mydata.loc[1:5]
afewrows
   rank             school  ... total tuition ($)  duration (months)
1     1    Chicago (Booth)  ...            106800                 21
2     2   Dartmouth (Tuck)  ...            106980                 21
3     3  Virginia (Darden)  ...            107800                 21
4     4            Harvard  ...            107000                 18
5     5           Columbia  ...            111736                 20

[5 rows x 11 columns]
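The difference between the two indexers is easiest to see when the same slice is given to both. The sketch below uses a made-up DataFrame named toy whose index labels start at 1, as in our data.

```python
import pandas as pd

toy = pd.DataFrame(
    {"school": ["A", "B", "C", "D", "E"]},   # placeholder names
    index=[1, 2, 3, 4, 5],                   # labels start at 1
)

by_label = toy.loc[2:4]       # labels 2, 3 and 4 (end label included)
by_position = toy.iloc[2:4]   # positions 2 and 3 (end position excluded)

print(by_label["school"].tolist())      # ['B', 'C', 'D']
print(by_position["school"].tolist())   # ['C', 'D']
```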

Adding a new column to a DataFrame

To add a new column to an existing DataFrame, use square brackets to name the new column and assign values to it, as shown below.


mydata["newcolumn"] = 0 # every observation for the column will be zero.
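A new column can also be calculated from existing columns. This is a minimal sketch on a made-up DataFrame; the column names flag and cost_per_month are hypothetical.

```python
import pandas as pd

# Invented values for illustration only.
toy = pd.DataFrame({"tuition": [1000, 2400], "months": [10, 12]})

toy["flag"] = 0                                          # same value in every row
toy["cost_per_month"] = toy["tuition"] / toy["months"]   # calculated row by row

print(toy["cost_per_month"].tolist())   # [100.0, 200.0]
print(list(toy.columns))                # ['tuition', 'months', 'flag', 'cost_per_month']
```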

4.5 Summary

  • Python is an object-oriented (OO) programming language. Objects are created from classes. A class defines the methods and attributes (data) of objects of that class type.
  • Data files can be read into Python as a DataFrame using the read_csv() function from the pandas library.
  • A DataFrame is a data structure for storing data in a table-like format.
  • The DataFrame class comes from the pandas library.
  • The key tools for deriving information about a DataFrame include .info(), .shape, .columns, .index, .index.values, and .dtypes.
  • Selecting a column with single square brackets returns a data structure of type Series, as in subset = mydf['school'].
  • Use double square brackets (i.e. a list of column names inside the indexing brackets) to subset a DataFrame and get back an object of type DataFrame, as in subset = mydf[['school']].
  • Use .iloc to extract (or slice) a range of rows from a DataFrame based on integer position. An .iloc slice excludes the last element. Alternatively, use .loc to extract (or slice) a range of rows based on the row label. A .loc slice includes the last element.
  • By default, the row labels (the index) are the positions 0, 1, 2, and so on. The index can be changed by reassignment using mydf.index or by specifying the index upon reading in the data.

Resources

Python Software Foundation (2019). Introduction. Available at: https://docs.python.org/3/library/intro.html

Python Software Foundation (2019). Functions. Available at: https://docs.python.org/3/library/functions.html

https://www.hackerearth.com/practice/machine-learning/data-manipulation-visualisation-r-python/tutorial-data-manipulation-numpy-pandas-python/tutorial/

Basics of OO http://openbookproject.net/thinkcs/python/english3e/classes_and_objects_I.html

Exercise 4.1

  1. Download the portfolio.csv file from http://becomingvisual.com/python4data/portfolio.csv and import it into Python as a DataFrame named portfolio.
  2. Use the appropriate method to show the data types for each column.
  3. Use the appropriate method to show the number of columns and rows in the portfolio DataFrame. Write a pretty sentence that uses this data that reads: There are 14 rows and 4 columns in the portfolio DataFrame.
  4. Rename the first column in the DataFrame to Security.

Exercise 4.2

  1. You decide to purchase 5 more shares of V. Change the number of V shares to 55 (hint: V is in row 9 of the Shares column; use df.loc[row #, 'column name'] = new value).

  2. Add a new column named market_value to your DataFrame that calculates the market value of each security (hint: portfolio.Price*portfolio.Shares).

  3. Use the .sum() method to total your new market_value column to find the value of your portfolio, and call this variable total_value.

Assignment 4

  1. Download the windenergy.csv file from http://becomingvisual.com/python4data/windenergy.csv and import it into python. This data breaks down utility scale of wind turbines per state in the US.

  2. Print out the top 10 states utilizing the most wind power by rank.

  3. Create a new DataFrame that pulls in State, Ranking, Installed Capacity, Num of Turbines, and Homes Powered

  4. Print out the data for our great state of New York!

References