Lesson 4 Objects and Data Structures II
Welcome to lesson 4. In this lesson, you will be introduced to the fundamentals of object-oriented programming. We'll discuss python's pre-installed library and a library called pandas. Next, you'll learn how to import and view external data sets.
Follow along with this tutorial by creating a new ipython notebook named lesson04.ipynb and entering the code snippets presented.
Outline
- Object-oriented programming
- Libraries
- Pandas
- DataFrames
- Summary
- Exercise 4.1
- Exercise 4.2
- Assignment 4
4.1 Object oriented programming
Python is an object-oriented (OO) programming language. The Python code we have written up to this point has been procedural, or top-down, programming. Essentially, we've written short programs in one file (notebook) that achieve a task (or two) in a step-by-step sequence.
In writing those programs, we have referenced and created many objects. We've been able to modify the attributes of those objects by using methods or functions. For example, we've created variables of type str. Strings are a type of object (and data type).
For example, myname = "Kristen" creates a str object with the value "Kristen" and assigns it to the variable myname. You can call methods on the object, such as myname.upper(), which returns an uppercase copy of the string. The variable declaration and assignment created, or instantiated, the string object. Once an object is instantiated (from the str class), you can use that class's methods to work with the attributes or characteristics of the object.
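As a quick check of these ideas, the short sketch below reuses the myname variable to show that a string is an object of the str class and that calling a method on it returns a result. The variable name shouted is only an illustration and is not part of the lesson's code.

```python
# A string is an object created (instantiated) from the str class.
myname = "Kristen"
print(type(myname))        # <class 'str'>

# Methods are called with a period after the object name.
# .upper() returns a NEW string; the original is left unchanged.
shouted = myname.upper()   # 'shouted' is a hypothetical name for illustration
print(shouted)             # KRISTEN
print(myname)              # Kristen
```

Because strings cannot be changed in place, you keep the uppercase version only if you assign the result to a variable.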
To ground your understanding, review these four terms (TK (2017), para 1). We'll begin to use them more frequently throughout the course.
Terminology
Objects: representations of real-world objects like books, cars, cats, etc. Objects share two main characteristics: data and behavior.
Attributes: the data or information about the object. For example, cars have data like number of wheels, number of doors, and seating capacity. An attribute is a value (a characteristic). Think of an attribute as a variable that is stored within an object.
Methods: the behavior of an object. For example, cars accelerate, stop, and show how much fuel is missing. In this course, we've been using the term function. For our purposes, let's assume methods are synonymous with functions.
Class: a blueprint for an object that defines the attributes and behaviors (methods). An object is an instance of a class. In the real world we often find many objects of the same type, like cars of the same make and model (each has an engine, wheels, doors, etc.). Each car was built from the same set of blueprints and has the same components.
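To make the four terms concrete, here is a minimal sketch of a class based on the car example. The class name Car, its attributes, and its methods are all hypothetical and chosen only for illustration; you do not need to write classes yet.

```python
class Car:
    """A blueprint (class) for car objects. All names here are illustrative."""

    def __init__(self, wheels, doors, seats):
        # Attributes: data stored inside each object
        self.wheels = wheels
        self.doors = doors
        self.seats = seats
        self.speed = 0

    def accelerate(self, amount):
        # Method: behavior that changes the object's data
        self.speed = self.speed + amount

    def stop(self):
        # Method: bring the car back to a standstill
        self.speed = 0


# Instantiate (create) two objects from the same class
mycar = Car(wheels=4, doors=2, seats=4)
yourcar = Car(wheels=4, doors=4, seats=5)

mycar.accelerate(30)
print(mycar.speed)    # 30
print(yourcar.speed)  # 0, because each object keeps its own data
print(mycar.doors)    # 2
```

Both cars come from the same blueprint, but each one stores its own attribute values.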
As you progress in your learning of python, you will build your own classes that define the attributes and behaviors of objects. For now, let's use classes from python libraries that already exist to build objects.
4.2 Libraries
"The Python library contains several different kinds of components. It contains data types that would normally be considered part of the core of a language, such as numbers and lists. For these types, the Python language core defines the form of literals and places some constraints on their semantics, but does not fully define the semantics. (On the other hand, the language core does define syntactic properties like the spelling and priorities of operators.) The library also contains built-in functions and exceptions — objects that can be used by all Python code without the need of an import statement" (Python Software Foundation, 2019, para 1.)
In other words, in python there is a set of built-in functions. You can find the full list here: https://docs.python.org/3/library/functions.html
Check out the python standard library: https://docs.python.org/3/library/.
However, Python is a growing language with many contributors creating libraries of new classes that go beyond the built-in functionality. In addition to the libraries that come pre-installed, there are additional libraries that can be installed and used. To use these libraries (after installation), you must reference them explicitly using the import statement.
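The difference between built-in functions and library modules can be seen in a short sketch. It assumes only the math and statistics modules, which ship with Python's standard library; the list values is made up for illustration.

```python
# Built-in functions need no import statement.
values = [3, 1, 4, 1, 5]   # hypothetical data
print(len(values))   # 5
print(max(values))   # 5

# Modules in the standard library are pre-installed,
# but must be imported before use.
import math
import statistics

print(math.sqrt(16))              # 4.0
print(statistics.mean(values))    # 2.8
```

Calling math.sqrt before the import statement has run would raise a NameError, which is why imports usually sit at the top of a notebook.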
4.3 Pandas
A library that we will use extensively in this course is called pandas. It is an open source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. pandas is already installed on Google Colab. To reference it, we use the import statement to import the pandas library. We import it as pd, so pd is the name (alias) we use to refer to pandas in our code.
Try it.
import pandas as pd
You can use any name you wish, within the rules of python. For example, you could write:
import pandas as tomatoes
The name you choose refers to the pandas library, and you use it to access the functions and classes that pandas provides.
Through pandas, you get acquainted with your data by cleaning, transforming, and analyzing it. For example, say you want to explore a data set stored in a .csv file on your computer. CSV stands for comma separated values. Pandas will extract the data from that .csv into a DataFrame (a table, basically), then let you do things like:
- Calculate statistics and answer questions about the data, such as:
  - What's the average, median, max, or min of each column?
  - Does column A correlate with column B?
  - What does the distribution of data in column C look like?
- Clean the data by doing things like removing missing values and filtering rows or columns by some criteria
- Visualize the data with help from Matplotlib. Plot bars, lines, histograms, bubbles, and more.
- Store the cleaned, transformed data back into a CSV, other file, or database (https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/, para 7)
Check out the pandas API documentation that provides a reference and explanation of all the public pandas objects, functions and methods. https://pandas.pydata.org/pandas-docs/stable/reference/index.html
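A few of the tasks listed above can be sketched in code. The DataFrame below is invented purely for illustration (the column names a and b and their values are assumptions), and the sketch skips the plotting step.

```python
import pandas as pd

# Hypothetical data, invented for illustration only
df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2.0, 4.0, 6.0, None],   # None becomes a missing value (NaN)
})

print(df["a"].mean())          # average of column a
print(df["a"].median())        # median of column a
print(df["a"].corr(df["b"]))   # correlation between columns a and b

cleaned = df.dropna()              # remove rows with missing values
big = cleaned[cleaned["a"] > 1]    # keep rows where a is greater than 1

# With no filename, to_csv returns the CSV text; pass a filename to save to disk
csv_text = big.to_csv(index=False)
print(csv_text)
```

Each step returns a new object, so the original df stays available if you want to compare before and after.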
4.4 DataFrames
A DataFrame is a two-dimensional, tabular, column-oriented data structure with both row and column labels (McKinney, p.4) that comes from the pandas library.
In python, we can import and store data as a DataFrame object. We can read a .csv file into python using the read_csv() function from the pandas library.
Since we imported pandas as pd, we can call read_csv using the name pd. The read_csv() function takes several parameters, only one of which is required: the location of the csv file (i.e., the filepath). We assign the resulting DataFrame to the variable mydata.
See the example below.
DO NOT TRY THIS, YET...
import pandas as pd
mydata = pd.read_csv("mba.csv") #basic import
type(mydata)
<class 'pandas.core.frame.DataFrame'>
mydata
Rank School ... Total Tuition ($) Duration (Months)
0 1 Chicago (Booth) ... 106800 21
1 2 Dartmouth (Tuck) ... 106980 21
2 3 Virginia (Darden) ... 107800 21
3 4 Harvard ... 107000 18
4 5 Columbia ... 111736 20
5 6 California At Berkeley (Haas) ... 106792 21
6 7 MIT (Sloan) ... 116400 22
7 8 Stanford ... 114600 21
8 9 IESE ... 95610 19
9 10 IMD ... 67416 11
10 11 New York (Stern) ... 96640 20
11 12 London ... 92144 15
12 13 Pennsylvania (Wharton) ... 107852 21
13 14 HEC Paris ... 66802 16
14 15 Cornell (Johnson) ... 107592 21
15 16 York (Schulich) ... 61800 8
16 17 Carnegie Mellon (Tepper) ... 108272 21
17 18 ESADE ... 81693 12
18 19 INSEAD ... 80719 10
19 20 Northwestern (Kellogg) ... 113100 22
20 21 Emory (Goizueta) ... 87200 22
21 22 IE ... 82389 13
22 23 UCLA (Anderson) ... 105160 21
23 24 Michigan (Ross) ... 105500 20
24 25 Bath ... 36057 12
[25 rows x 11 columns]
You can read the full read_csv() documentation here: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-read-csv-table
With all data files, it's important to understand how the data is organized. All variables should be stored as columns and all observations as rows. Usually, csv files have a header. A header is the row that holds the column names. This is typically the first row of the file. Therefore, as best practice, you should specify header=0 (since python starts counting at 0) to indicate that the first row contains the variable names.
import pandas as pd
mydata = pd.read_csv("mba.csv", header=0) #add header
mydata
Rank School ... Total Tuition ($) Duration (Months)
0 1 Chicago (Booth) ... 106800 21
1 2 Dartmouth (Tuck) ... 106980 21
2 3 Virginia (Darden) ... 107800 21
3 4 Harvard ... 107000 18
4 5 Columbia ... 111736 20
5 6 California At Berkeley (Haas) ... 106792 21
6 7 MIT (Sloan) ... 116400 22
7 8 Stanford ... 114600 21
8 9 IESE ... 95610 19
9 10 IMD ... 67416 11
10 11 New York (Stern) ... 96640 20
11 12 London ... 92144 15
12 13 Pennsylvania (Wharton) ... 107852 21
13 14 HEC Paris ... 66802 16
14 15 Cornell (Johnson) ... 107592 21
15 16 York (Schulich) ... 61800 8
16 17 Carnegie Mellon (Tepper) ... 108272 21
17 18 ESADE ... 81693 12
18 19 INSEAD ... 80719 10
19 20 Northwestern (Kellogg) ... 113100 22
20 21 Emory (Goizueta) ... 87200 22
21 22 IE ... 82389 13
22 23 UCLA (Anderson) ... 105160 21
23 24 Michigan (Ross) ... 105500 20
24 25 Bath ... 36057 12
[25 rows x 11 columns]
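To see what the header parameter changes, the sketch below reads a tiny CSV held in memory rather than a file. The two data rows reuse tuition values from the output above, and the names csv_text, with_header, and no_header are assumptions made for illustration.

```python
import io
import pandas as pd

# Small CSV content held in memory instead of a file
csv_text = "School,Tuition\nHarvard,107000\nStanford,114600\n"

# header=0: the first row supplies the column names
with_header = pd.read_csv(io.StringIO(csv_text), header=0)
print(with_header.columns.tolist())   # ['School', 'Tuition']
print(with_header.shape)              # (2, 2)

# header=None: no row is treated as names, so the first row becomes data
no_header = pd.read_csv(io.StringIO(csv_text), header=None)
print(no_header.columns.tolist())     # [0, 1]
print(no_header.shape)                # (3, 2)
```

When no column names are supplied, pandas treats the first row as the header by default, so header=0 mainly makes the intent explicit.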
Using Google Colab to import data in python
Before attempting to replicate the code above, you will need to download the mba.csv file from (http://becomingvisual.com/python4data/mba.csv) to your computer.
Next, implement one of the options below to read in the mba.csv data.
Option 1: Read in file directly from a URL
import pandas as pd
mydata = pd.read_csv("http://becomingvisual.com/python4data/mba.csv", header=0)
Option 2: Upload a file and then read it in
##Code to allow you to directly upload your .csv
from google.colab import files
uploaded = files.upload()
##Modify read_csv to reference the .csv uploaded above
import pandas as pd
import io
mydata = pd.read_csv(io.BytesIO(uploaded['mba.csv']), header=0)
Viewing your data
Once you have imported your data, take a look at it. You can see a preview by typing the name of your DataFrame, mydata.
mydata
Rank School ... Total Tuition ($) Duration (Months)
0 1 Chicago (Booth) ... 106800 21
1 2 Dartmouth (Tuck) ... 106980 21
2 3 Virginia (Darden) ... 107800 21
3 4 Harvard ... 107000 18
4 5 Columbia ... 111736 20
5 6 California At Berkeley (Haas) ... 106792 21
6 7 MIT (Sloan) ... 116400 22
7 8 Stanford ... 114600 21
8 9 IESE ... 95610 19
9 10 IMD ... 67416 11
10 11 New York (Stern) ... 96640 20
11 12 London ... 92144 15
12 13 Pennsylvania (Wharton) ... 107852 21
13 14 HEC Paris ... 66802 16
14 15 Cornell (Johnson) ... 107592 21
15 16 York (Schulich) ... 61800 8
16 17 Carnegie Mellon (Tepper) ... 108272 21
17 18 ESADE ... 81693 12
18 19 INSEAD ... 80719 10
19 20 Northwestern (Kellogg) ... 113100 22
20 21 Emory (Goizueta) ... 87200 22
21 22 IE ... 82389 13
22 23 UCLA (Anderson) ... 105160 21
23 24 Michigan (Ross) ... 105500 20
24 25 Bath ... 36057 12
[25 rows x 11 columns]
Alternatively, you can use methods such as .head() or .tail() to show the beginning and end of the DataFrame, respectively.
The .head() method by default outputs the first five rows of your DataFrame. You call a method by writing the mydata object followed by a period and then the method name. See below.
mydata.head() # outputs the first five rows
Rank School ... Total Tuition ($) Duration (Months)
0 1 Chicago (Booth) ... 106800 21
1 2 Dartmouth (Tuck) ... 106980 21
2 3 Virginia (Darden) ... 107800 21
3 4 Harvard ... 107000 18
4 5 Columbia ... 111736 20
[5 rows x 11 columns]
.head() outputs the first five rows of your DataFrame by default, but we can also pass a number: mydata.head(10) would output the top ten rows, for example.
mydata.head(10) # outputs the first ten rows
Rank School ... Total Tuition ($) Duration (Months)
0 1 Chicago (Booth) ... 106800 21
1 2 Dartmouth (Tuck) ... 106980 21
2 3 Virginia (Darden) ... 107800 21
3 4 Harvard ... 107000 18
4 5 Columbia ... 111736 20
5 6 California At Berkeley (Haas) ... 106792 21
6 7 MIT (Sloan) ... 116400 22
7 8 Stanford ... 114600 21
8 9 IESE ... 95610 19
9 10 IMD ... 67416 11
[10 rows x 11 columns]
To see the last five rows use .tail(). .tail() also accepts a number; in this case we print the last three rows.
mydata.tail(3) # outputs the last three rows
Rank School ... Total Tuition ($) Duration (Months)
22 23 UCLA (Anderson) ... 105160 21
23 24 Michigan (Ross) ... 105500 20
24 25 Bath ... 36057 12
[3 rows x 11 columns]
Getting info about your data
The .info() method provides the essential details about your data set, such as the number of rows and columns, the number of non-null values, what type of data is in each column, and how much memory your DataFrame is using.
mydata.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 11 columns):
Rank 25 non-null int64
School 25 non-null object
Country 25 non-null object
AvgSalary 25 non-null int64
PreSalary 25 non-null int64
GradJobs 25 non-null int64
PhD 25 non-null int64
Avg. Age of Students 25 non-null int64
Avg. work exp. of students (Months) 25 non-null int64
Total Tuition ($) 25 non-null int64
Duration (Months) 25 non-null int64
dtypes: int64(9), object(2)
memory usage: 2.2+ KB
A fast and useful attribute is .shape, which outputs a tuple that shows the number of rows and columns:
mydata.shape
(25, 11)
Notice that we use .shape without parentheses. This is because we are accessing an attribute of our DataFrame object rather than calling a method. As a reminder, an attribute is a value (a characteristic). Think of an attribute as a variable that is stored within an object.
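The difference between an attribute and a method can be sketched with a small DataFrame. The data here is a hypothetical three-row table (the duration values are taken from the output above), and the names df, rows, and cols are assumptions for illustration.

```python
import pandas as pd

# Hypothetical DataFrame for illustration
df = pd.DataFrame({"school": ["Harvard", "Stanford", "IMD"],
                   "duration": [18, 21, 11]})

# .shape is an attribute: no parentheses. It is a tuple, so it can be unpacked.
rows, cols = df.shape
print(f"There are {rows} rows and {cols} columns.")

# .head() is a method: parentheses are required
print(df.head(2))
```

Unpacking the tuple into two variables makes it easy to reuse the row and column counts in a sentence or a later calculation.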
Accessing and renaming columns
Many times data sets will have verbose column names with symbols, upper and lowercase words, spaces, and typos. To make selecting data by column name easier we can spend a little time cleaning up their names.
The attribute .columns comes in handy if you want to see the column names, and its output makes renaming easier by allowing for simple copy and paste.
We can assign a list of names to .columns to rename the columns.
mydata.columns # show original column names
Index(['Rank', 'School', 'Country', 'AvgSalary', 'PreSalary', 'GradJobs',
'PhD', 'Avg. Age of Students', 'Avg. work exp. of students (Months)',
'Total Tuition ($)', 'Duration (Months)'],
dtype='object')
mydata.columns = ['Rank', 'School', 'Country', 'Average_salary', 'Pre_Salary', 'Grad_Jobs', 'PhD', 'Avg_Age_Students',
'Avg_Work_Experience', 'Tuition', 'Duration']
What if we wanted to make all the column names lowercase? We could repeat what we did above, but there's an easier way. Instead of renaming each column manually, we can use a list comprehension (a more advanced concept, but try the code below anyway):
mydata.columns = [col.lower() for col in mydata]
mydata.columns
Index(['rank', 'school', 'country', 'average_salary', 'pre_salary',
'grad_jobs', 'phd', 'avg_age_students', 'avg_work_experience',
'tuition', 'duration'],
dtype='object')
The code [col.lower() for col in mydata] is quite simple. col is a variable that holds each column name (a string) in turn. You can call it pineapples; the name doesn't matter. We then call the .lower() method from the string class on it. The for loops over the column names in the DataFrame so that each one is changed to lowercase lettering.
Aside: List comprehensions are used for creating a new list from another iterable. Because a list comprehension returns a list, it consists of square brackets containing the expression to be executed for each element, along with the for loop that iterates over each element. The square brackets signify that the output is a list.
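A list comprehension is a compact way of writing a for loop that builds a list. The sketch below shows both versions side by side on a plain list of names; the list names and the variables lowered_loop and lowered_comp are hypothetical.

```python
# A plain list of column names (a subset, for illustration)
names = ["Rank", "School", "Total Tuition ($)"]

# Version 1: a regular for loop that builds a new list
lowered_loop = []
for col in names:
    lowered_loop.append(col.lower())

# Version 2: the equivalent list comprehension
lowered_comp = [col.lower() for col in names]

print(lowered_comp)                  # ['rank', 'school', 'total tuition ($)']
print(lowered_loop == lowered_comp)  # True
```

Both versions produce the same list; the comprehension simply fits the loop and the append into one line.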
Accessing and renaming rows
Now that we looked at how to access and rename columns, let's look at the rows in our data.
To see the names of the rows we look at the index. The index, by default, begins at zero. Therefore the first row index will be 0.
mydata.index.values will return the names of your index (or rows). Let's try it.
mydata.index #returns the range of index values
RangeIndex(start=0, stop=25, step=1)
mydata.index.values # returns the index values
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24])
To rename the index, you can 1) set another column, such as Rank, to be the index when you read in the data, or 2) manually assign a list of equal length to the index by updating it using .index.
Setting another column as the index when you read in data
#set index to Rank. Note: since we reimported our data, the column names are in mixed case again, not lowercase.
mydata = pd.read_csv("http://becomingvisual.com/python4data/mba.csv", header=0, index_col="Rank")
#change to lowercase
mydata.columns = [col.lower() for col in mydata]
print(mydata.index.values)
[ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
25]
Alternatively, read in the data first and then set the rank column as the index:
mydata = pd.read_csv("http://becomingvisual.com/python4data/mba.csv", header=0)
#change to lowercase
mydata.columns = [col.lower() for col in mydata]
mydata.index =mydata["rank"]
print(mydata.index.values)
[ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
25]
print(mydata.loc[1,:]) #prints the first row and all columns
rank 1
school Chicago (Booth)
country US
avgsalary 113217
presalary 63
gradjobs 93
phd 96
avg. age of students 27
avg. work exp. of students (months) 60
total tuition ($) 106800
duration (months) 21
Name: 1, dtype: object
Updating the index using .index
mydata = pd.read_csv("http://becomingvisual.com/python4data/mba.csv", header=0)
#change to lowercase
mydata.columns = [col.lower() for col in mydata]
mydata.index = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25]
mydata.index.values
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23, 24, 25])
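Typing out every label by hand becomes tedious for larger data sets. As a hedged alternative, the sketch below builds the labels with range() on a hypothetical three-row DataFrame (the name df and its contents are assumptions) and then checks that no label is repeated.

```python
import pandas as pd

# Hypothetical three-row DataFrame for illustration
df = pd.DataFrame({"school": ["Harvard", "Stanford", "IMD"]})
print(df.index.values)       # [0 1 2]

# Build the labels 1..n instead of typing them by hand
df.index = range(1, len(df) + 1)
print(df.index.values)       # [1 2 3]

# Check whether any row label is repeated
print(df.index.is_unique)    # True
```

The new index must have exactly as many entries as the DataFrame has rows, which is why the range ends at len(df) + 1.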
Note: Index values for each row should be unique.
DataFrame slicing by column
Just as we did with Lists in lesson 3, we can extract subsets of a DataFrame.
You can extract a single column from mydata by using square brackets like this:
school_col = mydata['school'] #use lowercase
When we extract columns using the square brackets the result is a new data structure called a Series.
type(school_col)
<class 'pandas.core.series.Series'>
"A Series is a one-dimensional array-like object containing an array of data (of NumPy data type) and associated array of data labels, called its index" (McKinney, 2013, p. 112). Series are similar to Lists.
To extract a column as a DataFrame, you need to pass a list of column names. In our case that's just a single column:
school_col = mydata[['school']]
type(school_col)
<class 'pandas.core.frame.DataFrame'>
To extract more than one column, simply add it to the list of items to extract.
subset_mydata = mydata[['school', 'rank']]
subset_mydata.head()
school rank
1 Chicago (Booth) 1
2 Dartmouth (Tuck) 2
3 Virginia (Darden) 3
4 Harvard 4
5 Columbia 5
DataFrame extracting by row
For rows, we have two options:
- .loc locates the row by name (the name is the index label, which matches the row's position only if you have not assigned a different index).
- .iloc locates the row by numerical position (counting from 0).
Let's try using .loc. Below we use the label 10. Because we set the index to run from 1 to 25, label 10 refers to the school ranked 10th, not the 11th row.
nyu = mydata.loc[10]
nyu
rank 10
school IMD
country Switz.
avgsalary 145264
presalary 77
gradjobs 95
phd 100
avg. age of students 31
avg. work exp. of students (months) 84
total tuition ($) 67416
duration (months) 11
Name: 10, dtype: object
With .iloc, positions are counted from 0, so to extract the 11th row we use 10 as the position.
nyu = mydata.iloc[10]
nyu
rank 11
school New York (Stern)
country US
avgsalary 105798
presalary 49
gradjobs 93
phd 100
avg. age of students 27
avg. work exp. of students (months) 55
total tuition ($) 96640
duration (months) 20
Name: 11, dtype: object
To select rows 3 (position 2) through 5 (position 4), use the colon. Note, just as with Lists, the range is exclusive of the last value. Therefore, to extract positions 2 through 4, 2:5 must be the parameter.
afewrows = mydata.iloc[2:5]
afewrows
rank school ... total tuition ($) duration (months)
3 3 Virginia (Darden) ... 107800 21
4 4 Harvard ... 107000 18
5 5 Columbia ... 111736 20
[3 rows x 11 columns]
In contrast, .loc is inclusive of the last value in the range.
afewrows = mydata.loc[1:5]
afewrows
rank school ... total tuition ($) duration (months)
1 1 Chicago (Booth) ... 106800 21
2 2 Dartmouth (Tuck) ... 106980 21
3 3 Virginia (Darden) ... 107800 21
4 4 Harvard ... 107000 18
5 5 Columbia ... 111736 20
[5 rows x 11 columns]
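The two indexers are easy to mix up when the row labels do not start at 0. The following sketch uses a hypothetical five-row DataFrame with labels 1 through 5 (the name df and the school values A to E are assumptions) to compare them directly.

```python
import pandas as pd

# Hypothetical DataFrame whose row labels start at 1 rather than 0
df = pd.DataFrame({"school": ["A", "B", "C", "D", "E"]},
                  index=[1, 2, 3, 4, 5])

# .loc uses row labels
print(df.loc[2, "school"])     # B  (row labeled 2)
# .iloc uses integer positions, counted from 0
print(df.iloc[2, 0])           # C  (third row, position 2)

# Slices: .loc includes the end label, .iloc excludes the end position
print(len(df.loc[2:4]))        # 3  (labels 2, 3, 4)
print(len(df.iloc[2:4]))       # 2  (positions 2, 3)
```

The same number can therefore select different rows depending on which indexer you use, so it helps to check the index before slicing.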
Adding a new column to a DataFrame
To add a new column to an existing DataFrame, use square brackets to name the new column and assign values to it, as shown below.
mydata["newcolumn"] = 0 # every observation for the column will be zero.
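A new column does not have to hold a single constant. As a sketch, the example below also fills a column from a calculation on other columns and from a list. The DataFrame is hypothetical (its two rows reuse tuition and duration values from the output above), and the column names per_month and label are assumptions.

```python
import pandas as pd

# Hypothetical two-row DataFrame for illustration
df = pd.DataFrame({"tuition": [107000, 114600],
                   "duration": [18, 21]})

df["newcolumn"] = 0                                # same value in every row
df["per_month"] = df["tuition"] / df["duration"]   # computed row by row
df["label"] = ["x", "y"]                           # a list with one value per row

print(df.shape)   # (2, 5)
```

When you assign a list, it must contain exactly one value for each row of the DataFrame.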
4.5 Summary
- Python is an object-oriented (OO) programming language.
- Objects are created from classes. A class defines the methods, data, and attributes related to an object of that class type.
- Data files can be read into python as a DataFrame using the read_csv() function from the pandas library.
- A DataFrame is a data structure for storing data in a table-like format.
- DataFrame comes from the pandas library.
- The key methods and attributes for deriving information about a DataFrame include: .info(), .shape, .columns, .index, .index.values, and .dtypes.
- Selecting from a DataFrame returns a data structure of type Series when single square brackets are used, as in subset = mydf[1].
- Use double square brackets (i.e., a list inside the indexing brackets) to subset DataFrames and return objects of type DataFrame, as in subset = mydf[[1]].
- Use .iloc to extract (or slice) a range of rows from a DataFrame based on the index position. .iloc is exclusive of the last element. Alternatively, use .loc to extract (or slice) a range of rows from a DataFrame based on the row name. .loc is inclusive of the last element, however.
- In most cases the row name will be the same as the index. The index can be changed by reassignment using mydf.index or by specifying the index upon reading in the data.
Resources
Python Software Foundation (2019). Introduction. Available at: https://docs.python.org/3/library/intro.html
Python Software Foundation (2019). Functions. Available at: https://docs.python.org/3/library/functions.html
Basics of OO. Available at: http://openbookproject.net/thinkcs/python/english3e/classes_and_objects_I.html
Exercise 4.1
- Download the portfolio.csv file from http://becomingvisual.com/python4data/portfolio.csv and import it into Python as a DataFrame named portfolio.
- Use the appropriate method to show the data types for each column.
- Use the appropriate method to show the number of columns and rows in the portfolio DataFrame. Write a pretty sentence that uses this data that reads: There are 14 rows and 4 columns in the portfolio DataFrame.
- Rename the first column in the DataFrame to Security.
Exercise 4.2
- You decide to purchase 5 more shares of V. Change the number of V shares to 55 (hint: row 9 of the Shares column; df.loc[row #, 'column name'] = new value).
- Add a new column to your DataFrame that calculates the 'market_value' of each security (hint: portfolio.Price*portfolio.Shares).
- Use the .sum() function to sum the total of your new market_value column to find the value of your portfolio, and call this variable 'total_value'.
Assignment 4
Download the windenergy.csv file from http://becomingvisual.com/python4data/windenergy.csv and import it into python. This data breaks down utility scale of wind turbines per state in the US.
Print out the top 10 states utilizing the most wind power by rank.
Create a new DataFrame that pulls in State, Ranking, Installed Capacity, Num of Turbines, and Homes Powered.
Print out the data for our great state of New York!
References
TK. 2017. “Python 101: Object Oriented Programming Part 1.” https://medium.com/the-renaissance-developer/python-101-object-oriented-programming-part-1-7d5d06833f26.