Lesson 4 Objects and Data Structures II
Welcome to lesson 4. In this lesson, you will be introduced to the fundamentals of object-oriented programming. We’ll discuss python’s pre-installed library and a library called pandas. Next, you’ll learn how to import and view external data sets.
Follow along with this tutorial by creating a new ipython notebook named lesson04.ipynb and entering the code snippets presented.
Outline
- Object-oriented programming
- Libraries and modules
- The datetime module
- Pandas
- Pandas DataFrames
- Summary
- Exercise 4.1
- Exercise 4.2
- Assignment 4
4.1 Object-oriented programming
Python is an object-oriented (OO) programming language. The python code we have written up until this point has been in the form of procedural or top-down programming. Essentially, we’ve written short programs in one file (notebook) that achieve a task (or two) in a step-by-step sequence.
In writing those programs, we have referenced and created many objects. We’ve been able to modify the attributes of those objects by using methods or functions. For example, we’ve created variables of type str. Strings are a type of object (and data type).
For example, myname = "Kristen" creates a str object with the value "Kristen" and assigns it to the variable myname. You can then call methods on the object, such as myname.upper(). Assigning the string to the variable created, or instantiated, the string object. Once an object is instantiated (from the str class), you can use the methods of that class to work with it. Note that strings are immutable: a method such as .upper() returns a new string rather than changing the original.
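To see instantiation and a method call side by side, here is a minimal sketch using the same myname variable (the name shout is only an illustration):

```python
# Instantiate a str object and bind it to the name myname
myname = "Kristen"

# type() reports the class the object was instantiated from
print(type(myname))      # <class 'str'>

# .upper() returns a new string; the original is unchanged
shout = myname.upper()
print(shout)             # KRISTEN
print(myname)            # Kristen
```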
To ground your understanding, review these four terms (TK (2017), para 1). We’ll begin to use them more frequently throughout the course.
Terminology
Objects: a representation of real-world objects like books, cars, cats, etc. The objects share two main characteristics: data and behavior.
Attributes: the data or information about the object. For example, cars have data like the number of wheels, number of doors, and seating capacity. An attribute is a value (a characteristic). Think of an attribute as a variable that is stored within an object.
Methods: the behavior of an object. For example, cars accelerate, stop, and show how much fuel is left. In this course, we’ve been using the term function. For our purposes, let’s assume methods are synonymous with functions.
Class: a blueprint for an object that defines the attributes and behaviors (methods). An object is an instance of a class. In the real world, we often find many objects of the same type, like cars of the same make and model (each has an engine, wheels, doors, etc.). Each car was built from the same set of blueprints and has the same components.
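The four terms can be tied together in a short sketch. The Car class below, its attribute names, and its methods are all hypothetical and chosen only to mirror the car example in the terminology:

```python
class Car:
    """Blueprint (class) for car objects. Hypothetical example."""

    def __init__(self, wheels, doors, fuel):
        # Attributes: data stored within each object
        self.wheels = wheels
        self.doors = doors
        self.fuel = fuel

    def accelerate(self, amount):
        """Method: behavior that uses up some fuel."""
        self.fuel = max(0, self.fuel - amount)

    def fuel_left(self):
        """Method: report how much fuel remains."""
        return self.fuel


# Two objects (instances) built from the same blueprint
my_car = Car(wheels=4, doors=2, fuel=50)
your_car = Car(wheels=4, doors=4, fuel=30)

my_car.accelerate(5)
print(my_car.fuel_left())    # 45
print(your_car.doors)        # 4
```

Each object keeps its own attribute values, while the methods are defined once in the class.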
As you progress in your learning of python, you will build your own classes that define the attributes and behaviors of objects. For now, let’s use classes from python libraries that already exist to build objects.
4.2 Libraries and Modules
“The Python library contains several different kinds of components. It contains data types that would normally be considered part of the core of a language, such as numbers and lists. For these types, the Python language core defines the form of literals and places some constraints on their semantics but does not fully define the semantics. (On the other hand, the language core does define syntactic properties like the spelling and priorities of operators.) The library also contains built-in functions and exceptions — objects that can be used by all Python code without the need of an import statement” (Python Software Foundation, 2019, para 1.)
In other words, in python there is a set of built-in functions. As you recall, you can find the full list at https://docs.python.org/3/library/functions.html and the full python standard library at https://docs.python.org/3/library/.
However, python is a growing language with many contributors creating libraries (or modules) of new classes that go beyond the built-in functionality. In addition to the libraries that come pre-installed, there are additional libraries that can be installed and used. To use any of these libraries (after installation), you must reference them explicitly using the import statement.
For example, let’s look at the datetime module. The datetime module comes built into Python, so there is no need to install it separately. The code below uses the import statement to reference the datetime module. This gives us access to the classes and methods from the module, such as the date class and its today() method.
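A minimal sketch of that import, assuming the plain import datetime form so that the date class is reached through the module name:

```python
import datetime

# date is a class inside the datetime module; today() is one of its methods
today = datetime.date.today()
print(today)
```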
## 2023-09-15
4.3 The datetime module
“Date formatting is one of the most important tasks that you will face as a programmer. Different regions around the world have different ways of representing dates/times, therefore your goal as a programmer is to present the date values in a way that is readable to the users” (https://stackabuse.com/how-to-format-dates-in-python/, 2020, para 2).
Several classes exist in the datetime module. The classes are described in the python documentation at: https://docs.python.org/3/library/datetime.html#module-datetime
Some classes in the datetime module include:
- date: Based on the Gregorian calendar, with the attributes year, month, and day.
- time: An idealized time, independent of any particular day, assuming that every day has exactly 24 hours, 60 minutes per hour, and 60 seconds per minute. Its attributes are hour, minute, second, microsecond, and tzinfo.
- datetime: A combination of date and time, with the attributes year, month, day, hour, minute, second, microsecond, and tzinfo.
- timedelta: A duration expressing the difference between two date, time, or datetime instances to microsecond resolution.
Refer to https://www.geeksforgeeks.org/python-datetime-module-with-examples/ for more information.
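The four classes listed above can be tried out with fixed values. This is only a sketch; the dates and times are illustrative and chosen so the output does not depend on when you run it:

```python
from datetime import date, time, datetime, timedelta

# Fixed, illustrative values so the output does not depend on today's date
d = date(2020, 9, 29)
t = time(9, 15, 31)
dt = datetime.combine(d, t)      # date + time -> datetime

print(d.year, d.month, d.day)    # 2020 9 29
print(t.hour, t.minute)          # 9 15
print(dt)                        # 2020-09-29 09:15:31

# Subtracting two dates gives a timedelta
gap = date(2020, 10, 1) - d
print(gap.days)                  # 2
print(d + timedelta(days=7))     # 2020-10-06
```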
Your python code gains access to the code in another module by importing it. The import statement is the most common way of accessing the datetime module, but it is not the only way.
An alternative to referencing both the module and the class is to use the from keyword to reference the module and then the import statement to reference the date class, as shown below:
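A minimal sketch of the from ... import form described above:

```python
from datetime import date

# date is imported directly, so no "datetime." prefix is needed
today = date.today()
print(today)
```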
## 2023-09-15
If you want to change the format of the date returned by the .today() method, you can use a formatting method called .strftime().
For example, you may need to represent a date value numerically like 09-29-2020. Or you may need to write the same date value in a longer textual format like September 29, 2020. In another scenario, you may want to extract the month in string format from a numerically formatted date value.
Let’s extract the current month as a numeric output.
from datetime import date
month=date.today().strftime('%m')
print("We are in month", month, end=".")
## We are in month 09.
Note: Although this output looks numeric, the data type of the month variable is str.
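In the above code, %m is a format code. The strftime() method takes a format string containing one or more format codes as its argument and returns a formatted string based on it.
The table below shows the codes that you can pass to the strftime() method. The codes with a hyphen (such as %-d) are platform-specific: they work on Linux and macOS (and therefore on Google Colab) but not on Windows.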
Format code list for dates and times
Directive | Meaning | Example |
---|---|---|
%a | Abbreviated weekday name. | Sun, Mon, … |
%A | Full weekday name. | Sunday, Monday, … |
%w | Weekday as a decimal number. | 0, 1, …, 6 |
%d | Day of the month as a zero-padded decimal. | 01, 02, …, 31 |
%-d | Day of the month as a decimal number. | 1, 2, …, 31 |
%b | Abbreviated month name. | Jan, Feb, …, Dec |
%B | Full month name. | January, February, … |
%m | Month as a zero-padded decimal number. | 01, 02, …, 12 |
%-m | Month as a decimal number. | 1, 2, …, 12 |
%y | Year without century as a zero-padded decimal number. | 00, 01, …, 99 |
%-y | Year without century as a decimal number. | 0, 1, …, 99 |
%Y | Year with century as a decimal number. | 2013, 2019 etc. |
%H | Hour (24-hour clock) as a zero-padded decimal number. | 00, 01, …, 23 |
%-H | Hour (24-hour clock) as a decimal number. | 0, 1, …, 23 |
%I | Hour (12-hour clock) as a zero-padded decimal number. | 01, 02, …, 12 |
%-I | Hour (12-hour clock) as a decimal number. | 1, 2, … 12 |
%p | Locale’s AM or PM. | AM, PM |
%M | Minute as a zero-padded decimal number. | 00, 01, …, 59 |
%-M | Minute as a decimal number. | 0, 1, …, 59 |
%S | Second as a zero-padded decimal number. | 00, 01, …, 59 |
%-S | Second as a decimal number. | 0, 1, …, 59 |
%f | Microsecond as a decimal number, zero-padded on the left. | 000000 - 999999 |
%z | UTC offset in the form +HHMM or -HHMM. | |
%Z | Time zone name. | |
%j | Day of the year as a zero-padded decimal number. | 001, 002, …, 366 |
%-j | Day of the year as a decimal number. | 1, 2, …, 366 |
%U | Week number of the year (Sunday as the first day of the week). All days in a new year preceding the first Sunday are considered to be in week 0. | 00, 01, …, 53 |
%W | Week number of the year (Monday as the first day of the week). All days in a new year preceding the first Monday are considered to be in week 0. | 00, 01, …, 53 |
%c | Locale’s appropriate date and time representation. | Mon Sep 30 07:06:05 2013 |
%x | Locale’s appropriate date representation. | 09/30/13 |
%X | Locale’s appropriate time representation. | 07:06:05 |
%% | A literal ‘%’ character. | % |
To learn more go to: https://www.programiz.com/python-programming/datetime/strftime
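Several format codes can be combined in one format string. The sketch below uses a fixed, illustrative datetime so the results are reproducible; the month name and AM/PM text assume an English (default) locale:

```python
from datetime import datetime

# A fixed, illustrative moment so the output is reproducible
moment = datetime(2020, 9, 29, 14, 5, 9)

numeric = moment.strftime("%m-%d-%Y")
long_form = moment.strftime("%B %d, %Y")
clock = moment.strftime("%I:%M %p")

print(numeric)     # 09-29-2020
print(long_form)   # September 29, 2020 (month name depends on locale)
print(clock)       # 02:05 PM (AM/PM text depends on locale)
```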
Let’s revise our code above and print out the actual name of the month.
from datetime import date
current_month_text = date.today().strftime('%B')
print("We are in the month of", current_month_text, end=".")
## We are in the month of September.
I encourage you to try out working with the other date format codes.
What if you wanted to access the time or the date AND the time? In addition to the date class, we can also work with the datetime class to extract the date and time, and the pytz module to set our current timezone.
# importing datetime module for now()
import datetime
# using now() to get current time
current_time = datetime.datetime.now()
print(current_time)
## 2023-09-15 11:46:08.079488
If you run this code on Google Colab, you may not notice that the output is incorrect: you will see a full date with a time that does not match your clock. This is because the time shown is Greenwich Mean Time (GMT), not the time in your timezone. To localize the time, you’ll need to set the timezone using the timezone() function from the pytz module.
# importing datetime module for now()
import datetime
# for timezone()
import pytz
# using now() to get current time
current_time = datetime.datetime.now(pytz.timezone("US/Eastern"))
print(current_time)
print(type(current_time))
You can access the attributes of the datetime object returned by datetime.now().
The attributes of this object are:
- year
- month
- day
- hour
- minute
- second
- microsecond
You can reference these attributes by writing the name of your datetime object followed by the specific attribute, for example current_time.hour. See below.
# importing datetime module for now()
import datetime
import pytz
current_time = datetime.datetime.now(pytz.timezone("US/Eastern"))
print ("The time is: ", current_time.hour,":",current_time.strftime('%M'), current_time.strftime('%p'), sep="", end=".")
You’ll notice I only call the .hour attribute and use .strftime to properly format the minutes and to print AM or PM. But something may still look off. The time is based on a 24-hour clock, rather than the 12-hour clock that many of us are accustomed to seeing when the AM/PM designation is provided.
Try it yourself
Print out the current time on a 12-hour clock with no zero-padding for the hour and zero-padding for the minutes. Add whether it is AM or PM.
# importing datetime module for now()
import datetime
import pytz
current_time = datetime.datetime.now(pytz.timezone("US/Eastern"))
print ("The time is: ", current_time.strftime('%-I'),":",current_time.strftime('%M'), current_time.strftime('%p'), sep="", end=".")
Converting back to a date format
There will be times where you may have data that looks like datetime format but is a string. You will need to cast or convert these objects to the proper datetime object (i.e. data type). Obviously, your string will need to be in the right format to convert it to a datetime object.
from datetime import datetime
dt_string = "09/08/2020 09:15:31"
# Considering date is in dd/mm/yyyy format
dt_object1 = datetime.strptime(dt_string, "%d/%m/%Y %H:%M:%S")
print("dt_object1 =", dt_object1)
## dt_object1 = 2020-08-09 09:15:31
# Considering date is instead in the mm/dd/yyyy format
dt_object2 = datetime.strptime(dt_string, "%m/%d/%Y %H:%M:%S")
print("dt_object2 =", dt_object2)
## dt_object2 = 2020-09-08 09:15:31
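When the format string does not match the text, strptime() raises a ValueError. A minimal sketch of what that looks like, reusing the dt_string value from above (the parsed_ok flag is only for illustration):

```python
from datetime import datetime

dt_string = "09/08/2020 09:15:31"

try:
    # The format expects yyyy-mm-dd, but the string uses slashes
    datetime.strptime(dt_string, "%Y-%m-%d %H:%M:%S")
    parsed_ok = True
except ValueError as err:
    parsed_ok = False
    print("Could not parse:", err)

# With a matching format the conversion succeeds
dt_object = datetime.strptime(dt_string, "%m/%d/%Y %H:%M:%S")
print(type(dt_object))
```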
4.4 Pandas
Another library that we will use extensively in this course is called pandas. It is an open-source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. pandas is already installed on Google Colab. To reference it, we need to use the import statement to import the pandas library. We import it as pd. pd is our alias for the pandas module.
Try it.
You can use any name you wish, within the rules of python. For example, you could write:
The name you choose becomes the alias through which you reference the functions and classes that pandas provides.
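A minimal sketch of the import, assuming pandas is installed (as it is on Google Colab). The second alias, my_pandas, is a hypothetical name used only to show that the choice is yours:

```python
# Conventional alias
import pandas as pd

# Any valid python name works; "my_pandas" is a hypothetical alias
import pandas as my_pandas

# Both names refer to the same module object
print(pd is my_pandas)       # True
print(type(pd))              # <class 'module'>
```

Sticking with pd is recommended, since nearly all pandas documentation and examples use it.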
Through pandas, you get acquainted with your data by cleaning, transforming, and analyzing it. For example, say you want to explore a data set stored in a .csv file on your computer. CSV stands for comma-separated values. Pandas will extract the data from that .csv into a DataFrame (a table, basically) and then let you do things like:
- Calculate statistics and answer questions about the data, such as:
  - What’s the average, median, max, or min of each column?
  - Does column A correlate with column B?
  - What does the distribution of data in column C look like?
- Clean the data by doing things like removing missing values and filtering rows or columns by some criteria.
- Visualize the data with help from Matplotlib. Plot bars, lines, histograms, bubbles, and more.
- Store the cleaned, transformed data back into a CSV, other file, or database (https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/, para 7).
Check out the pandas API documentation that provides a reference and explanation of all the public pandas objects, functions, and methods. https://pandas.pydata.org/pandas-docs/stable/reference/index.html
4.5 DataFrames
A DataFrame is a two-dimensional, tabular, column-oriented data structure with both row and column labels (McKinney, p.4) that comes from the pandas library.
In python, we can import and store data as a DataFrame object.
- The pandas I/O tools contain a set of readers and writers.
- Pandas supports many file formats and data sources out of the box (csv, excel, sql, json, …).
- Importing data from each of these data sources is handled by a reader function with the prefix pandas.read_*.
- The corresponding writer functions are object methods that are accessed like DataFrame.to_*().
4.5.1 Readers and writers
The most common readers and writers that we will use include:
Format Type | Data Description | Reader | Writer |
---|---|---|---|
text | CSV | read_csv | to_csv |
text | JSON | read_json | to_json |
text | HTML | read_html | to_html |
text | XML | read_xml | to_xml |
binary | MS Excel | read_excel | to_excel |
SQL | SQL | read_sql | to_sql |
For a full list: https://pandas.pydata.org/docs/user_guide/io.html#
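A reader and its matching writer can be tried without any file on disk. The sketch below assumes a tiny in-memory CSV (two rows borrowed from the MBA data set used later in this lesson) in place of a real file path:

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for a file on disk
csv_text = "Rank,School,Duration (Months)\n1,INSEAD,12\n2,LBS,21\n"

# Reader: pandas.read_csv builds a DataFrame; header=0 marks row 0 as names
df = pd.read_csv(io.StringIO(csv_text), header=0)
print(df.shape)              # (2, 3)

# Writer: DataFrame.to_csv returns a string when no path is given
round_trip = df.to_csv(index=False)
print(round_trip)
```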
We can read a .csv file into python using the read_csv() function that comes from the pandas library.
Since we imported pandas as pd, we can call read_csv() through our alias, pd. The read_csv() function takes several parameters, only one of which is required: the location of the csv file (i.e., the filepath). We assign the DataFrame that is created to the variable mydata.
See the example below.
DO NOT TRY THIS, YET…
<class 'pandas.core.frame.DataFrame'>
Rank ... Duration (Months)
0 1 ... 12
1 2 ... 21
2 3 ... 21
3 4 ... 21
4 5 ... 21
5 6 ... 22
6 7 ... 18
7 8 ... 16
8 9 ... 21
9 10 ... 24
10 11 ... 21
11 12 ... 18
12 13 ... 22
13 14 ... 17
14 15 ... 21
15 16 ... 12
16 17 ... 21
17 18 ... 22
18 19 ... 12
19 20 ... 18
20 21 ... 22
21 22 ... 16
22 23 ... 18
23 24 ... 20
24 25 ... 24
[25 rows x 11 columns]
You can read the full read_csv() documentation here: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-read-csv-table
With all data files, it’s important to understand how the data is organized. All variables should be columns and all observations should be stored as rows. Usually, csv files have a header. A header is a row that contains the column names; it is typically the first row of the file. Therefore, as best practice, you should specify header=0 (since python starts counting at 0) to indicate that the first row of data contains the variable names.
Using Google Colab to import data in python
Before attempting to replicate the code above, you will need to download the mba_2021.csv file from http://becomingvisual.com/python4data/mba_2021.csv to your computer.
Next, implement one of the options below to read in the mba_2021.csv data.
Option 1: Read in file directly from a URL
import pandas as pd
mydata = pd.read_csv("http://becomingvisual.com/python4data/mba_2021.csv", header=0)
Option 2: Upload a file and then read it in
##Code to allow you to directly upload your .csv
from google.colab import files
uploaded = files.upload()
type(uploaded) #uploaded is a dictionary
uploaded.keys() #dict_keys(['mba_2021.csv'])
##Modify read_csv to reference the .csv uploaded above
import pandas as pd
import io
mydata = pd.read_csv(io.BytesIO(uploaded['mba_2021.csv']), header=0)
Viewing your data
Once you have imported your data, take a look at it. You can see a preview by typing in the name of your DataFrame, mydata.
Rank ... Duration (Months)
0 1 ... 12
1 2 ... 21
2 3 ... 21
3 4 ... 21
4 5 ... 21
5 6 ... 22
6 7 ... 18
7 8 ... 16
8 9 ... 21
9 10 ... 24
10 11 ... 21
11 12 ... 18
12 13 ... 22
13 14 ... 17
14 15 ... 21
15 16 ... 12
16 17 ... 21
17 18 ... 22
18 19 ... 12
19 20 ... 18
20 21 ... 22
21 22 ... 16
22 23 ... 18
23 24 ... 20
24 25 ... 24
[25 rows x 11 columns]
Alternatively, you can use methods such as .head() or .tail() to show the beginning and end of the DataFrame, respectively.
The .head() method by default outputs the first five rows of your DataFrame. You call a method by using the mydata object followed by a period and then the method name. See below.
Rank School ... Total Tuition ($) Duration (Months)
0 1 INSEAD ... 83832 12
1 2 LBS ... 119359 21
2 3 University of Chicago (Booth) ... 111855 21
3 4 IESE ... 107049 21
4 5 Yale ... 104752 21
[5 rows x 11 columns]
.head() outputs the first five rows of your DataFrame by default, but we could pass a number as well: mydata.head(10) would output the top ten rows, for example.
Rank School ... Total Tuition ($) Duration (Months)
0 1 INSEAD ... 83832 12
1 2 LBS ... 119359 21
2 3 University of Chicago (Booth) ... 111855 21
3 4 IESE ... 107049 21
4 5 Yale ... 104752 21
5 6 Northwestern (Kellogg) ... 111658 22
6 7 Ceibs ... 126001 18
7 8 HEC Paris ... 116713 16
8 9 Duke University (Fuqua) ... 97554 21
9 10 Dartmouth College (Tuck) ... 118047 24
[10 rows x 11 columns]
To see the last five rows use .tail(). .tail() also accepts a number, and in this case, we are printing the last three rows.
Rank School ... Total Tuition ($) Duration (Months)
22 23 Indian School of Business ... 50000 18
23 24 USC (Marshall) ... 102964 20
24 25 WASHU (Olin) ... 69461 24
[3 rows x 11 columns]
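The previews above come from calling .head() and .tail() on mydata. A minimal sketch, assuming a small stand-in DataFrame built from the first five rows and four of the columns shown above instead of the full file:

```python
import pandas as pd

# Stand-in for mydata: the first five rows and four of the columns
# from the preview above (the real file has 25 rows and 11 columns)
mydata = pd.DataFrame({
    "Rank": [1, 2, 3, 4, 5],
    "School": ["INSEAD", "LBS", "University of Chicago (Booth)", "IESE", "Yale"],
    "Total Tuition ($)": [83832, 119359, 111855, 107049, 104752],
    "Duration (Months)": [12, 21, 21, 21, 21],
})

first_two = mydata.head(2)   # first n rows
last_three = mydata.tail(3)  # last n rows

print(first_two)
print(last_three)
```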
Getting info about your data
The .info() method provides the essential details about your data set, such as the number of rows and columns, the number of non-null values, what type of data is in each column, and how much memory your DataFrame is using.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Rank 25 non-null int64
1 School 25 non-null object
2 Country 25 non-null object
3 AvgSalary 25 non-null int64
4 PreSalary 25 non-null int64
5 GradJobs 25 non-null int64
6 PhD 25 non-null int64
7 Avg. Age of Students 25 non-null int64
8 Avg. work exp. of students (Months) 25 non-null int64
9 Total Tuition ($) 25 non-null int64
10 Duration (Months) 25 non-null int64
dtypes: int64(9), object(2)
memory usage: 2.3+ KB
A fast and useful attribute is .shape, which outputs a tuple that shows the number of rows and columns:
(25, 11)
Notice that we use .shape without parentheses. This is because we are accessing an attribute of our DataFrame object rather than calling a method. As a reminder, an attribute is a value (a characteristic). Think of an attribute as a variable that is stored within an object.
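The difference between calling a method and reading an attribute can be sketched as follows, assuming a two-row stand-in for mydata built from values in the preview shown earlier:

```python
import pandas as pd

# Stand-in built from two rows of the preview shown earlier
mydata = pd.DataFrame({
    "Rank": [1, 2],
    "School": ["INSEAD", "LBS"],
    "Duration (Months)": [12, 21],
})

mydata.info()                # method: called with parentheses, prints a summary

rows, cols = mydata.shape    # attribute: no parentheses, returns a tuple
print(f"{rows} rows and {cols} columns")   # 2 rows and 3 columns
```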
Accessing and renaming columns
Many times data sets will have verbose column names with symbols, upper and lowercase words, spaces, and typos. To make selecting data by column name easier we can spend a little time cleaning up their names.
Here’s how to print the column names of our data set:
The attribute .columns comes in handy if you want to see the column names or rename columns, since it allows for simple copy and paste.
We can set a list of names to the columns to rename them.
Index(['Rank', 'School', 'Country', 'AvgSalary', 'PreSalary', 'GradJobs',
'PhD', 'Avg. Age of Students', 'Avg. work exp. of students (Months)',
'Total Tuition ($)', 'Duration (Months)'],
dtype='object')
mydata.columns = ['Rank', 'School', 'Country', 'Average_salary', 'Pre_Salary', 'Grad_Jobs', 'PhD', 'Avg_Age_Students',
'Avg_Work_Experience', 'Tuition', 'Duration']
What if we wanted to make all the column names lowercase? We could repeat what we did above, but there’s an easier way. Instead of renaming each column manually, we can use a list comprehension (a more advanced concept, but try the code below anyway):
Index(['rank', 'school', 'country', 'average_salary', 'pre_salary',
'grad_jobs', 'phd', 'avg_age_students', 'avg_work_experience',
'tuition', 'duration'],
dtype='object')
The code [col.lower() for col in mydata] is quite simple. Iterating over a DataFrame yields its column names, so on each pass the variable col holds one column name as a string. You can call it pineapples; the name doesn’t matter. We then call the .lower() method from the string class on it. The for clause is what traverses the column names in the DataFrame so that each one can be changed to lowercase lettering.
Aside: List comprehensions are used for creating new lists from other iterables. A list comprehension consists of square brackets containing the expression to be evaluated for each element, followed by a for clause that iterates over the elements. The square brackets signify that the output is a list.
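Both renaming approaches can be sketched on a small stand-in DataFrame. The three columns below are taken from the column listing above; the shortened name Duration follows the renaming shown earlier:

```python
import pandas as pd

# Stand-in with three of the original column names
mydata = pd.DataFrame({
    "Rank": [1, 2],
    "School": ["INSEAD", "LBS"],
    "Duration (Months)": [12, 21],
})

print(mydata.columns)

# Rename by assigning a list with one name per column
mydata.columns = ["Rank", "School", "Duration"]

# Lowercase every name with a list comprehension
mydata.columns = [col.lower() for col in mydata]
print(list(mydata.columns))   # ['rank', 'school', 'duration']
```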
Accessing and renaming rows
Now that we looked at how to access and rename columns, let’s look at the rows in our data.
To see the names of the rows we look at the index. The index, by default, begins at zero. Therefore the index of the first row will be 0.
mydata.index.values will return the names of your index (or rows). Let’s try it.
RangeIndex(start=0, stop=25, step=1)
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24])
To rename the index, you can 1) set another column, such as Rank, to be the index when you read in the data, or 2) manually assign a list of equal length to the index by updating it using .index.
Setting another column as the index when you read in data
mydata = pd.read_csv("http://becomingvisual.com/python4data/mba_2021.csv", header=0, index_col="Rank") #set index to Rank. Note since we reimported our data, rank is now in mixed case lettering, not lower case.
#change to lowercase
mydata.columns = [col.lower() for col in mydata]
print(mydata.index.values)
[ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
25]
mydata = pd.read_csv("http://becomingvisual.com/python4data/mba_2021.csv", header=0)
#change to lowercase
mydata.columns = [col.lower() for col in mydata]
mydata.index =mydata["rank"]
print(mydata.index.values)
[ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
25]
rank 1
school INSEAD
country France/Singapore
avgsalary 188432
presalary 96
gradjobs 83
phd 98
avg. age of students 29
avg. work exp. of students (months) 44
total tuition ($) 83832
duration (months) 12
Name: 1, dtype: object
Updating the index using .index
mydata = pd.read_csv("http://becomingvisual.com/python4data/mba.csv", header=0)
#change to lowercase
mydata.columns = [col.lower() for col in mydata]
mydata.index = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25]
mydata.index.values
array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23, 24, 25])
Note: The index value for each row should be unique.
DataFrame slicing by column
Just as we did with Lists in lesson 3, we can extract subsets of a DataFrame.
You can extract a single column from mydata by using square brackets like this:
When we extract a column using single square brackets, the result is a new data structure called a Series.
<class 'pandas.core.series.Series'>
“A Series is a one-dimensional array-like object containing an array of data (of NumPy data type) and an associated array of data labels, called its index” (McKinney, 2013, p. 112). Series are similar to Lists.
To extract a column as a DataFrame, you need to pass a list of column names. In our case that’s just a single column:
<class 'pandas.core.frame.DataFrame'>
To extract more than one column, simply add it to the List of items to extract.
school rank
1 Chicago (Booth) 1
2 Dartmouth (Tuck) 2
3 Virginia (Darden) 3
4 Harvard 4
5 Columbia 5
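The three kinds of column extraction described above can be sketched together. The stand-in DataFrame below assumes only the five rows and two columns visible in the output above:

```python
import pandas as pd

# Stand-in using the five rows shown in the output above
mydata = pd.DataFrame(
    {"rank": [1, 2, 3, 4, 5],
     "school": ["Chicago (Booth)", "Dartmouth (Tuck)", "Virginia (Darden)",
                "Harvard", "Columbia"]},
    index=[1, 2, 3, 4, 5],
)

one_col_series = mydata["school"]         # single brackets -> Series
one_col_frame = mydata[["school"]]        # list of names -> DataFrame
two_cols = mydata[["school", "rank"]]     # several columns, in the order listed

print(type(one_col_series))
print(type(one_col_frame))
print(two_cols)
```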
DataFrame extracting by row
For rows, we have two options:
- .loc locates rows by name (the name is the default integer index unless you assign another).
- .iloc locates rows by numerical position.
Let’s try using .loc. Below we use the label 12 to reference the school ranked 12th.
rank 12
school London
country Britain
avgsalary 118514
presalary 68
gradjobs 93
phd 100
avg. age of students 29
avg. work exp. of students (months) 60
total tuition ($) 92144
duration (months) 15
Name: 12, dtype: object
To extract the 13th row with .iloc, we would use 12 as the position, because positions start at 0.
rank 13
school Pennsylvania (Wharton)
country US
avgsalary 118024
presalary 50
gradjobs 95
phd 100
avg. age of students 28
avg. work exp. of students (months) 55
total tuition ($) 107852
duration (months) 21
Name: 13, dtype: object
To select rows 3 (position 2) through 5 (position 4) with .iloc, use the colon. Note, just as with Lists, the range is exclusive of the last value. Therefore, to extract positions 2 through 4, 2:5 must be the parameter.
rank school ... total tuition ($) duration (months)
3 3 Virginia (Darden) ... 107800 21
4 4 Harvard ... 107000 18
5 5 Columbia ... 111736 20
[3 rows x 11 columns]
The .loc indexer, by contrast, is inclusive of the last value in the range.
rank school ... total tuition ($) duration (months)
1 1 Chicago (Booth) ... 106800 21
2 2 Dartmouth (Tuck) ... 106980 21
3 3 Virginia (Darden) ... 107800 21
4 4 Harvard ... 107000 18
5 5 Columbia ... 111736 20
[5 rows x 11 columns]
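The contrast between label-based and position-based selection can be sketched on a stand-in DataFrame. It assumes only the five rows shown above, indexed by rank, with the school and duration columns:

```python
import pandas as pd

# Stand-in using the five rows shown above, indexed by rank (1..5)
mydata = pd.DataFrame(
    {"school": ["Chicago (Booth)", "Dartmouth (Tuck)", "Virginia (Darden)",
                "Harvard", "Columbia"],
     "duration": [21, 21, 21, 18, 20]},
    index=[1, 2, 3, 4, 5],
)

print(mydata.loc[3, "school"])     # label 3      -> Virginia (Darden)
print(mydata.iloc[3]["school"])    # position 3   -> Harvard

by_position = mydata.iloc[2:5]     # positions 2, 3, 4 (end excluded)
by_label = mydata.loc[1:5]         # labels 1 through 5 (end included)
print(len(by_position), len(by_label))   # 3 5
```

Because the index here starts at 1 while positions start at 0, the same number selects different rows with .loc and .iloc.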
Adding a new column to a DataFrame
To add a new column to an existing DataFrame use the square brackets to define the new column and assign the values to the column as shown below.
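A minimal sketch of adding a column, assuming a three-row stand-in with tuition and duration values from the previews above. The column name tuition_per_month is hypothetical and chosen only for illustration:

```python
import pandas as pd

# Stand-in using tuition and duration values from the previews above
mydata = pd.DataFrame({
    "school": ["Virginia (Darden)", "Harvard", "Columbia"],
    "tuition": [107800, 107000, 111736],
    "duration": [21, 18, 20],
})

# "tuition_per_month" is a hypothetical column name chosen for illustration
mydata["tuition_per_month"] = mydata["tuition"] / mydata["duration"]

print(mydata)
```

The calculation is applied element by element, so each row gets its own value.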
4.5.2 Dataframe attributes
Attribute name | Description |
---|---|
.shape | Returns a tuple that shows the number of rows and columns |
.index | Returns the index range |
.columns | Returns the names of each column |
.dtypes | Returns the data type for each column |
4.5.3 Dataframe Methods
Method name | Description |
---|---|
head() | Select the first n rows |
tail() | Select the last n rows |
nsmallest() | Select and order the bottom n entries |
nlargest() | Select and order the top n entries |
sample() | Randomly select n rows |
type() | Use type(DataFrame) to check the type of object (a built-in function rather than a DataFrame method) |
info() | Learn the data types of your columns, number of rows, non-null values |
describe() | Run summary statistics on your data |
sort_values() | Order rows by the values of a column (low to high by default; pass ascending=False for high to low) |
rename() | Renames columns of a DataFrame |
drop() | Drops columns from a DataFrame |
copy() | Creates a copy of the DataFrame, rather than a slice |
4.5.4 Dataframe subsetting techniques
Technique | Description |
---|---|
df[["column_name1", "column_name2"]] | selects one or more columns |
df.column_name1 or df["column_name1"] | selects a single column |
df.at[1,"Poker"] | selects a single value by row label and column name |
df.iat[1, 2] | selects a single value by row position and column position |
df.iloc[:, [1,2,5]] | selects all rows and the columns at the given positions |
df.loc[:, "Slot":"Bac"] | selects all rows and the columns in the given range of names |
4.6 Summary
- Python is an object-oriented (OO) programming language. Objects are created from classes. A class defines the methods, data, and attributes related to an object of that class type.
- The datetime module allows you to retrieve the current date and time and create your own datetime objects.
- Data files can be read into python as a DataFrame using the read_csv() function from the pandas library.
- A DataFrame is a data structure for storing data in a table-like format.
- DataFrame comes from the pandas library.
- The key methods and attributes for deriving information about a DataFrame include: .info(), .shape, .columns, .index, .index.values, and .dtypes.
- Slicing a DataFrame with single square brackets converts the new subset into a data structure of type Series, as in subset = mydf[1].
- Use double square brackets (i.e. an index and a list) to subset DataFrames and return objects of type DataFrame, as in subset = mydf[[1]].
- Use .iloc to extract (or slice) a range of rows from a DataFrame based on the index. .iloc is exclusive of the last element. Alternatively, use .loc to extract (or slice) a range of rows from a DataFrame based on the row name. .loc is inclusive of the last element, however.
- It’s best to use .copy() when you want to extract a subset of a dataframe independent of the original dataframe.
- In most cases the row name will be the same as the index. The index can be changed by reassignment using mydf.index or by specifying the index upon reading in the data.
Resources
Python Software Foundation (2019). Introduction. Available at: https://docs.python.org/3/library/intro.html
Python Software Foundation (2019): Functions. Available at: https://docs.python.org/3/library/functions.html
Basics of OO http://openbookproject.net/thinkcs/python/english3e/classes_and_objects_I.html
Exercise 4.1
- Download the portfolio.csv file from http://becomingvisual.com/python4data/portfolio.csv and import it into Python as a DataFrame named portfolio.
- Use the appropriate method to show the data types for each column.
- Use the appropriate method to show the number of columns and rows in the portfolio DataFrame. Write a pretty sentence that uses this data that reads: There are 14 rows and 4 columns in the portfolio DataFrame.
- Rename the first column in the DataFrame to Security.
Exercise 4.2
- You decide to purchase 5 more shares of V. Change the number of V shares to 55 (hint: row 9 of the Shares column; df.loc[row #, 'column name'] = new value).
- Add a new column to your DataFrame that calculates the 'market_value' of each security (hint: portfolio.Price*portfolio.Shares).
- Use the .sum() function to sum the total of your new market_value column to find the value of your portfolio, and call this variable 'total_value'.
Assignment 4
Part 1: Dataframes
Download the windenergy.csv file from http://becomingvisual.com/python4data/windenergy.csv and import it into python. This data breaks down the utility-scale of wind turbines per state in the US.
Print out the top 10 states utilizing the most wind power by rank.
Create a new DataFrame that pulls in State, Ranking, Installed Capacity, Num of Turbines, and Homes Powered.
Print out the data for our great state of New York in a pretty sentence.
Part 2. Classes and objects
This question is not based on the data above.
- Go to the USGS website: https://eerscmap.usgs.gov/uswtdb/api-doc/#keyValue to learn more about the API they provide.
- Use the following code to read in the data from the API.
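import requests
import pandas as pd
url = 'https://eersc.usgs.gov/api/uswtdb/v1/turbines'
# request the data and decode the JSON response
results = requests.get(url).json()
# convert the decoded records to a DataFrame
results = pd.DataFrame(results)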
Create a class called turbines with two default parameters: location and turbine_id. In your class definition, include at least four methods: getLocation, setLocation, getYear, and setYear.
Create 3 turbine objects with the required attributes, and assign the year attribute. Select three turbine records from your results to create your three turbine objects. Provide meaningful output each time you create an instance (object) and set the attributes. Return the values for all attributes for each object. Use the variables t_state for your location attribute, case_id for the turbine_id attribute, and p_year for the year attribute.