Lesson 4 Objects and Data Structures II

Welcome to lesson 4. In this lesson, you will be introduced to the fundamentals of object-oriented programming. We’ll discuss python’s standard library, which comes pre-installed, and a third-party library called pandas. Next, you’ll learn how to import and view external data sets.

Follow along with this tutorial by creating a new ipython notebook named lesson04.ipynb and entering the code snippets presented.

Outline

  • Object-oriented programming
  • Libraries and modules
  • The datetime module
  • Pandas
  • Pandas DataFrames
  • Summary
  • Exercise 4.1
  • Exercise 4.2
  • Assignment 4

4.1 Object-oriented programming

Python is an object-oriented (OO) programming language. The python code we have written up to this point has been in the form of procedural or top-down programming. Essentially, we’ve written short programs in one file (notebook) that achieve a task (or two) in a step-by-step sequence.

In writing those programs, we have referenced and created many objects. We’ve been able to modify the attributes of those objects by using methods or functions. For example, we’ve created variables of type str. Strings are a type of object (and data type).

For example, myname = "Kristen" creates a str object with the value "Kristen" and assigns it to the variable myname. You can then call string methods on it, such as myname.upper(). Because strings are immutable, .upper() does not change the original object; it returns a new, uppercase string. Assigning the string to the variable created, or instantiated, the string object. Once an object is instantiated (from the str class), you can use the methods that the class defines to work with it.
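The short sketch below inspects a string object to make this concrete. The variable name shouted is only an illustrative choice.

```python
myname = "Kristen"           # instantiates a str object
print(type(myname))          # <class 'str'>

shouted = myname.upper()     # .upper() returns a new string
print(shouted)               # KRISTEN
print(myname)                # Kristen (the original is unchanged)
```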

To ground your understanding, review these four terms (TK (2017), para 1). We’ll begin to use them more frequently throughout the course.

Terminology

  1. Objects: a representation of real-world objects like books, cars, cats, etc. The objects share two main characteristics: data and behavior.

  2. Attributes: the data or information about the object. For example, cars have data like the number of wheels, number of doors, and seating capacity. An attribute is a value (characteristic). Think of an attribute as a variable that is stored within an object.

  3. Methods: the behavior of an object. For example, cars accelerate, stop, and show how much fuel is left. In this course, we’ve been using the term function. For our purposes, let’s assume methods are synonymous with functions.

  4. Class: a blueprint for an object that defines the attributes and behaviors (methods). An object is an instance of a class. In the real world, we often find many objects of the same type, such as cars of the same make and model, which all have an engine, wheels, doors, and so on. Each car was built from the same set of blueprints and has the same components.
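To make the four terms concrete, here is a minimal sketch of a class. The Car class, its attributes, and its accelerate() method are made up for illustration; they do not come from any library.

```python
class Car:
    """A blueprint (class) for car objects."""

    def __init__(self, wheels, doors, seats):
        # Attributes: data stored within each object
        self.wheels = wheels
        self.doors = doors
        self.seats = seats
        self.speed = 0

    def accelerate(self, amount):
        # Method: behavior that changes the object's data
        self.speed = self.speed + amount


# Two objects (instances) built from the same blueprint
sedan = Car(wheels=4, doors=4, seats=5)
coupe = Car(wheels=4, doors=2, seats=4)

sedan.accelerate(30)
print(sedan.speed)   # 30
print(coupe.speed)   # 0
print(coupe.doors)   # 2
```

Each object keeps its own attribute values, which is why accelerating sedan leaves coupe unchanged.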

As you progress in your learning of python, you will build your own classes that define the attributes and behaviors of objects. For now, let’s use classes from python libraries that already exist to build objects.

4.2 Libraries and Modules

“The Python library contains several different kinds of components. It contains data types that would normally be considered part of the core of a language, such as numbers and lists. For these types, the Python language core defines the form of literals and places some constraints on their semantics but does not fully define the semantics. (On the other hand, the language core does define syntactic properties like the spelling and priorities of operators.) The library also contains built-in functions and exceptions — objects that can be used by all Python code without the need of an import statement” (Python Software Foundation, 2019, para 1.)

In other words, in python there is a set of built-in functions. As you recall, you can find the full list at https://docs.python.org/3/library/functions.html and the full python standard library at https://docs.python.org/3/library/.

However, python is a growing language with many contributors creating libraries (or modules) of new classes that go beyond the set of built-in functionality. In addition to the libraries that come pre-installed, there are additional libraries that can be installed and used. To use these libraries (after installation) you must reference them explicitly using the import statement.

For example, let’s look at the datetime module. The datetime module comes built into python, so there is no need to install it separately. The code below uses the import statement to reference the datetime module. This gives us access to the classes and methods from the module, such as the date class and its today() method.

import datetime
today=datetime.date.today()
print(today)
## 2023-09-15

4.3 The datetime module

“Date formatting is one of the most important tasks that you will face as a programmer. Different regions around the world have different ways of representing dates/times, therefore your goal as a programmer is to present the date values in a way that is readable to the users” (https://stackabuse.com/how-to-format-dates-in-python/, 2020, para 2).

Several classes exist in the datetime module. They are described in the python documentation at: https://docs.python.org/3/library/datetime.html#module-datetime

The datetime module includes the following classes, among others:

  • date – An idealized date based on the Gregorian calendar, with the attributes year, month, and day.

  • time – An idealized time, independent of any particular day, assuming that every day has exactly 24*60*60 seconds. Its attributes are hour, minute, second, microsecond, and tzinfo.

  • datetime – A combination of a date and a time, with the attributes year, month, day, hour, minute, second, microsecond, and tzinfo.

  • timedelta – A duration expressing the difference between two date, time, or datetime instances to microsecond resolution.

Refer to https://www.geeksforgeeks.org/python-datetime-module-with-examples/ for more information.
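As a rough illustration of these four classes, the sketch below creates one object of each. The dates and times are arbitrary example values, chosen so that the output is predictable.

```python
from datetime import date, time, datetime, timedelta

d = date(2020, 9, 29)                     # year, month, day
t = time(14, 30, 15)                      # hour, minute, second
dt = datetime(2020, 9, 29, 14, 30, 15)    # date and time together

print(d.year, d.month, d.day)             # 2020 9 29
print(t.hour, t.minute, t.second)         # 14 30 15

# Subtracting two dates gives a timedelta
gap = date(2020, 12, 25) - d
print(type(gap).__name__, gap.days)       # timedelta 87

# Adding a timedelta to a datetime gives a new datetime
later = dt + timedelta(days=1, hours=2)
print(later)                              # 2020-09-30 16:30:15
```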

Your python code gains access to the code in another module by the process of importing it. The import statement is the most common way of invoking the datetime module, but it is not the only way.

An alternative is to use the from keyword to name the module and the import statement to bring in only the date class. You can then refer to date directly, without the module name in front, as shown below:

from datetime import date
today=date.today()
print(today)
## 2023-09-15

If you want to change the format of the date returned by the .today() method, you can use a formatting method called .strftime().

For example, you may need to represent a date value numerically like 09-29-2020. Or you may need to write the same date value in a longer textual format like September 29, 2020. In another scenario, you may want to extract the month in string format from a numerically formatted date value.

Let’s extract the current month as a numeric output.

from datetime import date
month=date.today().strftime('%m') 
print("We are in month", month, end=".")
## We are in month 09.

Note: Although this output looks numeric, the month variable holds a string (str), because .strftime() always returns a string.

In the above code, %m is a format code. The strftime() method takes a format string containing one or more format codes as its argument and returns a string formatted accordingly.
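Several format codes can be combined in one format string, along with ordinary characters such as hyphens and commas. The sketch below uses a fixed date rather than today() so that the output is predictable; it assumes the default English locale for the month name.

```python
from datetime import date

d = date(2020, 9, 29)   # a fixed example date

numeric = d.strftime("%m-%d-%Y")
textual = d.strftime("%B %d, %Y")
month_name = d.strftime("%B")

print(numeric)      # 09-29-2020
print(textual)      # September 29, 2020
print(month_name)   # September
```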

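The table below lists the format codes that you can pass to the strftime() method. The codes that contain a hyphen (such as %-d) are platform-specific: they work on Linux and macOS, but not on Windows.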

Format code list for dates and times

Directive | Meaning | Example
%a | Abbreviated weekday name. | Sun, Mon, …
%A | Full weekday name. | Sunday, Monday, …
%w | Weekday as a decimal number. | 0, 1, …, 6
%d | Day of the month as a zero-padded decimal. | 01, 02, …, 31
%-d | Day of the month as a decimal number. | 1, 2, …, 31
%b | Abbreviated month name. | Jan, Feb, …, Dec
%B | Full month name. | January, February, …
%m | Month as a zero-padded decimal number. | 01, 02, …, 12
%-m | Month as a decimal number. | 1, 2, …, 12
%y | Year without century as a zero-padded decimal number. | 00, 01, …, 99
%-y | Year without century as a decimal number. | 0, 1, …, 99
%Y | Year with century as a decimal number. | 2013, 2019 etc.
%H | Hour (24-hour clock) as a zero-padded decimal number. | 00, 01, …, 23
%-H | Hour (24-hour clock) as a decimal number. | 0, 1, …, 23
%I | Hour (12-hour clock) as a zero-padded decimal number. | 01, 02, …, 12
%-I | Hour (12-hour clock) as a decimal number. | 1, 2, …, 12
%p | Locale’s AM or PM. | AM, PM
%M | Minute as a zero-padded decimal number. | 00, 01, …, 59
%-M | Minute as a decimal number. | 0, 1, …, 59
%S | Second as a zero-padded decimal number. | 00, 01, …, 59
%-S | Second as a decimal number. | 0, 1, …, 59
%f | Microsecond as a decimal number, zero-padded on the left. | 000000 - 999999
%z | UTC offset in the form +HHMM or -HHMM. |
%Z | Time zone name. |
%j | Day of the year as a zero-padded decimal number. | 001, 002, …, 366
%-j | Day of the year as a decimal number. | 1, 2, …, 366
%U | Week number of the year (Sunday as the first day of the week). All days in a new year preceding the first Sunday are considered to be in week 0. | 00, 01, …, 53
%W | Week number of the year (Monday as the first day of the week). All days in a new year preceding the first Monday are considered to be in week 0. | 00, 01, …, 53
%c | Locale’s appropriate date and time representation. | Mon Sep 30 07:06:05 2013
%x | Locale’s appropriate date representation. | 09/30/13
%X | Locale’s appropriate time representation. | 07:06:05
%% | A literal ‘%’ character. | %

To learn more go to: https://www.programiz.com/python-programming/datetime/strftime
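The hyphenated codes such as %-d and %-I depend on the operating system and are not available on Windows. One portable alternative, sketched below, is to read the integer attributes of the object and format them yourself with an f-string. The fixed datetime value and the variable names are only for illustration.

```python
from datetime import datetime

dt = datetime(2020, 9, 8, 15, 5, 0)   # fixed value for a predictable result

# Unpadded day and month come straight from the integer attributes
short_date = f"{dt.month}/{dt.day}/{dt.year}"
print(short_date)    # 9/8/2020

# 12-hour clock without padding: 0 becomes 12 and 13..23 become 1..11
hour12 = dt.hour % 12 or 12
clock = f"{hour12}:{dt.minute:02d} {dt.strftime('%p')}"
print(clock)         # 3:05 PM
```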

Let’s revise our code above and print out the actual name of the month.

from datetime import date
current_month_text = date.today().strftime('%B') 

print("We are in the month of", current_month_text, end=".")
## We are in the month of September.

I encourage you to try out working with the other date format codes.

What if you wanted to access the time or the date AND the time? In addition to the date class, we can also work with the datetime class to extract the date and time, and the pytz module to set our current timezone.

# importing datetime module for now()  
import datetime  
    
# using now() to get current time  
current_time = datetime.datetime.now()  
print(current_time)
## 2023-09-15 11:46:08.079488

If you run this code on Google Colab, you may not notice that the output is incorrect. You will see a full date, but the time may not match your local time. This is Greenwich Mean Time (GMT), not the time in your timezone. To localize the time, you’ll need to set the timezone using the pytz module and its timezone() function.

# importing datetime module for now()  
import datetime  
# for timezone()  
import pytz  
# using now() to get current time  
current_time = datetime.datetime.now(pytz.timezone("US/Eastern"))
print(current_time)
print(type(current_time))
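To see what localizing does without depending on the current moment, the sketch below converts one fixed UTC time using only the standard library. It is a simplification: the four-hour offset is an assumed value, and unlike a named zone such as "US/Eastern" a fixed offset does not follow daylight saving changes.

```python
from datetime import datetime, timedelta, timezone

# A fixed moment in UTC (assumed example value)
utc_time = datetime(2023, 9, 15, 15, 46, 8, tzinfo=timezone.utc)

# Hypothetical fixed offset of four hours behind UTC
eastern_offset = timezone(timedelta(hours=-4))
local_time = utc_time.astimezone(eastern_offset)

print(utc_time)          # 2023-09-15 15:46:08+00:00
print(local_time)        # 2023-09-15 11:46:08-04:00
print(local_time.hour)   # 11
```

Both objects describe the same instant; only the clock reading differs.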

You can access the attributes of the datetime object returned by datetime.now().

The attributes of this object include:

  • year
  • month
  • day
  • hour
  • minute
  • second
  • microsecond

You can reference these attributes by writing the name of your datetime object, a period, and the attribute name. For example, current_time.hour. See below.

# importing datetime module for now()  
import datetime  
import pytz  
current_time = datetime.datetime.now(pytz.timezone("US/Eastern"))
print ("The time is: ", current_time.hour,":",current_time.strftime('%M'), current_time.strftime('%p'), sep="", end=".")

You’ll notice I only call the .hour attribute and use .strftime to properly format the minutes and to print AM or PM. But something may still look off. The time is based on a 24-hour clock, rather than a 12-hour clock, which many of us are accustomed to seeing when the AM/PM designation is provided.

Try it yourself

Print out the current time on a 12-hour clock, with no zero-padding for the hour and with zero-padding for the minutes. Add whether it is AM or PM.

# importing datetime module for now()  
import datetime  
import pytz  
current_time = datetime.datetime.now(pytz.timezone("US/Eastern"))
print ("The time is: ", current_time.strftime('%-I'),":",current_time.strftime('%M'), current_time.strftime('%p'), sep="", end=".")

Converting back to a date format

There will be times when you have data that looks like a date and time but is stored as a string. You will need to convert (cast) these strings to proper datetime objects (i.e. the datetime data type). Your string must match the format you specify in order to be converted to a datetime object.

from datetime import datetime

dt_string = "09/08/2020 09:15:31"

# Considering date is in dd/mm/yyyy format
dt_object1 = datetime.strptime(dt_string, "%d/%m/%Y %H:%M:%S")
print("dt_object1 =", dt_object1)
## dt_object1 = 2020-08-09 09:15:31
# Considering date is instead in the mm/dd/yyyy format
dt_object2 = datetime.strptime(dt_string, "%m/%d/%Y %H:%M:%S")
print("dt_object2 =", dt_object2)
## dt_object2 = 2020-09-08 09:15:31
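When the string does not match the format, strptime() raises a ValueError. A minimal sketch of handling that case follows; parse_or_none is a hypothetical helper name, not part of the datetime module.

```python
from datetime import datetime


def parse_or_none(text, fmt):
    """Return a datetime, or None if text does not match fmt."""
    try:
        return datetime.strptime(text, fmt)
    except ValueError:
        return None


good = parse_or_none("09/08/2020 09:15:31", "%d/%m/%Y %H:%M:%S")
bad = parse_or_none("2020-09-08", "%d/%m/%Y %H:%M:%S")

print(good)   # 2020-08-09 09:15:31
print(bad)    # None
```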

4.4 Pandas

Another library that we will use extensively in this course is called pandas. It is an open-source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. pandas is already installed on Google Colab. To use it, we need the import statement to import the pandas library. We import it as pd, so pd is the short name (alias) through which we reference the pandas module.

Try it.

import pandas as pd

You can use any name you wish, within the rules of python. For example, you could write:

import pandas as tomatoes

The name you choose becomes an alias for the pandas module. You use it to access the functions and classes that pandas provides. By convention, most pandas code uses pd.

Through pandas, you get acquainted with your data by cleaning, transforming, and analyzing it. For example, say you want to explore a data set stored in a .csv file on your computer. CSV stands for comma-separated values. Pandas will extract the data from that .csv into a DataFrame — a table, basically — then let you do things like:

  • Calculate statistics and answer questions about the data, such as:

      • What’s the average, median, max, or min of each column?

      • Does column A correlate with column B?

      • What does the distribution of data in column C look like?

  • Clean the data by doing things like removing missing values and filtering rows or columns by some criteria

  • Visualize the data with help from Matplotlib. Plot bars, lines, histograms, bubbles, and more.

  • Store the cleaned, transformed data back into a CSV, other file or database (https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/, para 7)
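A few of these tasks can be sketched on a tiny, made-up DataFrame. The column names A, B, and C and all of the values are hypothetical; None marks a missing value.

```python
import pandas as pd

# Hypothetical data; None marks a missing value
df = pd.DataFrame({
    "A": [1, 2, 3, 4],
    "B": [2, 4, 6, 8],
    "C": [10.0, None, 30.0, 40.0],
})

print(df["A"].mean())          # 2.5
print(df["A"].corr(df["B"]))   # correlation between A and B

cleaned = df.dropna()          # remove rows with missing values
print(len(cleaned))            # 3

filtered = df[df["A"] > 2]     # keep rows where A is greater than 2
print(filtered["B"].tolist())  # [6, 8]
```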

Check out the pandas API documentation that provides a reference and explanation of all the public pandas objects, functions, and methods. https://pandas.pydata.org/pandas-docs/stable/reference/index.html

4.5 DataFrames

A DataFrame is a two-dimensional tabular, column-oriented data structure with both row and column labels (McKinney, p.4) that comes from the pandas library.

In python, we can import and store data as a DataFrame object.

  • pandas’ I/O tools consist of a set of readers and writers.
  • pandas supports many file formats and data sources out of the box (csv, excel, sql, json,…).
  • Data is imported from each of these sources by a reader function with the prefix pandas.read_*.
  • The corresponding writer functions are object methods that are accessed like DataFrame.to_*().
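The pairing of a writer method and a reader function can be sketched as a round trip that stays in memory. The two-row DataFrame is hypothetical, and io.StringIO stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical data standing in for a real data set
df = pd.DataFrame({"school": ["A", "B"], "rank": [1, 2]})

# Writer: a DataFrame method; with no path it returns the csv text
csv_text = df.to_csv(index=False)
print(csv_text)

# Reader: a pandas function; StringIO lets it read text like a file
df_again = pd.read_csv(io.StringIO(csv_text))
print(df_again.equals(df))   # True
```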

4.5.1 Readers and writers

The most common readers and writers that we will use include:

Format Type | Data Description | Reader | Writer
text | CSV | read_csv | to_csv
text | JSON | read_json | to_json
text | HTML | read_html | to_html
text | XML | read_xml | to_xml
binary | MS Excel | read_excel | to_excel
SQL | SQL | read_sql | to_sql

For a full list: https://pandas.pydata.org/docs/user_guide/io.html

We can read a .csv file into python using the read_csv() function that comes from the pandas library.

Since we imported pandas as pd, we can call read_csv() through the pd alias. The read_csv() function takes many parameters, only one of which is required: the location of the csv file (i.e. the file path). We assign the DataFrame that it returns to the variable mydata.

See the example below.

DO NOT TRY THIS, YET…

import pandas as pd
mydata = pd.read_csv("../mba_2021.csv") #basic import
type(mydata)
<class 'pandas.core.frame.DataFrame'>
mydata
    Rank  ... Duration (Months)
0      1  ...                12
1      2  ...                21
2      3  ...                21
3      4  ...                21
4      5  ...                21
5      6  ...                22
6      7  ...                18
7      8  ...                16
8      9  ...                21
9     10  ...                24
10    11  ...                21
11    12  ...                18
12    13  ...                22
13    14  ...                17
14    15  ...                21
15    16  ...                12
16    17  ...                21
17    18  ...                22
18    19  ...                12
19    20  ...                18
20    21  ...                22
21    22  ...                16
22    23  ...                18
23    24  ...                20
24    25  ...                24

[25 rows x 11 columns]

You can read the full read_csv() documentation here: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-read-csv-table

With all data files, it’s important to understand how the data is organized. All variables should be columns and all observations should be stored as rows. Usually, csv files have a header: a row that contains the column names. This is typically the first row of the file. By default, read_csv() already treats the first row as the header, but as best practice you should specify header=0 (since python starts counting at 0) to state explicitly that the first row contains the variable names.
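The effect of the header parameter can be seen in a small sketch. The csv text below is a short stand-in for a real file, read through io.StringIO.

```python
import io
import pandas as pd

csv_text = "School,Rank\nINSEAD,1\nLBS,2\n"   # small stand-in for a csv file

with_header = pd.read_csv(io.StringIO(csv_text), header=0)
print(list(with_header.columns))   # ['School', 'Rank']
print(with_header.shape)           # (2, 2)

# header=None treats every row, including the first, as data
no_header = pd.read_csv(io.StringIO(csv_text), header=None)
print(list(no_header.columns))     # [0, 1]
print(no_header.shape)             # (3, 2)
```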

Using Google Colab to import data in python

Before attempting to replicate the code above, you will need to download the mba_2021.csv file from (http://becomingvisual.com/python4data/mba_2021.csv) to your computer.

Next, implement one of the options below to read in the mba_2021.csv data.

Option 1: Read in file directly from a URL

import pandas as pd
mydata = pd.read_csv("http://becomingvisual.com/python4data/mba_2021.csv", header=0)

Option 2: Upload a file and then read it in

##Code to allow you to directly upload your .csv
from google.colab import files
uploaded = files.upload()
type(uploaded) #uploaded is a dictionary
uploaded.keys() #dict_keys(['mba_2021.csv'])

##Modify read_csv to reference the .csv uploaded above
import pandas as pd
import io
mydata = pd.read_csv(io.BytesIO(uploaded['mba_2021.csv']), header=0)

Viewing your data

Once you have imported your data, take a look at it. You can see a preview by typing in the name of your DataFrame mydata.

mydata
    Rank  ... Duration (Months)
0      1  ...                12
1      2  ...                21
2      3  ...                21
3      4  ...                21
4      5  ...                21
5      6  ...                22
6      7  ...                18
7      8  ...                16
8      9  ...                21
9     10  ...                24
10    11  ...                21
11    12  ...                18
12    13  ...                22
13    14  ...                17
14    15  ...                21
15    16  ...                12
16    17  ...                21
17    18  ...                22
18    19  ...                12
19    20  ...                18
20    21  ...                22
21    22  ...                16
22    23  ...                18
23    24  ...                20
24    25  ...                24

[25 rows x 11 columns]

Alternatively, you can use methods such as .head() or .tail() to show the beginning and end of the data respectively.

The .head() method by default outputs the first five rows of your DataFrame. You call a method by writing the mydata object followed by a period and then the method name. See below.

mydata.head() # outputs the first five rows
   Rank                         School  ... Total Tuition ($)  Duration (Months)
0     1                         INSEAD  ...             83832                 12
1     2                            LBS  ...            119359                 21
2     3  University of Chicago (Booth)  ...            111855                 21
3     4                           IESE  ...            107049                 21
4     5                           Yale  ...            104752                 21

[5 rows x 11 columns]

You can also pass a number to .head(): mydata.head(10) outputs the first ten rows, for example.

mydata.head(10) # outputs the first ten rows
   Rank                         School  ... Total Tuition ($)  Duration (Months)
0     1                         INSEAD  ...             83832                 12
1     2                            LBS  ...            119359                 21
2     3  University of Chicago (Booth)  ...            111855                 21
3     4                           IESE  ...            107049                 21
4     5                           Yale  ...            104752                 21
5     6         Northwestern (Kellogg)  ...            111658                 22
6     7                          Ceibs  ...            126001                 18
7     8                      HEC Paris  ...            116713                 16
8     9        Duke University (Fuqua)  ...             97554                 21
9    10       Dartmouth College (Tuck)  ...            118047                 24

[10 rows x 11 columns]

To see the last five rows, use .tail(). tail() also accepts a number; in this case, we are printing the last three rows.

mydata.tail(3) # outputs the last three rows
    Rank                     School  ... Total Tuition ($)  Duration (Months)
22    23  Indian School of Business  ...             50000                 18
23    24             USC (Marshall)  ...            102964                 20
24    25               WASHU (Olin)  ...             69461                 24

[3 rows x 11 columns]

Getting info about your data

The .info() method provides the essential details about your data set, such as the number of rows and columns, the number of non-null values, what type of data is in each column, and how much memory your DataFrame is using.

mydata.info() 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25 entries, 0 to 24
Data columns (total 11 columns):
 #   Column                               Non-Null Count  Dtype 
---  ------                               --------------  ----- 
 0   Rank                                 25 non-null     int64 
 1   School                               25 non-null     object
 2   Country                              25 non-null     object
 3   AvgSalary                            25 non-null     int64 
 4   PreSalary                            25 non-null     int64 
 5   GradJobs                             25 non-null     int64 
 6   PhD                                  25 non-null     int64 
 7   Avg. Age of Students                 25 non-null     int64 
 8   Avg. work exp. of students (Months)  25 non-null     int64 
 9   Total Tuition ($)                    25 non-null     int64 
 10  Duration (Months)                    25 non-null     int64 
dtypes: int64(9), object(2)
memory usage: 2.3+ KB

A fast and useful attribute is .shape, which outputs just a tuple that shows the number of rows and columns:

mydata.shape
(25, 11)

Notice that we use .shape without parentheses. This is because we are accessing an attribute of our DataFrame object, not calling a method. As a reminder, an attribute is a value (characteristic). Think of an attribute as a variable that is stored within an object.
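The difference between an attribute and a method shows up in a short sketch. The DataFrame here is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})   # hypothetical data

rows, cols = df.shape      # attribute: no parentheses
print(rows, cols)          # 3 2

first = df.head(2)         # method: parentheses, optional arguments
print(first.shape)         # (2, 2)
```

Because .shape is a tuple, it can be unpacked into separate row and column counts.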

Accessing and renaming columns

Many times data sets will have verbose column names with symbols, upper and lowercase words, spaces, and typos. To make selecting data by column name easier we can spend a little time cleaning up their names.

The attribute .columns holds the column names of our data set. It comes in handy if you want to see the column names or rename columns, since you can copy and paste from its output.

We can assign a list of new names to .columns to rename the columns. The list must contain exactly one name for each column.

mydata.columns # show original column names
Index(['Rank', 'School', 'Country', 'AvgSalary', 'PreSalary', 'GradJobs',
       'PhD', 'Avg. Age of Students', 'Avg. work exp. of students (Months)',
       'Total Tuition ($)', 'Duration (Months)'],
      dtype='object')
mydata.columns = ['Rank', 'School', 'Country', 'Average_salary', 'Pre_Salary', 'Grad_Jobs', 'PhD', 'Avg_Age_Students', 
                     'Avg_Work_Experience', 'Tuition', 'Duration']

What if we wanted to make all the column names lowercase? We could repeat what we did above, but there’s an easier way. Instead of renaming each column manually, we can use a list comprehension (a more advanced concept, but try the code below anyway):

mydata.columns = [col.lower() for col in mydata]
mydata.columns
Index(['rank', 'school', 'country', 'average_salary', 'pre_salary',
       'grad_jobs', 'phd', 'avg_age_students', 'avg_work_experience',
       'tuition', 'duration'],
      dtype='object')

The code [col.lower() for col in mydata] is quite simple. Iterating over a DataFrame yields its column names, so on each pass the variable col holds one column name as a string. You can call it pineapples; the name doesn’t matter. We then call the .lower() method from the string class on each name. The for clause traverses the column names in the DataFrame, and the results are collected into a new list of lowercase names.

Aside: List comprehensions are used for creating a new list from another iterable. A list comprehension consists of square brackets containing the expression to be evaluated for each element, followed by a for clause that iterates over the elements. The square brackets signify that the output is a list.
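To see a list comprehension apart from pandas, the sketch below applies one to a plain list of names and compares it with the equivalent for loop. The variable names are illustrative.

```python
names = ["Rank", "Avg. Age of Students", "Total Tuition ($)"]

# List comprehension
lower_names = [name.lower() for name in names]

# The same result written as an ordinary for loop
lower_loop = []
for name in names:
    lower_loop.append(name.lower())

print(lower_names)
# ['rank', 'avg. age of students', 'total tuition ($)']
print(lower_names == lower_loop)   # True

# The expression can do more work, such as replacing spaces
snake = [name.lower().replace(" ", "_") for name in names]
print(snake)
# ['rank', 'avg._age_of_students', 'total_tuition_($)']
```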

Accessing and renaming rows

Now that we looked at how to access and rename columns, let’s look at the rows in our data.

To see the names of the rows we look at the index. The index, by default, begins at zero. Therefore the index of the first row will be 0.

mydata.index.values will return the labels of your index (or rows). Let’s try it.

mydata.index #returns the range of index values
RangeIndex(start=0, stop=25, step=1)
mydata.index.values # returns the index values
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24])

To rename the index, you can 1) set another column, such as Rank, to be the index when you read in the data, or 2) manually assign a list of equal length to the index using .index.

Setting another column as the index when you read in data

# Set the index to Rank. Since we re-imported the data, the column names
# are in mixed case again, not lowercase.
mydata = pd.read_csv("http://becomingvisual.com/python4data/mba_2021.csv", header=0, index_col="Rank")

#change to lowercase
mydata.columns = [col.lower() for col in mydata]
print(mydata.index.values)
[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25]

# Alternatively, assign a column to the index after reading in the data
mydata = pd.read_csv("http://becomingvisual.com/python4data/mba_2021.csv", header=0)
# change to lowercase
mydata.columns = [col.lower() for col in mydata]

mydata.index =mydata["rank"]

print(mydata.index.values)
[ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 25]
print(mydata.loc[1,:]) #prints the first row and all columns
rank                                                  1
school                                           INSEAD
country                                France/Singapore
avgsalary                                        188432
presalary                                            96
gradjobs                                             83
phd                                                  98
avg. age of students                                 29
avg. work exp. of students (months)                  44
total tuition ($)                                 83832
duration (months)                                    12
Name: 1, dtype: object

Updating the index using .index.

mydata = pd.read_csv("http://becomingvisual.com/python4data/mba.csv", header=0)
# change to lowercase
mydata.columns = [col.lower() for col in mydata]
mydata.index = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25]
mydata.index.values
array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25])

Note: The index value for each row should be unique.
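pandas also provides a set_index() method for this purpose. By default it moves the chosen column out of the regular columns and into the index, whereas assigning mydata.index = mydata["rank"] keeps the column in both places. The sketch below uses a small hypothetical DataFrame rather than the course data.

```python
import pandas as pd

# Hypothetical stand-in for the data set
df = pd.DataFrame({"rank": [1, 2, 3], "school": ["A", "B", "C"]})

indexed = df.set_index("rank")     # returns a new DataFrame
print(indexed.index.tolist())      # [1, 2, 3]
print(list(indexed.columns))       # ['school']
print(indexed.index.is_unique)     # True
print(indexed.loc[2, "school"])    # B
```

The .index.is_unique attribute is a quick way to check that no index value is repeated.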

DataFrame slicing by column

Just as we did with Lists in lesson 3, we can extract subsets of a DataFrame.

You can extract a single column from mydata by using square brackets like this:

school_col = mydata['school'] #use lowercase 

When we extract columns using the square brackets the result is a new data structure called a Series.

type(school_col)
<class 'pandas.core.series.Series'>

“A Series is a one-dimensional array-like object containing an array of data (of NumPy data type) and an associated array of data labels, called its index” (McKinney, 2013, p. 112). Series are similar to Lists.
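A Series can also be built directly, which makes the pairing of data and index labels easier to see. In the sketch below the three values and their short labels are illustrative only.

```python
import pandas as pd

# Illustrative values; the index supplies a label for each one
s = pd.Series([83832, 119359, 111855], index=["INSEAD", "LBS", "Booth"])

print(s["LBS"])      # 119359  (look up by label)
print(s.iloc[0])     # 83832   (look up by position)
print(s.max())       # 119359
print(s.tolist())    # [83832, 119359, 111855]
```

Unlike a List, a Series lets you look up a value by its label as well as by its position.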

To extract a column as a DataFrame, you need to pass a list of column names. In our case that’s just a single column:

school_col = mydata[['school']].copy() 
type(school_col)
<class 'pandas.core.frame.DataFrame'>

To extract more than one column, simply add it to the List of items to extract.

subset_mydata = mydata[['school', 'rank']].copy() 
subset_mydata.head()
              school  rank
1    Chicago (Booth)     1
2   Dartmouth (Tuck)     2
3  Virginia (Darden)     3
4            Harvard     4
5           Columbia     5

DataFrame extracting by row

For rows, we have two options:

  • .loc - locates rows by label. The label is the index value, which is the default 0, 1, 2, … unless you assign a different index.
  • .iloc - locates rows by integer position, counting from 0.

Let’s try using .loc. Below we use the label 12 to reference the school ranked 12th.

nyu = mydata.loc[12].copy() 
nyu
rank                                        12
school                                  London
country                                Britain
avgsalary                               118514
presalary                                   68
gradjobs                                    93
phd                                        100
avg. age of students                        29
avg. work exp. of students (months)         60
total tuition ($)                        92144
duration (months)                           15
Name: 12, dtype: object

With .iloc, positions start at 0. To extract the 13th row, we therefore use 12 as the position.

nyu = mydata.iloc[12].copy() 
nyu
rank                                                       13
school                                 Pennsylvania (Wharton)
country                                                    US
avgsalary                                              118024
presalary                                                  50
gradjobs                                                   95
phd                                                       100
avg. age of students                                       28
avg. work exp. of students (months)                        55
total tuition ($)                                      107852
duration (months)                                          21
Name: 13, dtype: object

To select rows 3 (index 2) through 5 (index 4), use the colon. Note, just as with Lists, the range is exclusive of the last value (e.g. 2:5). Therefore, to extract index 2 through 4, 2:5 must be the parameter.

afewrows = mydata.iloc[2:5].copy() 
afewrows
   rank             school  ... total tuition ($)  duration (months)
3     3  Virginia (Darden)  ...            107800                 21
4     4            Harvard  ...            107000                 18
5     5           Columbia  ...            111736                 20

[3 rows x 11 columns]

In contrast, .loc is inclusive of the last value in the range. Note that .loc is used with square brackets rather than called like a function.

afewrows = mydata.loc[1:5].copy() 
afewrows
   rank             school  ... total tuition ($)  duration (months)
1     1    Chicago (Booth)  ...            106800                 21
2     2   Dartmouth (Tuck)  ...            106980                 21
3     3  Virginia (Darden)  ...            107800                 21
4     4            Harvard  ...            107000                 18
5     5           Columbia  ...            111736                 20

[5 rows x 11 columns]
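The difference between the two indexers is easiest to see side by side. The sketch below is only an illustration: schools is a small hypothetical table (not the course file) whose index is set to start at 1, as the output above suggests for mydata. The school names and durations are copied from the output above.

```python
import pandas as pd

# Hypothetical mini-table; the index starts at 1, like the labels shown above.
schools = pd.DataFrame(
    {"school": ["Chicago (Booth)", "Dartmouth (Tuck)", "Virginia (Darden)",
                "Harvard", "Columbia"],
     "duration (months)": [21, 21, 21, 18, 20]},
    index=[1, 2, 3, 4, 5],
)

# iloc counts positions from 0 and excludes the stop value.
by_position = schools.iloc[2:5].copy()   # positions 2, 3, 4
print(by_position.index.tolist())        # [3, 4, 5]

# loc matches index labels and includes the stop value.
by_label = schools.loc[2:5].copy()       # labels 2, 3, 4, 5
print(by_label.index.tolist())           # [2, 3, 4, 5]

# A single row comes back as a Series whose name is the row label.
third_row = schools.iloc[2].copy()
print(third_row.name)                    # 3
```

The same 2:5 range returns three rows with iloc and four rows with loc, which is why it helps to decide first whether you are thinking in positions or in labels.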

Adding a new column to a DataFrame

To add a new column to an existing DataFrame, use square brackets to name the new column and assign values to it, as shown below.


mydata["newcolumn"] = 0 # every observation for the column will be zero.
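A new column does not have to hold a constant. As a minimal sketch, assuming a small hypothetical table named programs built from three rows shown earlier, the code below adds one constant column and one column computed from existing columns. The column name "tuition per month" is made up for this example and is not part of the data set.

```python
import pandas as pd

# Hypothetical mini-table using two columns from the MBA data.
programs = pd.DataFrame(
    {"school": ["Virginia (Darden)", "Harvard", "Columbia"],
     "total tuition ($)": [107800, 107000, 111736],
     "duration (months)": [21, 18, 20]}
)

# A single value is repeated for every row.
programs["newcolumn"] = 0

# A column can also be computed from existing columns, row by row.
# "tuition per month" is an illustrative name only.
programs["tuition per month"] = (
    programs["total tuition ($)"] / programs["duration (months)"]
)

print(programs[["school", "newcolumn", "tuition per month"]])
```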

4.5.2 Dataframe attributes

Attribute name    Description
.shape            Returns a tuple that shows the number of rows and columns
.index            Returns the row index (the row labels)
.columns          Returns the names of each column
.dtypes           Returns the data type for each column
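Attributes are read without parentheses. The following sketch shows the four attributes on a small hypothetical DataFrame named df; the three rows are taken from the output shown earlier.

```python
import pandas as pd

# Hypothetical mini-table for illustration.
df = pd.DataFrame(
    {"school": ["Virginia (Darden)", "Harvard", "Columbia"],
     "duration (months)": [21, 18, 20]}
)

# .shape is a tuple, so it can be unpacked into two variables.
n_rows, n_cols = df.shape
print(n_rows, n_cols)          # 3 2

print(df.index)                # the row labels (0, 1, 2 by default)
print(df.columns.tolist())     # ['school', 'duration (months)']
print(df.dtypes)               # one data type per column
```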

4.5.3 Dataframe Methods

Method name      Description
head()           Select the first n rows (5 by default)
tail()           Select the last n rows (5 by default)
nsmallest()      Select and order the bottom n entries by a column
nlargest()       Select and order the top n entries by a column
sample()         Randomly select n rows
type()           Built-in Python function rather than a DataFrame method; use type(df) to check the type of object
info()           Learn the data types of your columns, number of rows, non-null values
describe()       Run summary statistics on your data
sort_values()    Order rows by the values of a column (low to high by default; pass ascending=False for high to low)
rename()         Renames columns of a DataFrame
drop()           Drops columns or rows from a DataFrame
copy()           Creates an independent copy of the DataFrame, rather than a slice
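A few of these methods are sketched below on a small hypothetical DataFrame named df. The rows are copied from the output shown earlier; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical mini-table built from five rows of the MBA data.
df = pd.DataFrame(
    {"school": ["Chicago (Booth)", "Dartmouth (Tuck)", "Virginia (Darden)",
                "Harvard", "Columbia"],
     "total tuition ($)": [106800, 106980, 107800, 107000, 111736],
     "duration (months)": [21, 21, 21, 18, 20]}
)

first_two = df.head(2)                                # first 2 rows
priciest = df.nlargest(2, "total tuition ($)")        # top 2 by tuition
shortest_first = df.sort_values("duration (months)")  # ascending by default
longest_first = df.sort_values("duration (months)", ascending=False)

# rename() and drop() return new DataFrames; df itself is unchanged.
renamed = df.rename(columns={"total tuition ($)": "tuition"})
no_duration = df.drop(columns=["duration (months)"])

print(priciest["school"].tolist())       # ['Columbia', 'Virginia (Darden)']
print(shortest_first.iloc[0]["school"])  # Harvard
```

Because most of these methods return a new object, assign the result to a variable if you want to keep it.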

4.5.4 Dataframe subsetting techniques

Technique                                Description
df[["column_name1", "column_name2"]]     selects one or more columns (returns a DataFrame)
df.column_name1 or df["column_name1"]    selects a single column (returns a Series)
df.at[1, "Poker"]                        selects a single value by row label and column name
df.iat[1, 2]                             selects a single value by row position and column position
df.iloc[:, [1, 2, 5]]                    selects all rows and the columns at the listed positions
df.loc[:, "Slot":"Bac"]                  selects all rows and the columns named Slot through Bac (inclusive)
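The techniques in the table can be tried on a toy DataFrame. In the sketch below the values are made up and the Day column is hypothetical; only the column names Slot, Poker, and Bac echo the table. The column positions passed to iloc are adjusted to fit this smaller table.

```python
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame(
    {"Day": ["Mon", "Tue", "Wed"],
     "Slot": [10, 20, 30],
     "Poker": [1, 2, 3],
     "Bac": [5, 6, 7]}
)

one_column = df["Slot"]               # a Series
two_columns = df[["Slot", "Poker"]]   # a DataFrame
single_value = df.at[1, "Poker"]      # row label 1, column "Poker"
same_value = df.iat[1, 2]             # row position 1, column position 2
by_position = df.iloc[:, [0, 2]]      # all rows; columns Day and Poker
by_name = df.loc[:, "Slot":"Bac"]     # all rows; Slot through Bac, inclusive

print(single_value, same_value)       # 2 2
print(by_name.columns.tolist())       # ['Slot', 'Poker', 'Bac']
```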

4.6 Summary

  • Python is an object-oriented programming language (OO). Objects are created from classes. A class defines the methods, data, and attributes related to an object of that class type.
  • The datetime module allows you to retrieve the current date and time and create your own datetime objects.
  • Data files can be read into Python as a DataFrame using the read_csv() function from the pandas library.
  • A DataFrame is a data structure for storing data in a table-like format.
  • The DataFrame class comes from the pandas library.
  • The key attributes and methods for deriving information about a DataFrame include: .info(), .shape, .columns, .index, .index.values, and .dtypes.
  • Selecting a column with single square brackets returns a data structure of type Series, as in subset = mydf["school"].
  • Use double square brackets (i.e. square brackets around a list) to subset DataFrames and return objects of type DataFrame, as in subset = mydf[["school"]].
  • Use the iloc indexer to extract (or slice) a range of rows from a DataFrame by integer position. The iloc indexer is exclusive of the last element. Alternatively, use the loc indexer to extract (or slice) a range of rows by row label. The loc indexer is inclusive of the last element.
  • It’s best to use .copy() when you want to extract a subset of a DataFrame that is independent of the original DataFrame.
  • Every row has both a label (the index) and an integer position. With the default index, the two are identical. The index can be changed by reassignment using mydf.index or by specifying the index upon reading in the data.
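Two of the points above, single versus double square brackets and the use of .copy(), can be checked with a short sketch. It assumes a hypothetical two-row DataFrame named mydf; the value 99 is arbitrary.

```python
import pandas as pd

# Hypothetical two-row table for illustration.
mydf = pd.DataFrame(
    {"school": ["Harvard", "Columbia"],
     "duration (months)": [18, 20]}
)

as_series = mydf["school"]       # single brackets
as_frame = mydf[["school"]]      # double brackets
print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame

# Changing a copy leaves the original DataFrame untouched.
subset = mydf.iloc[0:1].copy()
subset.loc[0, "duration (months)"] = 99
print(mydf.loc[0, "duration (months)"])  # 18
```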

Resources

Python Software Foundation (2019). Introduction. Available at: https://docs.python.org/3/library/intro.html

Python Software Foundation (2019). Functions. Available at: https://docs.python.org/3/library/functions.html

https://www.hackerearth.com/practice/machine-learning/data-manipulation-visualisation-r-python/tutorial-data-manipulation-numpy-pandas-python/tutorial/

Basics of OO. Available at: http://openbookproject.net/thinkcs/python/english3e/classes_and_objects_I.html

Exercise 4.1

  1. Download the portfolio.csv file from http://becomingvisual.com/python4data/portfolio.csv and import it into Python as a DataFrame named portfolio.
  2. Use the appropriate method to show the data types for each column.
  3. Use the appropriate method to show the number of columns and rows in the portfolio DataFrame. Write a pretty sentence that uses these values and reads: There are 14 rows and 4 columns in the portfolio DataFrame.
  4. Rename the first column in the DataFrame to Security.

Exercise 4.2

  1. You decide to purchase 5 more shares of V. Change the number of V shares to 55 (hint: V is in row 9 of the Shares column; use df.loc[row #, 'column name'] = new value).

  2. Add a new column named 'market_value' to your DataFrame that calculates the market value of each security (hint: portfolio.Price*portfolio.Shares).

  3. Use the .sum() method to total your new market_value column to find the value of your portfolio, and call this variable 'total_value'.

Assignment 4

Part 1: Dataframes

  1. Download the windenergy.csv file from http://becomingvisual.com/python4data/windenergy.csv and import it into Python. This data set breaks down utility-scale wind turbines by state in the US.

  2. Print out the top 10 states utilizing the most wind power by rank.

  3. Create a new DataFrame that contains only the State, Ranking, Installed Capacity, Num of Turbines, and Homes Powered columns.

  4. Print out the data for our great state of New York in a pretty sentence.

Part 2: Classes and objects

This question is not based on the data above.

  1. Go to the USGS website: https://eerscmap.usgs.gov/uswtdb/api-doc/#keyValue to learn more about the API they provide.
  2. Use the following code to read in the data from the API.
import pandas as pd
import requests

# Request the turbine records, parse the JSON, and load it into a DataFrame.
url = 'https://eersc.usgs.gov/api/uswtdb/v1/turbines'
results = requests.get(url).json()
results = pd.DataFrame(results)
  3. Create a class called turbines with two default parameters: location and turbine_id. In your class definition, include at least four methods: getLocation, setLocation, getYear, and setYear.

  4. Create 3 turbine objects with the required attributes, and assign the year attribute. Select three turbine records from your results to create your three turbine objects. Provide meaningful output each time you create an instance (object) and set the attributes. Return the values for all attributes for each object. Use the variables t_state for your location attribute, case_id for the turbine_id attribute, and p_year for the year attribute.

References