3 R Packages and Scripts

Welcome to session 3. In this session we’ll learn more about the extended packages available in R, configure our working space, learn how to import and work with a CSV file, manipulate data frames, develop proficiency in writing R scripts, and document our R code using comments. Finally, you will learn how to produce polished reports and publish them to the web.

Session 3 Outline

  1. Installing and loading packages
  2. Setting up your working directory
  3. Downloading and importing data
  4. Working with missing data
  5. Extracting a subset of a data frame
  6. Writing R scripts
  7. Adding comments and documentation
  8. Creating reports
  9. Summary

1. Installing and loading packages.

All of the functions that you’ll want to use in R come in the form of packages. A package is a big collection of functions and other R objects that are all grouped together under a common name.

Some packages are pre-installed in R. To learn which packages are installed in your version of R, type:

library( )

For the data visualization course you’ll be required to use additional packages that are not loaded with R. For statistics you may want to download the psych package for additional statistical functions.

Installing packages

For now, let’s install two packages: psych and ggplot2. Select the packages tab from the lower right frame. This is called the files frame (see Figure 3.1).

Figure 3.1: A screenshot of the packages tab in RStudio.


Next, select the install packages button:


This will bring up a dialog box similar to the one in Figure 3.2. The first drop-down asks you from which R repository you would like to download a package. The default, Repository (CRAN), will work well for our purposes. CRAN stands for the "Comprehensive R Archive Network" and it is usually easiest to download a package from one of the CRAN mirror sites. Ensure you are connected to the internet when installing packages. Next, enter the package name in the Packages field.

Enter: ggplot2, psych

Keep the rest of the options as they appear by default. Then click the install button.

Figure 3.2: A screenshot of the install packages dialog box in RStudio.
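
The same installation can also be done from the console. The sketch below is one possible approach, not what the RStudio dialog runs: the helper name missing_packages is made up for illustration, and the install step is guarded by interactive( ) so that it only runs when you are working at the console (with an internet connection).

```r
# Hypothetical helper: which of the requested packages are not installed yet?
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(c("ggplot2", "psych"))

# install.packages() downloads from CRAN, so it needs an internet connection.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```

Checking first avoids re-downloading packages that are already in your library.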


Loading and unloading packages

To use a package that is new to your instance of R, or one that is not loaded by default in RStudio, you need to load the package. Simply go to the Packages tab in RStudio and check the box next to the packages you want to use. Select ggplot2 and psych (see Figure 3.3).

You could also type the command:

library(ggplot2)

In the future this will be useful when you are writing lines of R code that you want to reuse. It will be helpful to know which libraries are being used and called upon that are not loaded by default in RStudio.

Now, when we load the psych library at the console, R gives us a message that it’s attaching (making active) the psych package, but that two objects are masked. This may seem confusing, but R is basically saying that %+% and alpha are defined in both the ggplot2 and psych packages. Since psych was the last package loaded, R will default to using the versions from the psych package. In some contexts, this could lead to unanticipated results. In psych, %+% is a matrix addition operator (something we probably won’t be using); in ggplot2, it modifies an existing plot. Matrix multiplication, which we will talk about at some point, uses a different operator, %*%.

library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

The find( ) function tells us which attached packages define %+%:

find("%+%")
## [1] "package:psych"   "package:ggplot2"

Given that we probably want to know which function we are using from which package, it’s best practice to load only the packages you need at the time you need them. When you are done using a package you can simply uncheck the package from the Packages tab in RStudio or use the following command:

detach("package:ggplot2", unload=TRUE)

Figure 3.3: A screenshot of the packages tab in RStudio with packages selected.
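
To confirm that a package has really been attached or detached, you can look at R's search path. The sketch below uses splines only because it ships with every R installation and so runs without installing anything; substitute ggplot2 or psych once they are installed.

```r
# splines is a stand-in here; it is included with R itself
library(splines)
attached_before <- "package:splines" %in% search()

# Detach it again, the same way the text detaches ggplot2
detach("package:splines", unload = TRUE)
attached_after <- "package:splines" %in% search()

attached_before  # TRUE: the package was on the search path
attached_after   # FALSE: it has been detached
```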


Recommended packages

In addition to installing psych and ggplot2, I would recommend installing the following packages: car, reshape2, and lsr. We will use these packages later on in this course.

We will also use other packages that are preloaded in R. Typically, if they are not selected by default, you’ll see a reference to the library( ) function, such as:

library(foreign)

2. Setting your working directory

Pretty soon we will be working with datasets that we download from other sources, such as NYC Open Data.

When working with data, it’s important to stay organized. You want to be able to easily find your files, scripts, graphs, etc. For the purposes of this course, create a folder on your desktop (this is outside of RStudio) called mydata (or whatever works for you).

In R, set your working directory to a place where you will store all your data files. In our case, let’s set our working directory to that mydata folder you created on your desktop. This way any time you want to open a file, R will know where to look (and so will you).

You can set your working directory in RStudio by going to Session > Set Working Directory > Choose Directory, see Figure 3.4.

Figure 3.4: Setting your workspace in RStudio.


You can also set your working directory using the setwd( ) function. To learn more type:

?setwd
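
As a rough sketch of what that looks like in practice, the code below points R at the mydata folder and then confirms the result with getwd( ). The path is an assumption (a folder named mydata on your desktop); adjust it to wherever you created your folder.

```r
# Assumed location: a folder called mydata on your desktop (adjust to your own path)
data_dir <- file.path(path.expand("~"), "Desktop", "mydata")

# Only change directory if the folder really exists, to avoid an error
if (dir.exists(data_dir)) {
  setwd(data_dir)
}

getwd()  # shows the directory R is currently working in
```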

3. Downloading and importing data

In R, you can import data files that you created in Excel, MiniTab, SPSS, Stata, and many more programs. However, the cleanest and easiest way to import data is as a CSV file. I would recommend you first export your file as a CSV from the program you created the file in, such as Excel.

Many files you will be using in this course will be in CSV format. This means your file will have the extension .csv, as in myfile.csv. "CSV stands for comma-separated values or character-separated values. This file format stores tabular data (numbers and text) in plain-text form. Plain text means that the file is a sequence of characters, with no data that has to be interpreted instead, as binary numbers. A CSV file consists of any number of records, separated by line breaks of some kind; each record consists of fields, separated by some other character or string, most commonly a literal comma or tab. Usually, all records have an identical sequence of fields" (“Comma Separated Values” 2018, para. 4).

Downloading Data

Before we can import a file, we need to have a file to import. Let’s use a file from NYC Open Data on Sidewalk Cafes. You can download the dataset from:

http://becomingvisual.com/rfundamentals/Sidewalk_Cafes.csv

Find the Sidewalk_Cafes.csv file that you downloaded (you can probably find it in your Downloads folder on your computer) and move it to your mydata folder on your desktop.

Importing a dataset

Your workspace. In RStudio the upper right quadrant is called your workspace. To import a CSV file simply click on the import dataset button in your workspace.

Select From CSV File… to import a CSV file that is stored on your hard drive.

Figure 3.5: A screenshot of the Workspace in RStudio.


Next, navigate to the Sidewalk_Cafes file that you saved earlier by selecting the Browse button in the Import Text Data dialogue box shown below, see Figure 3.6.

Figure 3.6: Navigating to your file in RStudio.


This will launch the Import Text Data window, see Figure 3.7. Here you can set your preferences for how you would like R to read in your .CSV file.

Name. First, you can name your dataset. There is already a default name given. Let’s replace it with sidewalk, as shown in Figure 3.7.

Heading. Select yes since the first row in the dataset contains the column headings. These include Entity.Type, License.Number, Sidewalk.Cafe.Type, etc. If you do not select yes for heading, R will create a default header using V’s (e.g. V1, V2, V3). This works well if you do not have a header, but in our case we do and we don’t want R to create a header for us. If we selected no, our first row of data would contain the column names, such as Entity.Type, etc.

Separator. This is the delimiter for your CSV. In this case, the delimiter is a comma. You can see this in the input file window. Other options include whitespace, tab, or a semicolon.

Decimal. If there are decimal points in your data, select period to use the period character for the decimal point. This is the default setting and you can keep it set to period for our purposes. Select comma only if your data uses a comma as the decimal mark.

Quote. This sets the character that encloses text values in your file: double quotes, single quotes, or none. You can keep the default setting.

Once you’ve set your import preferences for this dataset, select import.

Figure 3.7: Setting import preferences in RStudio

Aside: An alternative way to import a CSV file is to use the base R read.csv( ) function, instead of the read_csv( ) function from the readr package:

sidewalk <- read.csv("Sidewalk_Cafes.csv")
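
The import preferences described above correspond to arguments of read.csv( ). The sketch below spells them out on a tiny made-up file written to a temporary location, so it runs without the real dataset; the two data rows are invented for illustration and are not taken from Sidewalk_Cafes.csv.

```r
# Write a tiny, made-up CSV to a temporary file so the example is self-contained
toy_file <- tempfile(fileext = ".csv")
writeLines(c("Entity.Type,Lic.Area.Sq.Ft",
             "Corporation,120",
             "LLC,"),
           toy_file)

toy <- read.csv(toy_file,
                header = TRUE,  # Heading: the first row holds the column names
                sep = ",",      # Separator: fields are split on commas
                dec = ".",      # Decimal: a period marks the decimal point
                quote = "\"")   # Quote: text may be wrapped in double quotes

toy$Lic.Area.Sq.Ft  # the empty field in the second row is read as NA
```

These values are also the defaults for read.csv( ), which is why the one-argument call above works for our file.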

Once you’ve imported your file, you should see it appear in the source window, see Figure 3.8.

Figure 3.8: The sidewalk CSV file loaded in the source window.


The command to view the file in the source window is:

View(sidewalk)

The source window displays the sidewalk data in a spreadsheet view.

Change the data to a data type called a data.frame.

sidewalk <- data.frame(sidewalk)

We check this by using the class( ) function we learned earlier.

class(sidewalk)
## [1] "data.frame"

The data is organized in several rows and columns. The columns are titled (that’s the header), each with a different variable name. The first variable name is Entity.Type. Let’s use the length( ) function to count how many rows of data we have.

We do this by passing in the data frame name (sidewalk) and the column name (Entity.Type). The $ indicates that the text to follow is a variable name.

length(sidewalk$Entity.Type)
## [1] 1008

There are 1008 values for Entity.Type. This is equivalent to the number of rows or observations in the dataset. This information is also noted for you in the source window (return to Figure 3.8).
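
Counting the values in one column is one way to get the number of rows; base R also has functions that report the dimensions of a data frame directly. The sketch below uses a small made-up data frame named toy as a stand-in for sidewalk, so the numbers shown are for the toy data only.

```r
# Made-up stand-in for the sidewalk data frame
toy <- data.frame(Entity.Type = c("Corporation", "LLC", "Individual"),
                  Lic.Area.Sq.Ft = c(120, NA, 300))

length(toy$Entity.Type)  # 3: number of values in one column
nrow(toy)                # 3: number of rows (observations)
ncol(toy)                # 2: number of columns (variables)
dim(toy)                 # 3 2: rows, then columns
```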

4. Working with missing data

Let’s do a little more with this data frame. Let’s say you wanted to know the average or the mean square footage for sidewalk cafes in NYC. You can use a simple function called mean( ) (the arithmetic mean) to compute this value for the variable Lic.Area.Sq.Ft.

mean(sidewalk$Lic.Area.Sq.Ft)
## [1] NA

However, you can see that R returns the value of NA. We were expecting a number, not NA. This is a clue that our data probably contains a special value, NA.

Let’s just view the sidewalk$Lic.Area.Sq.Ft variable.

View(sidewalk$Lic.Area.Sq.Ft)

If we scroll down in the source window (spreadsheet view) we can see there are some values of NA. The function mean( ) cannot compute the calculation because NA is not a number. Only numbers can be used to compute the mean. This is why the value of NA was returned from R.

We need to remove the NA values or tell R to ignore them when computing the mean. We are going to do this by passing in a parameter to the mean( ) function called na.rm=TRUE. This will indicate that the NA values should be stripped before the computation proceeds.

mean(sidewalk$Lic.Area.Sq.Ft, na.rm=TRUE)
## [1] 258.6475

As you can see, we were able to compute the mean square footage for sidewalk cafes in NYC: about 259 square feet. We can use the round( ) function that we learned earlier to round 258.6475 to the nearest whole number, 259.

round(mean(sidewalk$Lic.Area.Sq.Ft, na.rm=TRUE))
## [1] 259
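
Before removing missing values it can help to know how many there are. The sketch below uses a short made-up vector rather than the real column, and shows that na.rm=TRUE gives the same result as dropping the NA values yourself.

```r
# Made-up values; NA marks a missing measurement
area <- c(120, NA, 300, 180, NA)

sum(is.na(area))          # 2: number of missing values
mean(area)                # NA: a single missing value makes the result unknown
mean(area, na.rm = TRUE)  # 200: NA values are removed first
mean(area[!is.na(area)])  # 200: the same thing, done by hand
```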

At this point, you should practice retrieving variables and values from the sidewalk data frame. Return to session 2 to review working with data frames.

5. Extracting a subset of a data frame

Subsetting your data frame is useful if you want to select different variables within the data frame (i.e., keep only some of the columns) or a subset of observations (i.e., keep only some of the rows).

There are a few ways to subset your data. I’ll introduce you to 3 simple ways.

Using the $ operator

If you only want to extract one column of your data frame you can use the $ operator. In the example below, we extract the column Address.Zip.Code from the sidewalk data frame and assign it to a variable called zipcode. Printing zipcode lists all 1008 values.

zipcode <- sidewalk$Address.Zip.Code
zipcode
##    [1] 10003 10028 10065 11201 10014 10014 10014 10023 10022 11235 10028 10019
##   [13] 10025 10028 10023 10016 11211 10012 10012 10023 10014 10010 10012 10014
##   [25] 10013 11231 10003 11101 10014 10023 10012 10021 10009 10024 10025 10003
##   [37] 11372 10012 10019 10075 11201 10003 10024 11105 10010 10009 10012 11209
##   [49] 10075 10463 10024 10012 10011 10003 10014 10024 10028 10013 10025 10012
##   [61] 10024 10021 10021 10003 10023 10023 10023 10003 11102 10011 10016 10065
##   [73] 10028 10025 11106 10024 10024 10014 10014 11375 10011 10003 10014 10010
##   [85] 10001 10013 10012 10013 10012 10014 10028 10003 10012 10016 10003 10003
##   [97] 10024 10021 10010 10012 10003 10021 10011 10024 10012 10003 10013 10013
##  [109] 10128 10003 11414 10075 10024 10024 10003 10021 10016 10016 10021 10024
##  [121] 10003 10019 10013 10013 10024 10036 11106 10003 10009 10013 10128 10012
##  [133] 10021 11377 11106 10065 10028 10028 11103 11103 10017 10017 10075 10003
##  [145] 10014 10013 10013 10024 10023 10025 10023 10016 10003 11106 10024 10012
##  [157] 10028 10025 10011 10024 10011 10025 10014 10128 10075 10075 10013 10014
##  [169] 10003 10024 10014 11219 10024 10021 10014 10003 10021 10028 10468 10024
##  [181] 10003 10024 10021 10028 10013 10013 10025 10023 10003 10025 11372 10075
##  [193] 10024 10003 11211 10024 10025 10065 10065 10013 10014 10028 10016 10012
##  [205] 10009 10011 10032 10011 10025 11231 10025 10024 10128 10025 10065 10009
##  [217] 10065 10014 10075 10024 10036 10023 10014 11106 10011 10013 10027 10009
##  [229] 10065 10024 10003 10032 10016 10009 10014 10003 11102 10036 11209 10012
##  [241] 10011 10024 10019 10014 10036 11366 10021 10024 10009 10014 10011 10036
##  [253] 11201 10036 10012 10025 10003 10009 10014 10016 11211 10023 11102 10003
##  [265] 10075 10016 11105 10011 10022 10022 10021 10022 10012 10011 10025 10075
##  [277] 10014 10013 10033 11230 10013 11103 11211 10024 10011 10128 10025 10027
##  [289] 10016 11103 10027 10011 11222 10025 11229 10013 10028 10014 10011 10011
##  [301] 10013 11375 10021 10003 10065 10023 10025 10065 10036 11209 10025 10019
##  [313] 10013 10014 10011 10012 10002 11102 10028 10011 11220 10009 10013 10013
##  [325] 11229 11211 10024 10025 10013 10001 11101 10014 10003 10024 10013 10014
##  [337] 10014 10028 10023 10014 11101 10014 10011 10013 10016 11105 10014 10128
##  [349] 11375 11105 10012 11211 10014 11215 10036 10013 10024 10065 11101 10011
##  [361] 11211 10003 11215 10017 10013 10014 11105 10036 11209 10065 10021 10021
##  [373] 10014 10009 10065 10011 10002 11105 10025 10019 10065 10019 10019 10065
##  [385] 10036 10075 10028 10003 10027 10016 10024 10012 10028 10022 10013 10065
##  [397] 10013 11209 10036 10463 10014 10065 10016 10023 10011 10014 10002 10003
##  [409] 10001 10025 10014 10014 10012 10010 10011 10003 10028 10014 10014 10002
##  [421] 10003 10022 10025 10019 10013 11211 10036 10023 10024 10010 10021 10016
##  [433] 10003 10014 11209 11217 10014 10075 11103 11215 10128 10028 10003 10003
##  [445] 10014 11231 10019 10470 10458 10012 10021 10011 10007 10022 10013 10002
##  [457] 11235 11205 10028 11103 10128 10461 10036 11377 10014 10065 10002 10016
##  [469] 10010 10065 10013 11231 10010 10014 10002 10019 10024 10461 10019 10024
##  [481] 10034 10013 10021 10467 10013 10014 10128 10024 10022 10021 10023 10128
##  [493] 10016 10014 10021 10024 10024 10024 10013 10023 10128 10014 10014 11102
##  [505] 10075 10022 10023 10011 11217 10009 10003 10011 10075 10021 11209 11106
##  [517] 10010 11231 10009 11228 10014 10012 11103 10025 11205 10003 11205 10014
##  [529] 10011 10019 10011 11105 10025 10128 10013 10003 11374 10128 11205 10028
##  [541] 10013 10013 10023 11367 10016 10002 10011 11103 10034 10023 10075 10028
##  [553] 10025 10012 10013 10016 10003 10038 11201 10012 10014 10075 10021 10024
##  [565] 11201 10075 10036 10023 10002 11101 10014 10013 10009 11209 10016 10023
##  [577] 10028 10022 10021 10305 11222 10014 10065 10003 10022 10016 10463 10014
##  [589] 11209 10021 11103 10075 10013 10017 11217 10003 10451 10012 11217 10075
##  [601] 10038 10012 10024 10019 10024 10010 11205 10025 10003 10014 10036 10003
##  [613] 11231 10009 10012 10022 10036 10001 10016 10036 10022 10013 10013 10024
##  [625] 10038 10021 10014 10019 10009 11106 10014 11105 10036 10128 10464 10028
##  [637] 10003 10011 10065 10032 10003 10013 10014 11694 11377 10014 11217 10028
##  [649] 10019 10458 10019 10014 10028 10010 10019 10003 11103 10011 11102 11103
##  [661] 10075 10075 10028 10016 10014 10012 10461 10019 10002 10009 10128 10003
##  [673] 10016 11211 10128 10021 10023 10014 10038 10012 11103 10013 10022 11215
##  [685] 10024 10013 10028 10025 10024 10465 10075 11217 11228 10023 10013 10009
##  [697] 10024 10028 10036 10014 10013 10023 11101 10025 10023 10128 11231 10031
##  [709] 10075 11217 10128 10028 10009 10019 11201 11217 10016 10012 10014 10011
##  [721] 10022 11106 10030 11211 11106 10036 10012 10003 10010 10465 10036 10025
##  [733] 10025 10003 10019 10014 10036 11105 10128 10025 10013 11215 10003 11201
##  [745] 10003 11215 10025 10035 10075 10075 10010 10009 10012 10012 10458 10024
##  [757] 10025 10025 11101 11105 10014 10004 10002 11104 10040 11102 10012 10128
##  [769] 11106 10002 10009 10003 10028 11211 11211 10027 10024 10003 10014 10025
##  [781] 10010 10023 10023 10032 10024 10037 10022 10009 10024 10024 11211 10036
##  [793] 11106 11105 11104 11211 10012 10003 11105 10021 11222 10014 10065 10014
##  [805] 10464 10009 10128 10019 10032 11235 10458 10014 10009 10022 10019 10019
##  [817] 10001 10029 10036 10023 11374 11203 10014 11101 11211 10128 10075 11235
##  [829] 10016 11106 10128 10016 10034 10019 10010 10003 11103 10014 10012 10014
##  [841] 10065 10040 11372 10012 10019 11374 11215 10128 10011 10024 10019 10025
##  [853] 10023 11205 10014 10128 11231 10028 10027 10009 11222 10014 11235 10019
##  [865] 11103 11215 11201 11104 10065 10019 10128 10036 11215 10022 11205 10017
##  [877] 10022 11377 11373 10028 11373 11103 10016 10012 10028 10016 10280 10280
##  [889] 10013 10024 11375 10003 10009 10001 10013 11205 10128 11222 11217 10463
##  [901] 10025 10028 10016 10023 10013 10065 10011 10014 11103 10024 10001 10021
##  [913] 10006 10019 11231 10034 10001 11201 11105 10028 10016 10012 10024 10013
##  [925] 10024 11201 10016 10010 10001 10001 10471 10024 10014 11103 11101 10023
##  [937] 11104 10022 11375 10003 11217 10027 10003 11201 11103 11365 10014 10014
##  [949] 10023 11235 10014 11106 11106 10019 10016 10009 11211 10022 10019 10065
##  [961] 10458 11106 10024 10027 10025 10011 11209 10022 10016 10023 10014 10024
##  [973] 10025 10024 11238 10003 10012 10012 10014 10014 10021 10003 11205 11217
##  [985] 10065 10003 10128 11217 11215 10036 10030 10075 10014 10075 10014 10025
##  [997] 10014 11217 10012 10016 10012 10016 11215 10024 10031 10023 10023 10024

By default, our new variable zipcode is an integer vector, since the column we extracted holds whole numbers. If you need to change it to a data frame you can use the data.frame( ) function as shown below.

class(zipcode)
## [1] "integer"
zipcode <- data.frame(zipcode)
class(zipcode)
## [1] "data.frame"

Using the subset( ) function

The subset( ) function is an easy way to select particular rows and columns. The function is organized as follows:

subset(x =, subset =, select =)

x. The data frame you want to subset

subset. A vector of logical values indicating which observations (rows) of the data frame you want to keep. By default all rows will be retained.

select. Indicates the variables (columns) you want to keep. By default all columns will be retained.

Try to implement the example below. Here we are creating a variable sidewalk_subset to hold the data we are subsetting from sidewalk. In this example, we only want to keep those rows where the zip code is 10012. This is the zip code for NYU Stern. In addition, we want only the names of those locations (Entity.Name) to be returned.

sidewalk_subset <- subset(x = sidewalk, subset = Address.Zip.Code == 10012, select = Entity.Name)
sidewalk_subset
##                             Entity.Name
## 18               LU-ANN BAKERY SHOP INC
## 19                      CAFFE DANTE INC
## 23                 CAVALLACCI, FABRIZIO
## 31                   DYNAMIC MUSIC CORP
## 38      NILO INC.& VIOLA CONSULTING LLC
## 47                         FEENJON CORP
## 52   POMODORO RESTAURANT & PIZZERIA INC
## 60                  DOJO RESTAURANT INC
## 87   172 BLEECKER STREET RESTAURANT,INC
## 89       RESTAURANT VENTURES OF NY,INC.
## 93                 BLL RESTAURANT CORP.
## 100                      NOHO STAR INC.
## 105                     ANDIKIANA CORP.
## 132                   ERJO COMPANY, LLC
## 156                     TRE-GIOVANI INC
## 204     31 GREAT JONES RESTAURANT CORP.
## 240          172 BLEEKER ST. REST., INC
## 255                RDK RESTAURANT CORP.
## 273                       IL BUCO CORP.
## 316           YAMASAK RESTAURANT, CORP.
## 351            MACDOUGAL BLEECKER CORP.
## 392                        SERVICE CORP
## 413               CLAUDISAL REST. CORP.
## 450                       P. M. W. INC.
## 522           IRIDIUM RESTAURANT, CORP.
## 554            G.D.P. ENTERPRISES, INC.
## 560          114 KENMARE ASSOCIATES LLC
## 598                     FOCACCERIA, LTD
## 602                   CAFFE VETRO, INC.
## 615                        247 DELI LLC
## 666                        177 NAP, INC
## 680                      CANTALOUPE LLC
## 718          SENGUPTA FOOD SERVICES LLC
## 727                  THINK BLEECKER LLC
## 753             FGNY 496 LAGUARDIA, LLC
## 754          HALF PINT ON THOMPSON, LLC
## 767                      316 BOWERY LLC
## 797                      265 PASTRY LLC
## 839         BONARUE BLEU INDUSTRIES INC
## 844              GROOVE ENTERPRISES INC
## 884             PASTA BISTRO GRILL INC.
## 922                     PGT REST. CORP.
## 977                        EMILIA, INC.
## 978                  FRIENDLY FOODS LLC
## 999                   333 LAFAYETTE LLC
## 1001                   151 BLEECKER LLC

Notice that there is a number next to each observation or row. R automatically creates this. Each number references the row of the data frame from which the data was extracted (here, sidewalk).
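
The subset and select arguments can also combine several conditions and several columns. The sketch below is an illustration on a made-up data frame (the cafe names and values are invented), not on the real sidewalk data.

```r
# Made-up stand-in for the sidewalk data frame
toy <- data.frame(
  Entity.Name = c("CAFE A", "CAFE B", "CAFE C", "CAFE D"),
  Address.Zip.Code = c(10012, 10012, 10003, 10012),
  Lic.Area.Sq.Ft = c(150, 400, 220, NA)
)

# Keep rows in zip code 10012 with more than 200 square feet,
# and return two columns instead of one
big_cafes <- subset(x = toy,
                    subset = Address.Zip.Code == 10012 & Lic.Area.Sq.Ft > 200,
                    select = c(Entity.Name, Lic.Area.Sq.Ft))
big_cafes
```

Only CAFE B is returned. CAFE D is dropped because its area is NA, and subset( ) treats a condition that evaluates to NA as not met.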

Using square brackets

We can reference rows and columns of a data frame or vector using [ ]. The format for usage is as follows:

dataframe[rows, columns]

If we wanted the first 2 rows from sidewalk and 5th through 12th variables, Entity.Name through Location.1 we could write the following:

sidewalk[1:2, 5:12]

The colon represents a range. The ranges here are 1:2 (rows 1 through 2) and 5:12 (columns 5 through 12). However, there are times when you want to reference rows or columns that do not form a continuous range. For example, let’s say you wanted to extract rows 1 through 100 for only columns 3 and 5. To do this you need to use the combine function, c( ), to list the values you want.

sidewalk[1:100, c(3, 5)]

If using numbers to refer to variables (columns), such as 3 and 5, becomes too abstract, you can always pass in the actual names of the columns. See the example below.

sidewalk[1:100, c("Sidewalk.Cafe.Type", "Entity.Name")]

Finally, there are times where you want to display all the rows or all of the columns. This can be done by simply leaving the row or column value blank (but keep the comma to separate the row and column parameters). See below for an example that displays all the rows, but only two columns.

sidewalk[, c("Sidewalk.Cafe.Type", "Entity.Name")]

The example below displays all columns, but only the first 100 rows.

sidewalk[1:100, ]
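
Square brackets also accept a vector of TRUE and FALSE values in the rows position, which gives a bracket-based equivalent of the subset( ) example. The sketch below uses a made-up data frame; the names and values are invented for illustration.

```r
# Made-up stand-in for the sidewalk data frame
toy <- data.frame(
  Entity.Name = c("CAFE A", "CAFE B", "CAFE C", "CAFE D"),
  Address.Zip.Code = c(10012, 10012, 10003, 10012),
  Lic.Area.Sq.Ft = c(150, 400, 220, NA)
)

# TRUE for every row in zip code 10012
in_10012 <- toy$Address.Zip.Code == 10012

# Rows where the condition is TRUE, two named columns
toy[in_10012, c("Entity.Name", "Lic.Area.Sq.Ft")]

# Selecting a single column returns a plain vector...
toy[in_10012, "Entity.Name"]

# ...unless drop = FALSE is added to keep the data frame structure
toy[in_10012, "Entity.Name", drop = FALSE]
```

Unlike subset( ), square brackets do not discard rows where the condition itself is NA, so this form is safest on columns without missing values.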

6. Writing R scripts.

Creating an R script is an alternative to typing your R commands in the console window. An R script is simply a text file in which your R commands are stored in a logical sequence.

There are many advantages to writing your R code as an R script file.

  • You can save your work
  • You can reuse your code
  • You can document your work
  • You can share your work with others
  • You can move beyond writing one line of code at a time

In RStudio you can easily create an R script by going to the file menu > New > R Script. This will bring up a blank document in the source pane. See Figure 3.9. Save your script as sidewalk_cafes_script.R by going to File > Save As > sidewalk_cafes_script.R.

Figure 3.9: A new R script file shown in the source panel in RStudio.


In your R script, you’ll notice the number 1. This indicates line 1 in the file. You can begin typing your R commands in this file. Type the commands you see in Figure 3.10. Then save your file by going to File > Save.

Figure 3.10: R commands written in an R script file.


The first line in the script uses a function called sum( ) to add up the Lic.Area.Sq.Ft column of data. You’ll notice that we passed in na.rm=TRUE to ensure that the NAs were not included in the sum calculation.

Note: These commands will not be executed until you actually run the script.

Running R scripts

There are two ways to run your R script. Note: Any changes made to the script need to be saved before running it.

Option 1. If your script is stored in your working directory you can type the following in the console:

source("sidewalk_cafes_script.R")

If you copy and paste code from a word processor and receive an error such as Error: unexpected input in "source (ð", the cause is the curly "smart" quotes that programs like MS Word substitute for straight quotes. Retype the quotes in RStudio as straight double or single quotes.

However, this method requires you to be very explicit about printing out values to the console. If you ran the command above, you’ll notice the sum and mean were not printed. You would have to modify your file to include the print( ) function. See lines 8 and 12 in Figure 3.11. You’ll notice that we are calling a function within a function: we are passing the output of the sum( ) function to the print( ) function. See Figure 3.12 for the output in the console. This is a shortcut. It’s also possible to first compute the sum and then print it, such as:

mysum <- sum(sidewalk$Lic.Area.Sq.Ft, na.rm=TRUE)
print(mysum)

Figure 3.11: The addition of the print command to enable printing in the console from an R script that is executed using the source( ) function.


Figure 3.12: Output from source( ) function.
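
The difference between a value that is computed and a value that is printed can be seen in a small, self-contained sketch. It writes a three-line script with made-up values to a temporary file (the file and its contents are illustrative, not the course script) and then runs it with source( ) in two ways.

```r
# Write a short script to a temporary file (made-up values)
script_file <- tempfile(fileext = ".R")
writeLines(c(
  "area <- c(120, NA, 300)",
  "sum(area, na.rm = TRUE)",         # computed, but not shown by source()
  "print(sum(area, na.rm = TRUE))"   # shown, because print() is explicit
), script_file)

# Runs the script quietly: only the print() line produces output, [1] 420
source(script_file)

# echo = TRUE shows each command together with its result
source(script_file, echo = TRUE)
```

Using echo = TRUE is an alternative to adding print( ) calls when you want to see everything a script does.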


Option 2. You can use RStudio to execute your code. I prefer this method. You can execute your code line by line by placing the cursor on a line and pressing the run button in the source window, repeating until you reach the end of the script.

To run the full script, highlight the entire script and press run. Try it.

For the rest of this course, we will be using R scripts and R Markdown documents (more on that later) rather than the console to execute commands in R.

7. Adding comments and documentation

Comments are used to document your code in R and all programming languages. It is best practice to include comments in your code. This ensures that if someone else tries to read your script they can interpret it. This is also to help you interpret your code when you return to it days, weeks or months later. See Figure 3.12 for an example.

The comment character, #, tells R to ignore everything written to the right of it on that line. We use a # on each line that we want to treat as a comment. In RStudio, the color of the text will change to indicate that you are writing a comment. In the example below, the commented text is in blue. The R code is in white. The functions and operators are in orange, and the logical data types are in pink.

Try it. Create a script with comments similar to Figure 3.12.

To execute this script, highlight the text and click the run button.

Once you run the script your output should be similar to Figure 3.13. Notice that the comments are also printed in the console.

Figure 3.13: The output of the R script displayed in the console.
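
For reference, a commented script might look like the sketch below. It shows both full-line comments and comments placed at the end of a line of code; the values are made up so that the script runs on its own.

```r
# Sidewalk cafe summary (toy example with made-up values)
# Purpose: show full-line and end-of-line comments

area <- c(120, NA, 300)  # licensed area in square feet; NA marks a missing value

# Total licensed area, ignoring missing values
total_area <- sum(area, na.rm = TRUE)
print(total_area)
```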


8. Creating reports

Let’s say you have a dataset that comes in the same format every week, such as a sales report. Wouldn’t it be nice to simply produce a report every week with a single click, updated with all calculations and visualizations? knitr is an R package that makes it easy to achieve this goal. You can show others what you are doing in R by publishing your code and output to a web page, Word document, or PDF file.

As an aside: we won’t be covering it here, but LaTeX is a very powerful, open source typesetting system, capable of consistently formatting very complex text, such as equations for professional publication. If you want to use knitr to create a PDF document, you will need LaTeX installed.

To create dynamic reports, follow along with the instructions below.

a. Install knitr

Begin by installing the knitr package along with any identified dependencies in RStudio.

b. Modify your R script

Working with your sidewalk_cafes_script.R, add a simple box plot using the code on line 14 of Figure 3.14.

Figure 3.14: The sidewalk example 01.R file


c. Create a MS Word document from your R script

Click on the “Compile Report” button (you can also choose the “Compile Report” command from the File menu).

You will be presented with a menu to choose your output format (HTML, Word, or PDF). If you choose PDF, you will need a specific version of LaTeX installed, depending on your operating system; the error message in the console will tell you which version you need. For now, let’s choose MS Word.

Below is the compiled notebook from the R script. This format makes it easy to add explanatory notes. The plot is also created and included. Notice that the output of each command is marked with ## and interspersed with the R code. That is helpful since any line that begins with # is treated as a comment. So anyone who wants to run the code can copy and paste the entire block of text, including the output, into the console; the code will be executed, but the old output will be ignored.

The output options window shown when compiling a notebook from an R Script.

sidewalk_example_01.R
ksosulsk
Thu May 12 11:13:47 2016

require (readr)
## Loading required package: readr
# Sidewalk Cafe Data Script
# Created by Kristen Sosulski & Michael Sweetman
# May 12, 2016

#modify the path
sidewalk <- read_csv("~/Dropbox/R_Fundamentals/bookdown-minimal-master/Sidewalk_Cafes.csv")
## Parsed with column specification:
## cols(
##   Entity.Type = col_character(),
##   License.Number = col_double(),
##   Sidewalk.Cafe.Type = col_character(),
##   Lic.Area.Sq.Ft = col_double(),
##   Entity.Name = col_character(),
##   Camis.Trade.Name = col_character(),
##   Address.Street.Name = col_character(),
##   Street.Address = col_character(),
##   Address.Location = col_character(),
##   Address.Zip.Code = col_double(),
##   Camis.Phone.Number = col_double(),
##   Location.1 = col_character()
## )
#Check class
class(sidewalk)
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
#mean of square footage
mean(sidewalk$Lic.Area.Sq.Ft, na.rm=TRUE)
## [1] 258.6475
#boxplot of the NYC sidewalk cafe size
boxplot(sidewalk$Lic.Area.Sq.Ft, main="NYC Sidewalk Cafe Size")

d. Creating an R Markdown File

The above example demonstrated how to publish your R script in a readable format, but it still has some drawbacks. Suppose you want to create a report that updates automatically? The above method will still be easier than manually creating the report each time, but formatting and adding comments will still have to be done manually every time the report is run. Creating an R Markdown file will allow us to automate the process of data manipulation, report creation, and formatting.

Try it – A short activity

To get started go to the File menu, click New File > R Markdown as shown in Figure 3.15.

(Note: The first time you do this you may be prompted to install several packages.)

The dialogue box prompt for creating a new R Markdown document.

Figure 3.15: The dialogue box prompt for creating a new R Markdown document.


Title and Author: Enter a name for the file, and your name. Select HTML as the default output format. Click OK.

RStudio will create a Markdown file. This document lets you write in simple plain text, which can then be easily converted into HTML, Word, or PDF. You can easily embed your R code and output. The file that RStudio creates is pre-populated with example Markdown syntax that uses one of the sample data sets built into R.

The following is the example Markdown file:

A sample Markdown file.

Figure 3.16: A sample Markdown file.


To see the HTML file, select the Knit button.

The output is shown in Figure 3.17 below.

The R Markdown HTML document.

Figure 3.17: The R Markdown HTML document.


Some Basic R Markdown Syntax

Markdown allows you to create a webpage automatically from a plain-text base. The R Markdown implementation is based on the Markdown standard, but the Markdown language is somewhat fragmented. Therefore, the syntax used in R Markdown may differ from other implementations of Markdown that you come across.

The purpose of the specialized syntax in the R Markdown file is to allow the translator to identify the intended use of the text, such as a heading, bold font, R code, etc. R code is embedded in the Markdown file in “chunks.” By default, the code in each chunk is included in the report. This can be turned off with the echo=FALSE option. Below are some useful syntax examples.

Document Heading
---
heading text
---

Bold
**text**

Italic
*text*   

List
* Item 1
* Item 2
    + Item 2a
    + Item 2b

R Code
```{r}
place the R Code here
```

R Code With Options
```{r echo=FALSE}
# echo=FALSE outputs the result of the code without outputting the code itself.
# Another common option is eval=FALSE, which puts the code in the report without running it.
```

Inline R Code
You can place R code in the middle of a line like this: `r place code here`

That’s just a bit to get you started. There are many other options, and they are detailed on the RStudio Markdown page.
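As a rough sketch of how these pieces fit together, the code below writes a tiny R Markdown file from R and knits it only if the rmarkdown package and pandoc are available. The file name chunk_demo.Rmd, the title, and the use of the built-in cars data set are assumptions made for this illustration; in RStudio you would normally type the document in the editor and use the Knit button instead.

```r
# Three backticks, built with strrep() so the chunk fence is easy to see in a string
fence <- strrep("`", 3)

# Lines of a minimal R Markdown document (title and contents are illustrative)
rmd_lines <- c(
  "---",
  "title: \"Chunk option demo\"",
  "output: html_document",
  "---",
  "",
  "Some **bold** text, some *italic* text, and inline code:",
  "the mean speed is `r mean(cars$speed)`.",
  "",
  paste0(fence, "{r echo=FALSE}"),
  "summary(cars$speed)",
  fence
)

# Write the document to a temporary folder
rmd_file <- file.path(tempdir(), "chunk_demo.Rmd")
writeLines(rmd_lines, rmd_file)

# Knit only when rmarkdown and pandoc are both installed
if (requireNamespace("rmarkdown", quietly = TRUE) && rmarkdown::pandoc_available()) {
  rmarkdown::render(rmd_file, quiet = TRUE)
}
```

Because the chunk uses echo=FALSE, the knitted page would show the summary table but not the summary( ) call that produced it.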

An R Markdown Example

Let’s say that New York City wants to track the square footage trend of enclosed versus unenclosed sidewalk cafes. They want a webpage that will show everyone the current distribution of square footage of enclosed vs. unenclosed sidewalk cafes, as well as some of the code that was used to perform the calculations. We’ll create the entire report using an R Markdown file.

Start off with the Markdown file we opened earlier. Delete everything except the title block, which you can modify to your liking (see Figure 3.18). Our text goes below the three dashes that end the title block. Let’s put in a quick line of introduction; for that we can use plain text. It may be a good idea to show when the report was generated, so we can use the inline R code call discussed above to add the date. The function date( ) returns the current date and time as a character string. Let’s add that to our introductory text. See below:

A markdown document in R.

Figure 3.18: A markdown document in R.


To generate your HTML, Word, or PDF file, you just click on the Knit button that you see in the above image. Below is the generated report. Notice the date of the execution has been placed in the report. The inline code execution option can be particularly useful for placing summary statistics into the narrative description.

The HTML output of the R markdown file.

Figure 3.19: The HTML output of the R markdown file.
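As a small sketch of what the inline call produces, the code below shows that date( ) returns a single character string, along with format(Sys.Date( )) as one alternative when you want control over the layout. The sizes vector is made up for illustration and stands in for a real summary statistic.

```r
# date() returns the current date and time as one character string
stamp <- date()
print(stamp)

# Sys.Date() with format() gives more control over how the date is written
print(format(Sys.Date(), "%B %d, %Y"))

# The same idea places a summary statistic inside a sentence.
# The sizes below are made up for illustration.
sizes <- c(120, 260, 400)
sentence <- paste("The mean licensed area is", mean(sizes), "square feet.")
print(sentence)
```

In an R Markdown file the date( ) or mean( ) call would sit inside an inline code span, so the sentence updates each time the report is knitted.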


The next step is to import the sidewalk cafe data. Since this is a trivial step for this report we will not include or “echo” this code into the report.

You can begin by inserting an R code block or chunk into the Markdown file, using the syntax given in the previous section.

You can also insert a chunk with the Insert button to save some typing.

Inside the code chunk, type the command to import the data. You can test your R code execution from the drop-down menu on the Chunks button. You can see the results of executing the code chunk in Figure 3.20. The import command was executed, and the sidewalk data frame is now created.

A code chunk highlighted in RStudio.

Figure 3.20: A code chunk highlighted in RStudio.


Now we have to process the data. Specifically, we want to split the sidewalk cafe data set into two sets, representing the enclosed and unenclosed data. This code is a little more complex, so perhaps we want to add a line or two of explanation before this step. You can see the narrative for the report, and the second code chunk in the screenshot of the Markdown file below.

And below is the additional portion of the report. Remember, we echoed back the code this time. Since subsetting the data frame doesn’t produce any visual output, there is no evaluation of the code to add to the report, but the code is executed nonetheless.


The data was split into two data frames containing the enclosed and unenclosed data using the following code.

enclosed <- subset(sidewalk, Sidewalk.Cafe.Type == 'Enclosed')

unenclosed <- subset(sidewalk, Sidewalk.Cafe.Type == 'Unenclosed')


The final step for this report is to produce the boxplot. Again, we will insert some descriptive narrative, and show the code used to produce the plot. However, since this time there is output from the execution of the R code, the visualization will also be added to the report. The final report, encompassing the entire Markdown file, is shown on the next page.
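The split and the plot can be tried without the real file. The sketch below uses a small made-up data frame that borrows two column names from the sidewalk cafe data; the five rows and their values are invented for illustration, and the plot layout is one reasonable choice rather than the exact chart in the report.

```r
# Toy stand-in for the sidewalk cafe data; the values are made up
sidewalk_toy <- data.frame(
  Sidewalk.Cafe.Type = c("Enclosed", "Unenclosed", "Unenclosed", "Enclosed", "Unenclosed"),
  Lic.Area.Sq.Ft = c(300, 150, 220, 410, 180)
)

# Split into two data frames by cafe type
enclosed_toy <- subset(sidewalk_toy, Sidewalk.Cafe.Type == "Enclosed")
unenclosed_toy <- subset(sidewalk_toy, Sidewalk.Cafe.Type == "Unenclosed")
print(nrow(enclosed_toy))
print(nrow(unenclosed_toy))

# Side-by-side boxplots of square footage for the two groups
boxplot(enclosed_toy$Lic.Area.Sq.Ft, unenclosed_toy$Lic.Area.Sq.Ft,
        names = c("Enclosed", "Unenclosed"),
        main = "NYC Sidewalk Cafe Size",
        ylab = "Licensed area (sq. ft.)")
```

Placed in a chunk with the default options, this code would appear in the report followed by the plot it draws.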


Take it one step further.

Try to replicate the additional report “Summary” output produced below.

Publishing to RPubs

You can publish your .R and .Rmd documents to RPubs. After you run Knit HTML, you will see an option to publish. Select Publish, and a dialog box will appear. Next, choose RPubs.

Then you will be presented with an option to login or create an account on RPubs.

After you create an account, provide your document with a title.

Then you will be provided with a URL you can share and update. The published document is below and available at: http://rpubs.com/sosulski/180484

This section introduced the knitr and R Markdown packages in R, and demonstrated some of the basic functionality. There is much more functionality than was demonstrated here. For example, Markdown can also easily add tables to reports, and there are many additional formatting options. There are other packages, Shiny, for instance, that can be used to add interactivity.

9. Summary

The objective of this session was to expand your use of R beyond typing commands at the console. You learned how to import and use other R packages. You imported an external dataset into R and extracted data from it. Finally, you created an R script and documented it using comments.

R Rules and syntax

  • R comes pre-loaded with many packages.

  • Some packages use the same names for functions as used in other packages. After installing new packages, load them only when necessary. Use detach( ) to unload a package you no longer need.

  • Setting up your working directory will help you organize your files and easily reference your files in R.

  • Many different file formats can be imported into R. Some formats include: CSV, Excel, SPSS, Minitab, TXT, and SAS.

  • Use data.frame( ) to cast your data set as a data frame after importing it in RStudio.

  • NA values cannot be used in computations. Be sure to remove or omit NA values from your data before performing computations.

  • R scripts are an efficient way to code in R.

  • Commenting your code with explanatory text can help you and others interpret your code.

  • Specific data can be selected from a data frame using the $ operator, the subset( ) function, or square brackets [ ].
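The package rules above can be sketched in a few lines. The tools package is used here only because it ships with R, so nothing needs to be installed; it is an arbitrary choice for illustration.

```r
# Attach a package that ships with R (tools is used only as an example)
library(tools)
print("package:tools" %in% search())

# find() reports which attached package a function comes from
print(find("file_ext"))

# Detach the package once it is no longer needed
detach("package:tools")
print("package:tools" %in% search())
```

Detaching removes the package from the search path, so its function names can no longer mask functions with the same name in other packages.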

R commands

  • rm(list=ls( )) removes all objects from your R environment
  • library( ) returns all of the packages installed in your version of R. When a package is included as a parameter, e.g. library(foreign), it loads that package for use
  • detach( ) unloads a package from use
  • find( ) function returns the package in which the named object, function, or operator is defined
  • read_csv( ) imports a CSV file into R
  • setwd( ) function sets the working directory in R
  • na.rm=TRUE parameter removes all NAs from the data before the function is computed.
  • sum( ) function returns the sum or total
  • mean( ) function returns the arithmetic mean
  • source( ) function runs an R script
  • print( ) is a function to explicitly print output to the console.
  • # is used for single line comments
  • $ operator is used to extract a variable from a data frame.
  • subset(x =, subset =, select =) is used for extracting certain rows and/or columns from a data frame
  • square brackets can be used for extracting specific rows and columns.
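Several of the commands above can be tried together on a small example. This is a sketch on a made-up data frame; the column names and values are invented for illustration.

```r
# Toy data frame with one missing value; the numbers are made up
cafes <- data.frame(
  type = c("Enclosed", "Unenclosed", "Unenclosed", "Enclosed"),
  area = c(300, 150, NA, 410)
)

# $ extracts one variable as a vector
print(cafes$area)

# NA propagates unless na.rm = TRUE is supplied
print(mean(cafes$area))
print(mean(cafes$area, na.rm = TRUE))
print(sum(cafes$area, na.rm = TRUE))

# subset() keeps rows that meet a condition and, optionally, chosen columns
large <- subset(cafes, subset = area > 200, select = c(type, area))
print(large)

# Square brackets select by [rows, columns]
print(cafes[1:2, "area"])
```

Note that subset( ) silently drops the row whose area is NA, because the comparison NA > 200 is neither TRUE nor FALSE.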

3.1 Exercise 3.1

We will be using basketball data from March Madness for this exercise.

Data

Download from: http://becomingvisual.com/rfundamentals/march_madness.csv

Data Dictionary

Variable      Description
Rank          Team Ranking
Previous      Previous Team Ranking
School        Name of the College or University
Conference    NCAA Conference (30 +)
Record        Overall Record
Neutral       Record with games in a neutral location
Home          Record with games at home
Non Div I     Record with non-division 1 games

Write an R script to do the following:
(Remember to add comments)

  1. set working directory -- Hint: setwd()

  2. import the csv file

  3. view the file

  4. print number of rows and columns -- Hint: dim()

  5. print column names

  6. change column names to lower case so they are easier to use -- Hint: names(df_name) <- tolower(names(df_name))

  7. explore the variable types. -- Hint: str()

  8. how many different conferences are there?

  9. Let’s look at the difference in values of first two columns:
     a. compute a new vector called "diff" and calculate the difference in rank and previous
     b. print count and list of schools that changed 3 or more places -- Hint: create subset that satisfies criteria
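Before working with the real file, the hinted functions can be tried on a toy table. The sketch below uses four made-up schools and rankings, not the March Madness data, and it assumes the difference is computed as previous minus rank (so a positive value means the team moved up); the exercise leaves that direction open.

```r
# Toy rankings table; schools and numbers are made up
teams <- data.frame(
  Rank = c(1, 2, 3, 4),
  Previous = c(1, 5, 2, 8),
  School = c("School A", "School B", "School C", "School D"),
  Conference = c("East", "West", "East", "North")
)

print(dim(teams))                        # number of rows and columns
names(teams) <- tolower(names(teams))    # lower-case column names
print(names(teams))
str(teams)                               # variable types
print(length(unique(teams$conference)))  # number of distinct conferences

# Difference between previous and current rank (assumed direction: previous - rank)
diff <- teams$previous - teams$rank

# Schools that moved three or more places in either direction
movers <- subset(teams, abs(diff) >= 3)
print(nrow(movers))
print(movers$school)
```

The same calls should carry over to the imported data frame once its column names are lower case.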

3.1.1 Code walkthrough

3.2 Exercise 3.2

  1. Import the GDP dataset and compute the difference in GDP between 2007 and 2017 for each country. Download from http://becomingvisual.com/rfundamentals/gdp.csv

Task:

  1. Create a subset of countries that saw an increase of over one trillion dollars.

Tips:

  • Load the readr package to use the read_csv() function
  • When selecting the column for each year, use double quotes (or back ticks) around the number. For example, use gdp$"2017".
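The second tip can be illustrated on a toy table. The countries and dollar values below are made up, and the sketch assumes the year columns are named "2007" and "2017" and hold GDP in US dollars; check the real file's column names and units after importing it.

```r
# Toy GDP table; countries and dollar values are made up.
# check.names = FALSE keeps the year column names as "2007" and "2017".
gdp_toy <- data.frame(
  Country = c("Country A", "Country B", "Country C"),
  "2007" = c(1.0e12, 3.5e12, 0.4e12),
  "2017" = c(2.5e12, 4.0e12, 0.6e12),
  check.names = FALSE
)

# Back ticks (or double quotes) are needed because the names start with a digit
gdp_toy$change <- gdp_toy$`2017` - gdp_toy$`2007`

# Countries whose GDP rose by more than one trillion dollars
big_gain <- subset(gdp_toy, change > 1e12)
print(big_gain$Country)
```

Without the back ticks, R would read 2017 as a number rather than a column name and report a syntax error.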

3.2.1 Code Walkthrough

3.3 Assignment 3

The exercise described below will prepare you to work with external datasets in R. We will be working with the data on the usage of stimulus/recovery funds provided through the American Recovery and Reinvestment Act of 2009 (ARRA) from NYC Open Data. Please complete this exercise prior to moving on to the next lesson. Follow the instructions below. Submit a Word version or PDF of your RMarkdown (with input and output shown) for this assignment.

  1. Create a new R Markdown file in RStudio named fed_stimulus.Rmd. Include your name and the date.
  2. Go to NYC Open Data and export the Federal Stimulus dataset as a CSV file from https://data.cityofnewyork.us/Business/Federal-Stimulus-Data/ivix-m77e
  3. Review the details of the variables included in the dataset by selecting the manage button on the NYC Open Data site for the Federal Stimulus data.
  4. Move the Federal_Stimulus_Data.csv file to your mydata folder (or working directory) on your desktop.
  5. Write the code in your R Markdown file to import Federal_Stimulus_Data.csv. Change the name of the data frame from Federal_Stimulus_Data to fed_stimulus.
  6. Compute the sum and mean for the payment value column.
  7. Create a subset of your data that returns the projects whose project status is 50% or more complete. Do not include fully completed projects.
  8. Review your R Script and add appropriate explanatory comments and notes.
  9. Create a knitr report. Either knit to Word or PDF. This is what you will need to submit. Check that you include both the input and output for each item above.
