6 Working with Messy Data

Real world data is messy. Very often the data file that you start out with doesn’t have the variables stored in the right format for the analysis you want to do. This can range from the way they are named to how they are coded. At times there might be some missing values in your dataset. We learned how to remove missing values or NAs from our data in session 3. In addition, there maybe a time where you only want to analyze a subset of your data. We learned how to subset a dataset in session 3. In summary, when working with real world data you’ll have to do some data manipulation to get it in the format that you need it. In this session we’ll work with some messy data to learn how to format and properly code variables, create simple frequency tables, and review a few graphing techniques to help us better analyze our data.

Session Outline

  1. Messy Data
  2. Renaming Columns (Variable Names)
  3. Attaching / Detaching
  4. Tabulating Data: Constructing Simple Frequency Tables
  5. Ordering Factor Variables
  6. Summary

Watch introductory video

1. Messy Data

Let’s look at a survey questionnaire administered to undergraduate students aimed to understand their preferences for learning specific technologies as part of their education.

View the survey here: https://drive.google.com/file/d/0B3q79e49m3riTGFwU2h6S01NSEk/edit?usp=sharing

Surveys with close-ended questions typically yield nice and neat datasets of results. However, this short survey produced a pretty complex and messy dataset that makes it difficult to work with.

Let’s take a closer look.

Download the dataset of results from: http://becomingvisual.com/rfundamentals/undergrad.csv

Then, import the undergrad.csv dataset into RStudio. As described in session 3, simply select Import Dataset > From CSV File … from the workspace pane.

Then be sure to convert the data to a data frame

undergrad <-data.frame(undergrad)

Next, view the dataset.

View(undergrad)

Carefully review the data. You’ll notice that is has 39 observations and 11 variables. Take note of what you see as messy with this data. Lots of text? The long column names? The missing data? Also, review the columns that correspond to the responses for questions 1 through 4. The questions require a response on a 7-point Likert scale. However, for question 1 the responses are in text form whereas in questions 2, 3, and 4 are numeric. As we saw in the previous lessons, we need our data in a numeric format of sorts to perform any calculations.

2. Renaming columns (variable names)

As you view the data, you probably noticed that the column names are very long, see Figure 6.1.

Imagine having to type The.following.tool.are.important.to.my.future.career… as a variable name!

Viewing the undergrad.csv dataset. Note the long variable names.

Figure 6.1: Viewing the undergrad.csv dataset. Note the long variable names.


It’s difficult to work with a dataset that has long names for variables. Let’s begin with recoding the 11 variable names. We can use the names( ) function to rename each of the columns in undergrad. Just be sure the variable name you type in corresponds to the correct variable. The names need to be written in the order as the columns appear in the data frame.

Create an RScript. Begin by creating an RScript for this lesson. This will help you keep track of your work and repeat commands as necessary. Note: The examples in this lesson will still include the > to indicate an R prompt.

  1. Go to File > New > RScript.
  2. Next, save your RScript as undergrad.R
  3. Rename the undergrad data frame using the names( ) function as described above.
#renaming columns in the undergrad data frame
names(undergrad) <- c("timestamp","excel","access", "statistics", "programming", "iscourse", "cscourse", "topics", "istopics", "onlinecourse", "concentration")
  1. Run your undergrad.R script.
  2. Now, view the undergrad data frame to check to see if the column names were changed.
#viewing the undergrad data frame
View(undergrad)

You should see your columns renamed (see Figure 6.2).

Columns of the undergrad dataset renamed.

Figure 6.2: Columns of the undergrad dataset renamed.


3. Attaching / Detaching

You can make your data manipulation and analysis of the undergrad data frame a little more efficient by using the attach( ) function. The attach( ) function enables you to work with a data frame variables directly without explicitly referencing the name of the data frame. It saves the reference to the data frame in your R path. For example, to view the data in the excel variable of the undergrad data frame you would normally have to enter:

undergrad$excel
##  [1] "Agree"          "Strongly Agree" "Strongly Agree" "Strongly Agree"
##  [5] "Agree"          "Agree"          "Somewhat agree" "Strongly Agree"
##  [9] "Strongly Agree" "Strongly Agree" "Strongly Agree" "Strongly Agree"
## [13] "Strongly Agree" "Somewhat agree" "Strongly Agree" "Strongly Agree"
## [17] "Somewhat agree" "Agree"          "Strongly Agree" "Strongly Agree"
## [21] "Agree"          "Strongly Agree" "Strongly Agree" "Strongly Agree"
## [25] "Agree"          "Agree"          "Strongly Agree" "Strongly Agree"
## [29] "Strongly Agree" "Strongly Agree" "Agree"          "Strongly Agree"
## [33] "Strongly Agree" "Strongly Agree" "Agree"          "Strongly Agree"
## [37] "Strongly Agree" "Strongly Agree" "Strongly Agree"

In the above example to see all the values for the variable excel you would have to type data frame$variable (e.g. undergrad$excel). Instead you can use the attach( ) function and reference any variable in the undergrad data frame directly.

Try using the attach( ) function.

#adding undergrad to my R path for easy access and reference of variables
attach(undergrad)

#referencing the excel variable directly
excel
##  [1] "Agree"          "Strongly Agree" "Strongly Agree" "Strongly Agree"
##  [5] "Agree"          "Agree"          "Somewhat agree" "Strongly Agree"
##  [9] "Strongly Agree" "Strongly Agree" "Strongly Agree" "Strongly Agree"
## [13] "Strongly Agree" "Somewhat agree" "Strongly Agree" "Strongly Agree"
## [17] "Somewhat agree" "Agree"          "Strongly Agree" "Strongly Agree"
## [21] "Agree"          "Strongly Agree" "Strongly Agree" "Strongly Agree"
## [25] "Agree"          "Agree"          "Strongly Agree" "Strongly Agree"
## [29] "Strongly Agree" "Strongly Agree" "Agree"          "Strongly Agree"
## [33] "Strongly Agree" "Strongly Agree" "Agree"          "Strongly Agree"
## [37] "Strongly Agree" "Strongly Agree" "Strongly Agree"

Note: It’s important to use the detach( ) function when you finished working with a particular data frame. For example,

detach(undergrad) # don’t enter this in yet because we are still working with undergrad.

4. Tabulating data: Constructing simple frequency tables and histograms

A typical data analysis task is to construct a frequency table or cross-tabulation of one variable to another. Let’s begin by looking at two variables: iscourse and cscourse. These variables refer to the likeliness that the respondent will take another course in the field of Information Systems or Computer Science. We are selecting these two variables because, as we saw before, the responses to this question were coded as numbers.

To be sure we can check the data type of each variable using the class( ) function.

class(iscourse)
## [1] "integer"
class(cscourse)
## [1] "integer"

These variables are of type “integer”. This is a numeric data type that only contains positive or negative whole numbers. Integers can never contain decimals.

Frequency counts.

A simple frequency count of the number of respondents per response category will help us analyze the results. We can use the table( ) function to tabulate the results of each variable:

table(iscourse)
## iscourse
##  1  2  3  4  5  6  7 
##  6  3  1  8  5 10  6
table(cscourse)
## cscourse
##  1  2  3  4  5  6  7 
##  3  5  2  6  5  7 11

To interpret the results, look at the first row for each response type. This is a scale that includes the integers 1 through 7 which represents likelihood that respondents will take a course in either computer science or information systems. A response of 1 is extremely unlikely, whereas a response of 7 is extremely likely.

Visualizing the results.

In chapter 5 we learned how to create a visualization of a frequency table using the hist( ) function. Let’s try it.

hist(iscourse, breaks=7, main = "Frequencies students interested in another IS course", xlab = "IS course", col="#4cbea3", labels=TRUE, border="#FFFFFF")
Histogram of the likelihood of undergraduate students taking another information systems course.

Figure 6.3: Histogram of the likelihood of undergraduate students taking another information systems course.


hist(cscourse, breaks=7, main = "Frequencies students interested in another CS course", xlab = "CS course", col="#4cbea3", labels=TRUE, border="#FFFFFF")
Histogram of the likelihood of undergraduate students taking a course in computer science.

Figure 6.4: Histogram of the likelihood of undergraduate students taking a course in computer science.


Upon further analysis of the mean, we can see that the respondents report a slightly higher tendency to take a computer science course over an information systems course, but the differences are very small.

mean(iscourse)
## [1] 4.461538
mean(cscourse)
## [1] 4.794872
median(cscourse)
## [1] 5
median(iscourse)
## [1] 5

Here we can see that it’s relatively trivial to do this type of analysis on numeric data.

Let’s return to our survey and look at the first four questions from the survey questionnaire as displayed in Figure 6.5.

An excerpt from the survey questionnaire.

Figure 6.5: An excerpt from the survey questionnaire.


Suppose we want to understand if students’ attitude toward specific technologies was important to their career. Specifically we want to look at the variables excel, access, statistics, and programming in the undergrad dataset.

Let’s examine the second column of data, excel. The data represents the responses from 39 participants to the question:

The following tool is important to my career: Excel

Here you may want to simply know the frequency count of the number of respondents per response category:

Strongly disagree
Disagree
Somewhat disagree
Neither agree or disagree
Somewhat agree
Agree
Strongly agree

We can use the table( ) function to construct a simple frequency table as we did earlier.

table(excel)
## excel
##          Agree Somewhat agree Strongly Agree 
##              9              3             27

Did you notice that the output of the responses only fall within the categories of Agree, Somewhat agree, or Strongly agree? Also, note that the ordering of the table is in alphabetical order. However, it would be easier to interpret the results if we could see them in an order from Strongly Agree to Strongly Disagree.

The data is presented unordered. Check the data type using the class( ) function.

class(excel)
## [1] "character"

You’ll notice that excel is of type character. One way to order the data values correctly and have the ability to plot non-numeric data is to use the factor data type.

Refer back to session 2 for more details on factor variables and ordered factor variables.

Look above at the output of the table(excel) command. Did you notice that there were only three response types given, even though respondent of the survey had seven to select from? This is because there were no values for Strongly Disagree, Disagree, Somewhat Disagree, and Neutral. This means that every student who completed only selected Agree, Somewhat Agree, or Strongly Agree.

Let’s examine the remaining three variables (access, statistics, and programming) and construct frequency tables each of them.

table(access)
## access
##                     Agree                  Disagree 
##                         5                         5 
## Neither agree or disagree            Somewhat agree 
##                         9                        10 
##         Somewhat disagree            Strongly Agree 
##                         2                         8
table(statistics)
## statistics
##                     Agree                  Disagree 
##                        13                         1 
## Neither agree or disagree            Somewhat agree 
##                         2                         7 
##            Strongly Agree 
##                        15
table(programming)
## programming
##                     Agree                  Disagree 
##                        12                         1 
## Neither agree or disagree            Somewhat agree 
##                         5                         6 
##         Somewhat disagree            Strongly Agree 
##                         2                        13

Next, let’s visualize the access frequency table by creating a histogram.

hist(access)
Error in hist.default(access) : 'x' must be numeric

Oops. It looks like we cannot create a histogram with the data when it is a chaacter. To solve this, we can simply “cast” that variable to a different type. In this case, we want to change the data type of access to a factor and cast access using the as.numeric( ) function to plot our histogram.

#reassign access as a factor variable
access <-as.factor(access) 
# cast access as a numeric for plotting. 
figure06<- hist(as.numeric(access), main = "Responses to the level of importance of learning Microsoft Access", xlab = "Bins by reponse category", col="#4cbea3", labels=TRUE, border="#FFFFFF")
A histogram of the access variable in the data frame undergrad.

Figure 6.6: A histogram of the access variable in the data frame undergrad.

figure06
## $breaks
## [1] 1 2 3 4 5 6
## 
## $counts
## [1] 10  9 10  2  8
## 
## $density
## [1] 0.25641026 0.23076923 0.25641026 0.05128205 0.20512821
## 
## $mids
## [1] 1.5 2.5 3.5 4.5 5.5
## 
## $xname
## [1] "as.numeric(access)"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"

However, we see in Figure 6.6, that the histogram is not very useful in helping us interpret our data. We can clearly see that the values on the x-axis (our ordinal data) do not necessary match what we’d like to see based on our frequency table. Specifically, it’s unclear what the numbers 1-6 mean on the x-axis. Furthermore, it doesn’t seem as though they are in any specific order. For example, if you look at the first bin in the histogram there are 10 observations. If you recall from earlier, the only value that has 10 observations in the access frequency table is “Somewhat Agree”.


Let’s return back to the frequency table we created for access.

table(access)
## access
##                     Agree                  Disagree 
##                         5                         5 
## Neither agree or disagree            Somewhat agree 
##                         9                        10 
##         Somewhat disagree            Strongly Agree 
##                         2                         8

It’s difficult to see which values correspond to which categories (e.g. agree, disagree, etc.). The next section on ordering factor variables will help us work with our ordinal data in a more organized way.

5. Ordering factor variables

Let’s quickly look at the access variable:

access
##  [1] Neither agree or disagree Disagree                 
##  [3] Somewhat agree            Strongly Agree           
##  [5] Somewhat agree            Neither agree or disagree
##  [7] Disagree                  Somewhat agree           
##  [9] Neither agree or disagree Neither agree or disagree
## [11] Agree                     Disagree                 
## [13] Neither agree or disagree Neither agree or disagree
## [15] Strongly Agree            Neither agree or disagree
## [17] Somewhat agree            Somewhat agree           
## [19] Strongly Agree            Agree                    
## [21] Somewhat agree            Strongly Agree           
## [23] Somewhat agree            Somewhat agree           
## [25] Somewhat agree            Agree                    
## [27] Somewhat disagree         Strongly Agree           
## [29] Somewhat disagree         Disagree                 
## [31] Neither agree or disagree Strongly Agree           
## [33] Strongly Agree            Agree                    
## [35] Neither agree or disagree Somewhat agree           
## [37] Agree                     Disagree                 
## [39] Strongly Agree           
## 6 Levels: Agree Disagree Neither agree or disagree ... Strongly Agree

Note the levels. They are unordered. However, the access factor variable is an ordinal variable. This means that there is a relationship between the levels. For example, agree indicates less agreement than strongly agree.

This is a pretty common data problem in R when importing a dataset. To order a factor variable we can use the ordered( ) function and set the levels in the order we want them.

access_ordered <- ordered(x=access, levels=c("Strongly disagree","Disagree","Somewhat disagree","Neither agree or disagree","Somewhat agree","Agree","Strongly Agree"))

View the data and levels for the access variable.

access_ordered
##  [1] Neither agree or disagree Disagree                 
##  [3] Somewhat agree            Strongly Agree           
##  [5] Somewhat agree            Neither agree or disagree
##  [7] Disagree                  Somewhat agree           
##  [9] Neither agree or disagree Neither agree or disagree
## [11] Agree                     Disagree                 
## [13] Neither agree or disagree Neither agree or disagree
## [15] Strongly Agree            Neither agree or disagree
## [17] Somewhat agree            Somewhat agree           
## [19] Strongly Agree            Agree                    
## [21] Somewhat agree            Strongly Agree           
## [23] Somewhat agree            Somewhat agree           
## [25] Somewhat agree            Agree                    
## [27] Somewhat disagree         Strongly Agree           
## [29] Somewhat disagree         Disagree                 
## [31] Neither agree or disagree Strongly Agree           
## [33] Strongly Agree            Agree                    
## [35] Neither agree or disagree Somewhat agree           
## [37] Agree                     Disagree                 
## [39] Strongly Agree           
## 7 Levels: Strongly disagree < Disagree < ... < Strongly Agree

We can see that the levels represent degrees. For example, the value of Disagree is > (greater than) than the value for Strongly Disagree.

Another way to see the attributes of a variable is to use the attributes( ) function.

attributes(access_ordered)
## $levels
## [1] "Strongly disagree"         "Disagree"                 
## [3] "Somewhat disagree"         "Neither agree or disagree"
## [5] "Somewhat agree"            "Agree"                    
## [7] "Strongly Agree"           
## 
## $class
## [1] "ordered" "factor"

Now that access_ordered is in a logical order, let’s recreate our frequency table.

table(access_ordered)
## access_ordered
##         Strongly disagree                  Disagree 
##                         0                         5 
##         Somewhat disagree Neither agree or disagree 
##                         2                         9 
##            Somewhat agree                     Agree 
##                        10                         5 
##            Strongly Agree 
##                         8

Note: All response options now appear in the frequency table (i.e. Strongly disagree).

Next, create a simple histogram of our access_ordered variable, see Figure 6.7.

figure07 <- hist(as.numeric(access_ordered),breaks=7, main = "Responses to the level of importance of learning Microsoft Access", xlab = "Bins by reponse category", col="#4cbea3", labels=TRUE, border="#FFFFFF")
A histogram of the access variable in the data frame undergrad with the factor variables ordered.

Figure 6.7: A histogram of the access variable in the data frame undergrad with the factor variables ordered.

figure07
## $breaks
## [1] 2 3 4 5 6 7
## 
## $counts
## [1]  7  9 10  5  8
## 
## $density
## [1] 0.1794872 0.2307692 0.2564103 0.1282051 0.2051282
## 
## $mids
## [1] 2.5 3.5 4.5 5.5 6.5
## 
## $xname
## [1] "as.numeric(access_ordered)"
## 
## $equidist
## [1] TRUE
## 
## attr(,"class")
## [1] "histogram"

Let’s compare Figures 6.6 and 6.7 to one another:

par(mfrow = c(1, 2))

hist(as.numeric(access), main = "Figure 6", xlab = "Bins by reponse category", col="#4cbea3", labels=TRUE, border="#FFFFFF")

hist(as.numeric(access_ordered),breaks=7, main = "Figure 7", xlab = "Bins by reponse category", col="#4cbea3", labels=TRUE, border="#FFFFFF")


Finally, you can compute the mean of the access_ordered factor variable by using the mean( ) and as.numeric functions.

mean(as.numeric(access_ordered))
## [1] 4.820513

6. Summary

R Commands and Syntax

  • names( ) sets the name of an object. We used to rename columns in a data frame
  • attach( ) enables a data frame (or any object) to be added to R’s search path for the easy reference of variables without referencing the data frame.
  • detach( ) removes the data frame from R’s search path.
  • table( ) function provides a way to create a frequency table
  • attributes( ) function returns the features of a variable and type.
  • factor( ) a variable type used for categorical data.
  • ordered( ) a function for factors to order factor levels.
  • histogram( ) draws a histogram, numeric data required.
  • as.numeric( ) casts a non-numeric variable into numeric data.

6.1 Exercise 6.1

Create an RMarkdown document to complete the following:

  1. Getting to know the data
  1. Import the data (http://becomingvisual.com/rfundamentals/summer_winter_olympics.csv)
  2. View the data
  3. Look at column names
  4. Look at dimension of data (rows and columns)
  1. Dealing with Data
  1. Look at the column names and change names to more meaningful names.
  2. The data represent, in order:
    1. country
    2. number of summer games played, gold, silver, bronze, total,
    3. number of winter games played, gold, silver, bronze and total, total
    4. total (Winter + Summer) games, gold, silver, bronze, total
  1. Summary
  1. use table() to find frequency of total summer games played
  2. explore the data with other variables
  1. Graphs
  1. do histogram of summer games (total)
  2. do histogram of winter games (total)
  3. put above two histograms on one page
  4. do two histograms on one page: total summer, total winter medals won
  5. is there a correlation between number of medals given out in winter and summer? (do plot)
  6. how about number of games each country competes in. Is there correlation between winter and summer?
  7. look at distribution of each of the types of medals, by season (6 histograms on one page)
  8. redo g with different number of bins (10 instead of 20)
  9. explore data on your own

6.1.1 Code Walkthrough

6.2 Exercise 6.2

Create an R Markdown document to complete the Task below:

Merge the columns for the year 2016 for GDP, Life Expectancy, and Employment into a new data frame and clean-up the new table.

You can find the data here:

Task:

  1. Rename the appropriate columns to “country”, “gdp”, “life_expectancy”, and “employment”.

  2. Convert the employment number to percentages by dividing by 100

  3. Then round life expectancy to zero decimals and employment to two decimals

  4. Create a frequency table for each variable

  5. Draw histograms for each variable

6.2.1 Code Walkthrough

6.3 Assignment 6

Use the undersgraduate survey data from http://becomingvisual.com/rfundamentals/undergrad.csv to create ordered factor variables for the excel, statistics and programming variables. In a R Markdown draw histograms for your new ordered factor variables.