Reshaping and Joining Data Frames


In this lesson, you will learn to:


Time Estimates:
     Videos: 20 min
     Readings: 20-60 min
     Activities: 60 min
     Check-ins: 3



Reshaping Data


Decomposition


Required Reading: Computational Thinking


Please read the Decomposition and Hidden Assumptions sections only - stop when you reach the beginning of the Turn it into a recipe (let’s make an algorithm) section.

Tidy Data and Reshaping


Required Video: Reshaping Data (Pivoting)




Recommended Reading: R4DS 12.1-12.3: Tidy Data



Optional Reading: R4DS 12.4-12.5: Separate/Unite and Missing Values


(A few more tricks for data cleaning/wrangling, if you’re interested.)



Check-In 1: Pivoting


Question 1: Create a new dataset called cereals_3, that has three columns:

  • The name of the cereal

  • A column called “Nutrient” with values protein, fat, or fiber.

  • A column called “Amount” with the corresponding amount of the nutrient.

Question 2: Why didn’t we have to add a rowid to pivot wider in this case?


Check-In 2: Decomposition


Cereals in this dataset are placed on shelf 1, 2, or 3. We would like to know if these cereal placements correspond to different nutritional values; for example, perhaps sugary cereals made for children are on a lower shelf.

Create a new dataset called cereals_4, that has four columns:

  • The name of the manufacturer

  • The mean amount of sugar in cereals on shelf 1.

  • The mean amount of sugar in cereals on shelf 2.

  • The mean amount of sugar in cereals on shelf 3.


Joining data



Required Reading: Mutating Joins



Recommended Reading: R4DS Chapter 13: Relational Data




Check-In 3: Practice with Joins


The following code creates three datasets:

data1 <- data.frame(
  professor = c("Bodwin", "Glanz", "Carlton", "Sun"),
  undergrad_school = c("Harvard", "Cal Poly", "Cal Poly", "Harvard"),
  grad_school = c("UNC", "Boston University", "UCLA", "Stanford")
)

data2 <- data.frame(
  professor = c("Bodwin", "Glanz", "Carlton"),
  Stat_331 = c(TRUE, TRUE, TRUE),
  Stat_330 = c(FALSE, TRUE, TRUE),
  Stat_431 = c(TRUE, TRUE, FALSE)
)

data3 <- data.frame(
  course = c("Stat_331", "Stat_330", "Stat_431"),
  num_sections = c(8, 3, 1)
)

Here is what they look like once created:

##   professor undergrad_school       grad_school
## 1    Bodwin          Harvard               UNC
## 2     Glanz         Cal Poly Boston University
## 3   Carlton         Cal Poly              UCLA
## 4       Sun          Harvard          Stanford
##   professor Stat_331 Stat_330 Stat_431
## 1    Bodwin     TRUE    FALSE     TRUE
## 2     Glanz     TRUE     TRUE     TRUE
## 3   Carlton     TRUE     TRUE    FALSE
##     course num_sections
## 1 Stat_331            8
## 2 Stat_330            3
## 3 Stat_431            1

These datasets contain information about four Cal Poly professors, their educational history, the classes they are able to teach, and the number of sections of each class that need to be assigned.

We’d like to use all three datasets to figure out the best assignment of courses.

Question 1: Warm-up

  1. Combine datasets 1 and 2 to make this:
## Joining, by = "professor"
##   professor undergrad_school       grad_school Stat_331 Stat_330 Stat_431
## 1    Bodwin          Harvard               UNC     TRUE    FALSE     TRUE
## 2     Glanz         Cal Poly Boston University     TRUE     TRUE     TRUE
## 3   Carlton         Cal Poly              UCLA     TRUE     TRUE    FALSE
## 4       Sun          Harvard          Stanford       NA       NA       NA
  1. Combine datasets 1 and 2 to make this:
## Joining, by = "professor"
##   professor undergrad_school       grad_school Stat_331 Stat_330 Stat_431
## 1    Bodwin          Harvard               UNC     TRUE    FALSE     TRUE
## 2     Glanz         Cal Poly Boston University     TRUE     TRUE     TRUE
## 3   Carlton         Cal Poly              UCLA     TRUE     TRUE    FALSE
  1. Combine datasets 2 and 3 to make this:
## Joining, by = "course"
## # A tibble: 9 x 4
##   professor course   can_teach num_sections
##   <chr>     <chr>    <lgl>            <dbl>
## 1 Bodwin    Stat_331 TRUE                 8
## 2 Bodwin    Stat_330 FALSE                3
## 3 Bodwin    Stat_431 TRUE                 1
## 4 Glanz     Stat_331 TRUE                 8
## 5 Glanz     Stat_330 TRUE                 3
## 6 Glanz     Stat_431 TRUE                 1
## 7 Carlton   Stat_331 TRUE                 8
## 8 Carlton   Stat_330 TRUE                 3
## 9 Carlton   Stat_431 FALSE                1

Question 2: Solve the puzzle

We would like to assign teachers for next year, with the following guidelines:

  1. Each professor should be given 4 sections.

  2. At least half of all 331 sections should be taught by Cal Poly graduates.

  3. More than half of all 330 sections should be taught by UCLA graduates.

  4. Each professor should teach at least 2 different classes.

Combine the three datasets into a single data frame that helps you find the answer to the question:

How many sections of Stat 331 will Dr. Glanz teach?