Lab 4: Cereals

The Data

This exam’s dataset concerns nutrition information about breakfast cereals. Documentation for the dataset can be found here: https://www.kaggle.com/crawford/80-cereals/version/2

You may download the dataset at this link, or on the course website.

Part One: Summaries and Plots

(Midterm Review)

Eat your Wheaties!

Most cereals are made from either wheat or oats. We would like to explore the health and taste benefits of these base ingredients.

Task 1

Make a new variable called Seed_Type. This variable should have the possible values wheat, oat, bran, and unknown, depending on if the words “wheat”, “oat”, or “bran” appear in the name of the cereal.

Special Cases: “Wheat ‘n’ Bran” should count as wheat; “Oat Bran” should count as oat.

Task 2

Suppose you are hired by a wheat farming group to convince the world that wheat cereals are better in some way. Make a convincing plot for the superiority of wheat-based cereal. Provide a one-sentence conclusion from your plot.

Task 3

Repeat Task 2, this time assuming you are hired by an oat farming collective, so that your plot should imply that oat cereals are in some way better. Provide a one-sentence interpretation of your plot.

Momma Bodwin’s Rules

Like all children, I loved sugary cereal, especially Lucky Charms. My mother was not pleased about how much cereal my brother and I were eating, so she made a rule: We were only allowed to eat cereal with less than six grams of sugar per cup.

Task 4

Under my mother’s rules, how many cereals in this dataset would we be allowed to eat?

Task 5

The tyrannical rule was defeated when my brother pointed out that my mother’s favorite cereal, Raisin Bran is in fact less healthy than Lucky Charms. Make a plot illustrating this discovery. Your plot should compare calories, fat, sodium, and sugars between the two cereals. These four nutrients should be plotted in separate facets, not as four different plots.

Part Two: Adjust the Data (Function Writing)

Recall from lecture that the nutritional measurements in this dataset (calories, protein, fat, sodium, fiber, carbo, sugars, and potass) are measured per serving. However, not all cereals have the same serving size! In order to fairly compare them, we need to adjust our data first.

This dataset includes a variable called cups, which gives the number of cups in a “single serving” of the cereal. It also includes a variable called weight, which gives the weight in ounces of a single serving.

Task 1

Create a function called adjust_cereal to fix this problem. This function should take in as input:

A vector containing the measurements of a particular nutrient for all the cereals.
A vector containing the cups per serving for all the cereals.
A vector containing the weight per serving for all the cereals.
A string specifying “volume” or “weight”

If the user inputs the string “volume”, the function should adjust the measurements to be per cup instead of per serving.

If the user inputs the string “weight”, the function should adjust the measurements to be per ounce instead of per serving.

You may check your function by comparing your output to the ones below:

test_sugars <- c(100, 200)
test_cups <- c(0.75, 1.25)
test_weights <- c(1.3, 1.8)

adjust_cereal(test_sugars, test_cups, test_weights, "volume")

## [1] 133.3333 160.0000

adjust_cereal(test_sugars, test_cups, test_weights, "weight")

## [1]  76.92308 111.11111

Task 2

Use your function to update the dataset, such that the 8 nutrients (calories, protein, fat, sodium, fiber, carbo, sugars, and potass) are adjusted for the number of cups in a serving (i.e., by volume).

Challenge: What’s in a name? (Regular Expressions)

Come up with a visualization or summary with an interesting insight into cereal names. For example, are there any words or phrases that are associated with higher sugars? With higher ratings? With different shelf placement?

For full credit, make use of regular expressions in your analysis.