Lab: Baby Names

The Data

Our dataset in this lab concerns baby names and their popularity over time. The dataset babies_ca has information about baby names in California from 1940 to 2016. For each year, and for each name given to at least 50 recorded babies, we are given the number of babies with that name.

At this link, you can find the names for ALL 50 states, in separate datasets organized by first letter.

Is my name not cool any more?

Let’s take a look at how the name “Kelly” has changed over time.

  1. Make a plot showing the number of babies named “Kelly” born each year in CA.
    Color the plot to show the gender assigned at birth to these babies.

  2. I was born in 1989. Let’s look at only the time frame since I was named. Narrow down the dataset to only 1989 onward.

  3. Create a linear model with the year as the explanatory variable, and the number of Kellys as the response. Summarize the results, and plot the model.

  4. Plot the residuals: that is, the actual values minus the predicted values.
    (The function add_predictions in the modelr package may be useful to you.)

Comment on the residuals - do you see any patterns?

  5. Now include Gender in the model. Plot the model and the residuals, and comment on any patterns in the residuals.

  6. What do you conclude from this model?
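
If you are unsure how the modeling steps fit together, here is a minimal sketch in base R. It assumes columns named Year, Gender, and Count, which may not match the actual column names in babies_ca, and it uses a small invented data frame (toy) rather than the real data. It fits both models and computes the residuals by hand with predict; add_predictions from modelr attaches the same predicted values as a new column.

```r
# Toy data, invented for illustration only; NOT taken from babies_ca.
# The column names Year, Gender, and Count are assumptions.
toy <- data.frame(
  Year   = rep(2000:2005, times = 2),
  Gender = rep(c("F", "M"), each = 6),
  Count  = c(120, 110, 104, 95, 88, 80, 30, 29, 27, 27, 25, 24)
)

# Model 1: year as the only explanatory variable
fit_year <- lm(Count ~ Year, data = toy)

# Model 2: year plus gender
fit_both <- lm(Count ~ Year + Gender, data = toy)

# Residual = actual value minus predicted value
toy$pred_year  <- predict(fit_year, newdata = toy)
toy$resid_year <- toy$Count - toy$pred_year
toy$resid_both <- toy$Count - predict(fit_both, newdata = toy)

# Coefficient table, R-squared, and residual standard error
summary(fit_both)

# Residuals from Model 1 against year, colored by gender
plot(toy$Year, toy$resid_year,
     col = ifelse(toy$Gender == "F", "red", "blue"),
     xlab = "Year", ylab = "Residual (actual - predicted)")
abline(h = 0, lty = 2)  # residuals should scatter around this line
```

In the toy data the two genders sit at very different levels, so the residuals from the year-only model split into two separate bands. That kind of pattern is a hint that the model is missing a variable.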

Spelling by state

I used to hate it when people would spell my name as “Kelli” or “Kelley”. But I don’t have it as bad as my good friend Allan.

  1. Narrow the California dataset down to only male-assigned babies named “Allan”, “Alan”, or “Allen”. Make a plot comparing the popularity of these names over time.

  2. In California, Allan’s spelling of his name is the least common of the three - but perhaps it’s not such an unusual name for his home state of Pennsylvania. Compute the total number of babies born with each spelling of “Allan” in 2000, in Pennsylvania and in California.

  3. Convert your total counts to overall percentages. That is, what was the percentage breakdown among the three spellings in CA? What about in PA?

  4. Perform a chi-square test on these data to determine whether CA and PA have different distributions of spellings of “Allan”.
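
As a rough guide to the last two steps, here is a sketch in base R. The counts below are made up for illustration and are not the real 2000 totals; the matrix name spellings is also just a placeholder. Once you have your own totals, you would put them in the same state-by-spelling layout.

```r
# Invented counts for illustration only; NOT the real 2000 totals.
spellings <- matrix(
  c(40, 120, 90,    # hypothetical CA counts
    25,  45, 60),   # hypothetical PA counts
  nrow = 2, byrow = TRUE,
  dimnames = list(State    = c("CA", "PA"),
                  Spelling = c("Allan", "Alan", "Allen"))
)

# Percentage breakdown within each state (each row sums to 100)
round(100 * prop.table(spellings, margin = 1), 1)

# Chi-square test of homogeneity.
# Null hypothesis: both states share the same distribution of spellings.
test <- chisq.test(spellings)
test

# Expected counts under the null; the approximation is shaky if these are small
test$expected
```

With two states and three spellings the test has (2 - 1) x (3 - 1) = 2 degrees of freedom. A small p-value is evidence that the spelling distributions differ between the two states.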


Perform an analysis of your own, using any names you choose! Your analysis must include a plot, a hypothesis test or regression model, and accompanying clear discussion.