Lab: Bootstrapping

This lab uses data on the salaries of San Francisco city employees. The full dataset and its documentation can be found at https://www.kaggle.com/kaggle/sf-salaries. We will use a smaller version of the dataset for this lab; it is available as SF_Salaries_sub.csv on the course website.

The data

  1. Import the data, and narrow it down to only the 2014 observations. Hints:
  • If you are using read.csv(), you will want to use the option header = TRUE

  • Missing data in this dataset is labeled "Not Provided" or "Not provided", or is left blank. You will want to replace these values with NAs. Use the na.strings argument of read.csv() or the na argument of read_csv().
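
As a rough illustration of the hints above, here is a minimal sketch of the import step. It reads a small inline table instead of SF_Salaries_sub.csv so that it runs on its own. The column names Year and TotalPay are assumptions about the file's layout; check names() on your own data frame before relying on them.

```r
# Toy stand-in for SF_Salaries_sub.csv; the column names are assumed.
csv_text <- "EmployeeName,TotalPay,Year
A,50000,2013
B,Not Provided,2014
C,72000.50,2014
D,Not provided,2014
E,,2014
F,81000,2014"

# Treat every missing-value label as NA so TotalPay is read as numeric.
salaries <- read.csv(
  text = csv_text,
  header = TRUE,
  na.strings = c("Not Provided", "Not provided", "")
)

# Keep only the 2014 rows.
salaries_2014 <- salaries[salaries$Year == 2014, ]

str(salaries_2014)
```

With the real file, the text = csv_text argument would be replaced by the path to SF_Salaries_sub.csv.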

  1. Our variable of interest for this lab is going to be Total Pay. Plot a histogram of Total Pay with an overlaid density curve. (Refer to the previous lab if you don’t remember how to do this!) Comment briefly on the shape, center, and spread.

  2. Suppose we’re interested in making inferences about the typical salary (Total Pay) of all San Francisco city employees in 2014, and that this dataset is a representative sample. Is the mean a good statistic to use here to describe the typical value of salary? Why or why not?

  3. Recall that a one-sample t-test requires that the sample mean is approximately Normally distributed. Does this assumption seem reasonable for the mean Total Pay? Why or why not?

  4. Find a 95% confidence interval for the mean Total Pay using the t distribution.
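
For reference, a minimal sketch of a 95% t-interval for a mean, computed both by hand and with t.test(). The vector pay is simulated, right-skewed toy data standing in for Total Pay; it is an assumption for illustration and not the real salaries.

```r
set.seed(1)
# Toy right-skewed sample standing in for Total Pay (not the real data).
pay <- rexp(200, rate = 1 / 60000)

n <- length(pay)
xbar <- mean(pay)
se <- sd(pay) / sqrt(n)

# 95% t-interval: sample mean +/- t* times the standard error,
# where t* is the 97.5th percentile of a t distribution with n - 1 df.
t_star <- qt(0.975, df = n - 1)
t_ci <- c(xbar - t_star * se, xbar + t_star * se)

# t.test() reports the same interval.
builtin_ci <- as.numeric(t.test(pay, conf.level = 0.95)$conf.int)

print(t_ci)
print(builtin_ci)
```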

Bootstrapping

  1. Use the bootstrap procedure to construct a 95% bootstrap confidence interval for the mean Total Pay. Compare this interval to the t-interval you found in the previous section.

  2. Since the distribution of Total Pay is so skewed, there may be other statistics that are better at describing the typical salary. Write your own function for calculating each of the following two statistics. (You may have to Google what they mean! Feel free to use relevant code from previous labs.)

  • Midhinge

  • Trimmed Mean (this should take two arguments: the data vector and the percent to trim)
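
One possible sketch of the two statistics, under stated assumptions. The midhinge is taken here as the average of the first and third quartiles, using quantile()'s default definition. The trimmed mean assumes that the trim argument is a proportion removed from each end of the sorted data, which matches the convention of base R's mean(x, trim = ...); if the lab intends a different convention, adjust accordingly. The function names midhinge and trimmed_mean are illustrative.

```r
# Midhinge: the average of the first and third quartiles.
# quantile() uses its default (type 7) definition here; other quartile
# conventions can give slightly different values.
midhinge <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  mean(q)
}

# Trimmed mean: drop the lowest and highest `trim` fraction of the sorted
# values, then average what is left. `trim` is a proportion in [0, 0.5).
trimmed_mean <- function(x, trim) {
  stopifnot(trim >= 0, trim < 0.5)
  x <- sort(x[!is.na(x)])
  n <- length(x)
  k <- floor(n * trim)  # number of values cut from each end
  mean(x[(k + 1):(n - k)])
}

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 100)
midhinge(x)            # quartiles are 3 and 7, so (3 + 7) / 2 = 5
trimmed_mean(x, 0.25)  # drops 1, 2, 8, 100; mean of 3, 4, 5, 6, 7 = 5
mean(x)                # the ordinary mean is pulled up by the outlier
```

Both robust statistics ignore the single extreme value in this toy vector, while the ordinary mean does not.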

  1. Perform the bootstrap procedure to make 95% confidence intervals for the following statistics for Total Pay:
  • Midhinge

  • 5% Trimmed Mean

  • 10% Trimmed Mean

  • 25% Trimmed Mean

  • Median
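
The bootstrap procedure for a list of statistics like this one can be sketched with a single helper that accepts the statistic as an argument. This is a minimal percentile-interval sketch, not necessarily the method used in class. The name boot_ci, the default of 2000 resamples, and the simulated pay vector are all assumptions made for illustration.

```r
# Percentile bootstrap interval for any statistic `stat`.
# `B` (number of resamples) and `level` are illustrative defaults.
boot_ci <- function(x, stat, B = 2000, level = 0.95, ...) {
  x <- x[!is.na(x)]
  boot_stats <- numeric(B)
  for (b in seq_len(B)) {
    # Resample n values with replacement and recompute the statistic.
    resample <- sample(x, size = length(x), replace = TRUE)
    boot_stats[b] <- stat(resample, ...)
  }
  alpha <- 1 - level
  ci <- quantile(boot_stats, probs = c(alpha / 2, 1 - alpha / 2),
                 names = FALSE)
  list(estimate = stat(x, ...), ci = ci, boot_stats = boot_stats)
}

set.seed(2014)
pay <- rexp(300, rate = 1 / 60000)  # toy skewed data, not the real salaries

boot_ci(pay, mean)$ci
boot_ci(pay, median)$ci
boot_ci(pay, mean, trim = 0.10)$ci  # 10% trimmed mean via base R's `trim`
```

Any function that takes a numeric vector and returns a single number, including one you write yourself, can be passed as stat.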

  1. For each of the statistics in the previous question, plot a histogram of the bootstrapped values and mark the 95% confidence interval cutoffs.

  2. Which of these statistics do you think is the best statistic to describe the typical salary? Why? (There is no single right answer to this question. Think about what each statistic is measuring, and decide whether that makes sense for this data.)
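
A histogram of bootstrapped values with interval cutoffs can be drawn along the following lines. This sketch uses base R graphics and the median as the example statistic; the simulated pay vector is an assumption standing in for Total Pay, and the previous lab's plotting approach may differ.

```r
set.seed(42)
pay <- rexp(300, rate = 1 / 60000)  # toy skewed data, not the real salaries

# Bootstrap distribution of the median: resample with replacement,
# recompute the median, and repeat 2000 times.
boot_medians <- replicate(2000, median(sample(pay, replace = TRUE)))

# 95% percentile cutoffs.
ci <- quantile(boot_medians, probs = c(0.025, 0.975))

hist(boot_medians, breaks = 30,
     main = "Bootstrap distribution of the median",
     xlab = "Median of resample")
abline(v = ci, col = "red", lty = 2, lwd = 2)  # mark the 95% cutoffs
```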

Challenge

The variable OvertimePay gives the amount of Overtime Pay earned by the worker in the year.

We would like to know what percent of workers in San Francisco earned some amount of overtime pay in 2014.

Use a bootstrapping approach to create a reasonable estimate for this percentage.
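
One way to approach this, sketched on simulated data: turn OvertimePay into a 0/1 indicator of whether any overtime was earned, then bootstrap the mean of that indicator, which is the sample proportion. The overtime vector below is made up for illustration, and dropping workers with a missing OvertimePay is an assumption you should justify or change for the real data.

```r
set.seed(7)
# Toy OvertimePay column: 120 workers with none, 80 with some, 2 missing.
overtime <- c(rep(0, 120), round(rexp(80, rate = 1 / 5000), 2) + 1, NA, NA)

# 1 if the worker earned any overtime, 0 otherwise; drop missing values.
earned_ot <- as.numeric(overtime[!is.na(overtime)] > 0)

# Sample proportion of workers with overtime.
p_hat <- mean(earned_ot)

# Bootstrap the proportion and take a 95% percentile interval.
boot_props <- replicate(5000, mean(sample(earned_ot, replace = TRUE)))
ci <- quantile(boot_props, probs = c(0.025, 0.975))

100 * p_hat  # point estimate, as a percent
100 * ci     # interval, as percents
```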