Regular Expressions


In this lesson, you will learn to:


Time Estimates:
     Videos: 20 min
     Readings: 0-20 min
     Activities: 90 min
     Check-ins: 4



Regular Expressions


Required Video: Regular Expressions




Recommended Reading: R4DS: String Matching with Regular Expressions




Check-In 1: A few more symbols


Question 1:

Recall that the regular expression [abc] matches the characters a, b, or c.
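
As a quick refresher on character classes, here is a minimal sketch using stringr. The vector `words` is made-up example data, not part of the check-in.

```r
library(stringr)

# Hypothetical example words, not part of the check-in
words <- c("cab", "dog", "back")

# [abc] matches one character that is a, b, or c;
# str_extract() returns the first such character in each string (NA if none)
first_match <- str_extract(words, "[abc]")    # "c" NA "b"

# Adding + matches a run of one or more of those characters
run_match <- str_extract(words, "[abc]+")     # "cab" NA "bac"
```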

What does [^abc] match?

Question 2:

When it is not inside square brackets, the ^ symbol means “start of string”.

What will be returned by the following?

my_str <- c("Kelly", "Hi Kelly", "Kelly Bodwin", "Who is Kelly?")
str_subset(my_str, "^Kelly")
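
To see the start-of-string anchor on different data, the sketch below compares an unanchored pattern with an anchored one. The vector `animals` is a hypothetical example chosen so it does not overlap with the check-in strings.

```r
library(stringr)

# Hypothetical strings, chosen to differ from the check-in data
animals <- c("cat", "catalog", "bobcat", "a cat")

# Without an anchor, "cat" may match anywhere in the string
anywhere <- str_detect(animals, "cat")    # TRUE TRUE TRUE TRUE

# With ^, the match must begin at the first character of the string
at_start <- str_detect(animals, "^cat")   # TRUE TRUE FALSE FALSE
```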

Question 3:

The $ symbol in a regular expression means “end of string”.

What will be returned by the following?

my_str <- c("Kelly", "Hi Kelly", "Kelly Bodwin", "Who is Kelly?")
str_subset(my_str, "Kelly$")
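
The end-of-string anchor behaves the same way at the other end. This is a small sketch on a hypothetical vector of file names (`files` is invented for illustration); it also shows that ^ and $ can be combined to require that the whole string match.

```r
library(stringr)

# Hypothetical file names, not part of the check-in
files <- c("notes.txt", "notes.txt.bak", "txt", "report.txt")

# $ requires the match to finish at the last character of the string
ends_txt <- str_subset(files, "txt$")     # "notes.txt" "txt" "report.txt"

# ^ and $ together require the entire string to match the pattern
only_txt <- str_subset(files, "^txt$")    # "txt"
```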


Check-In 2: Simple Regular Expressions


What will the following outputs be?

my_str <- "The Dursleys of Number 4 Privet Drive were happy to say that they were perfectly normal, thank you very much."

str_extract_all(my_str, ".*")

str_extract_all(my_str, "\\w")

str_extract_all(my_str, "\\s")

str_extract_all(my_str, "[:alpha:]+")

str_extract_all(my_str, "[:alpha:]*\\.")

str_extract_all(my_str, "[wv]er[ey]")


Check-In 3: Complex Regular Expressions


What will the following outputs be?

my_str <- "The Dursleys of Number 4 Privet Drive were happy to say that they were perfectly normal, thank you very much."

str_extract_all(my_str, "[:digit:] ([A-Z][a-z]*)+")

str_extract_all(my_str, "(?<=[:digit:] )[:alpha:]+")

str_extract_all(my_str, "[:digit:].*Drive")

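# Split on spaces, flatten the resulting list into a vector of words,
# then extract a leading capital letter from each word
my_str %>%
  str_split(" ") %>%
  unlist() %>%
  str_extract("^[A-Z]")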


Check-In 4: Text Analysis with Regular Expressions


The file hamlet_speech.txt, posted on the course site, contains the text of a famous speech from the play “Hamlet” by Shakespeare. Download this file and save it somewhere reasonable. Read it into R with:

hamlet <- readLines("hamlet_speech.txt")

Answer the following:

  • How many words are in the speech? (Hint: str_count)

  • How many times does Hamlet reference death or dying?

  • How many sentences are in the speech?

  • What is the longest word in the speech?

  • What is the only capitalized word that does not start a sentence or line?
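
As a reminder of how str_count() behaves on a vector, here is a minimal sketch. The two-element vector `lines` is hypothetical and simply stands in for text read with readLines(); the pattern counts a single letter rather than anything the questions above ask for.

```r
library(stringr)

# Hypothetical lines of text, standing in for a vector read with readLines()
lines <- c("to be or not", "that is the question")

# str_count() returns one count per element: here, occurrences of "t"
t_per_line <- str_count(lines, "t")   # 2 4

# Sum the per-line counts to get a total for the whole text
t_total <- sum(t_per_line)            # 6
```

Because the result has one entry per line, a total for the whole speech needs either a sum() over the counts or a single collapsed string as input.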

Hint: Right now, your object is a vector of type character, where each element is a line of the speech. You may want to use str_c() (with appropriate arguments) to turn this into a single string. You may also want to turn it into a vector where each element is one word.

Or you may want to work with all three structures (a vector of lines, a single string, and a vector of words)! Different tasks will be easier with different object structures.
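
The conversions described in the hint might look like the sketch below. It assumes lines should be joined with a single space and words split on runs of whitespace; both are choices you may want to adjust, and `lines` is a hypothetical stand-in for the speech.

```r
library(stringr)

# Hypothetical two-line text standing in for the result of readLines()
lines <- c("The quick brown fox", "jumps over the lazy dog.")

# Vector of lines -> one string, joining the lines with a space
one_string <- str_c(lines, collapse = " ")

# One string -> vector of words, splitting on runs of whitespace;
# str_split() returns a list, so unlist() flattens it to a character vector
words <- unlist(str_split(one_string, "\\s+"))

length(lines)        # 2
length(one_string)   # 1
length(words)        # 9
```

Note that splitting on whitespace leaves punctuation attached to words (the last element here is "dog."), which matters for tasks such as finding the longest word.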