Stringr and Regular Expressions

class: center, middle, inverse, title-slide

.title[
# Stringr and Regular Expressions
]
.subtitle[
## PHC 6701 - Summer 2022
]
.author[
### Catalina Cañizares
]
.date[
### Jul 14, 2022
]

---

# Overview

## We will cover the following:

+ String manipulation.

+ `str`_functions.

+ Regular Expressions.

---

# Prepare the following packages for this lecture:

```r
install.packages("learnr")
library(learnr)
library(palmerpenguins)
library(tidyverse)
```

.pull-left[
.center[![:scale 40%](images/tidyverse.png)]
]

.pull-right[
.center[![:scale 40%](images/palmer_penguins.png)]
]

.center[![:scale 40%](images/learnr.png)]

---

# What are strings ?

Words or characters.

.center[![:scale 50%](images/stringr_sticker.png)]

```r
string <- c( "It is just english", "inside a vector")
string
```

```
[1] "It is just english" "inside a vector"   
```

---
# Why should you care?

Manipulating strings usually is part of the Wrangling process

.center[![:scale 80%](images/process.png)]

---
# `stringr` Package?

+ `stringr` version 1.4.0 has a total of 53 functions.

--
+ 41 of them start with `str_`

```r
stringr::str
```
.center[![:scale 50%](images/str_functions.png)]

---
# We are not covering all...

.center[![:scale 80%](images/learning_curve.png)]

But we will cover the most often used functions
---

# The functions

.center[![:scale 50%](images/str_notes.png)]

```r
str_legth() 
str_c()
str_sub()
str_split()
str_subset()
str_extract()
str_match()
str_replace()
str_replace_all()
str_to_lower()
str_to_upper()
str_to_title()
```

---
# Examples

**str_length()**. 
Counts the number of "code points", in a string

```r
names <- c("Gabriel", "Anny", "Catalina")
```

```r
str_length(names)
```

```
[1] 7 4 8
```

--
**str_c()**. 
Joins two or more vectors into a single character vector,

```r
sep_sentence <- c("Hi everyone", "I want this to be", "only one sentence", "HELP!")
sep_sentence
```

```
[1] "Hi everyone"       "I want this to be" "only one sentence"
[4] "HELP!"            
```

```r
one_sentence <- str_c(sep_sentence, collapse = ", ")
one_sentence
```

```
[1] "Hi everyone, I want this to be, only one sentence, HELP!"
```

---
# Examples

--
**str_sub**. 
Extract substrings from a character vector, from = start, to = end

```r
sense <- ("This makes no sense")
str_sub(sense, 1,10)
```

```
[1] "This makes"
```

```r
str_sub(sense, 15)
```

```
[1] "sense"
```
--
**str_subset**() Keep strings matching a pattern,

```r
more_sense <- c("This makes no sense", "although", "I am", "understanding")
str_subset(more_sense, "u")
```

```
[1] "although"      "understanding"
```

```r
str_subset(more_sense, "c")
```

```
character(0)
```

---
# Examples

--
**str_split()**. 
Split up a string into pieces

```r
surnames <- c("Ferdus Tarana", "Rodriguez Christofer", "Salehe Said")
surnames
```

```
[1] "Ferdus Tarana"        "Rodriguez Christofer" "Salehe Said"         
```

```r
str_split(surnames, pattern = "," , simplify = TRUE)
```

```
     [,1]                  
[1,] "Ferdus Tarana"       
[2,] "Rodriguez Christofer"
[3,] "Salehe Said"         
```

--
**str_extract**() Extract matching patterns from a string

```r
project2 <- c("12L", "52M", "1B", "nonsense2", "12", "more 2 think")
str_extract(project2, "\\d")
```

```
[1] "1" "5" "1" "2" "1" "2"
```

```r
str_extract(project2,  "\\d+")
```

```
[1] "12" "52" "1"  "2"  "12" "2" 
```
---
# Examples

--
str_replace() Replace matched patterns in a string

```r
typos <- c("kansas", "kansas", "kansas", "kansas", 
 "k", "hansas", "k")

str_replace(typos, pattern =  "kansas", replacement = "k")
```

```
[1] "k"      "k"      "k"      "k"      "k"      "hansas" "k"     
```

---
# Examples

--
str_to_lower()

```r
capital <- c("MY OBJECT", "ONLY HAS", "CAPITAL LETTERS", "UNTIL", "...")

lower_case <- str_to_lower(capital)
lower_case
```

```
[1] "my object" "only has" "capital letters" "until" 
[5] "..." 
```
--
str_to_upper()

```r
str_to_upper(lower_case)
```

```
[1] "MY OBJECT" "ONLY HAS" "CAPITAL LETTERS" "UNTIL" 
[5] "..." 
```
--
str_to_title()

```r
str_to_title(capital)
```

```
[1] "My Object"       "Only Has"        "Capital Letters" "Until"          
[5] "..."            
```

---
class: segue

Regular Expressions

---
# Rexexps
.center[![:scale 100%](images/cat.png)]

---
# For real...Regexp

Are a concise and flexible tool for describing patterns in strings.

Examples:

```r
fast_food <- c("hamburgers", "(hamburgers9", "1hot dog", "(fries12", "2sandwich", ".")
```
--

```r
str_extract(fast_food, "h")
```

```
[1] "h" "h" "h" NA  "h" NA 
```
--

```r
str_extract(fast_food, "^h")
```

```
[1] "h" NA  NA  NA  NA  NA 
```
--

```r
str_extract(fast_food, "h$")
```

```
[1] NA  NA  NA  NA  "h" NA 
```
--

```r
str_extract(fast_food, "\\d+")
```

```
[1] NA   "9"  "1"  "12" "2"  NA  
```

---
# Types of Regular Expressions 😄😿 😠
There are **three** basic types of regular expressions:

--
1. Regular expressions that stand for individual symbols and determine frequencies

--
2. Regular expressions that stand for classes of symbols

--
3. Regular expressions that stand for structural properties

.center[![:scale 60%](images/cat2.png)]

---
# Escaping... 🚀

```r
fast_food
```

```
[1] "hamburgers"   "(hamburgers9" "1hot dog"     "(fries12"     "2sandwich"   
[6] "."           
```

The last character in our string is `"."`

How do you match a literal `"."` or `"("`?

```r
str_extract(fast_food, ".")
```

```
[1] "h" "(" "1" "(" "2" "."
```
---

# Escaping... 🚀  
--

```r
str_extract(fast_food, "\\.")
```

```
[1] NA  NA  NA  NA  NA  "."
```
--

```r
str_extract(fast_food, "\\(")
```

```
[1] NA  "(" NA  "(" NA  NA 
```
--

```r
str_extract(fast_food, c("\\.|\\(" ))
```

```
[1] NA  "(" NA  "(" NA  "."
```
--
You need to use an “escape” to tell the regular expression you want to match it exactly
not use its special behavior.

To create the regular expression `"\."` you need the string `"\\."`

---

## Regular expressions that stand for individual symbols and determine frequencies.

| RegEx Symbol/Sequence | Explanation                                           |
| --------------------- | ----------------------------------------------------- |
| \*                    | The preceding item will be matched zero or more times |
| +                     | The preceding item will be matched one or more times  |
|\.                     |The Wild card, takes the place of anything

---
## Regular expressions that stand for individual symbols and determine frequencies.
Examples
--

```r
str_extract(fast_food, "ham.+")
```

```
[1] "hamburgers"  "hamburgers9" NA            NA            NA           
[6] NA           
```
--

```r
str_extract(fast_food, ".+")
```

```
[1] "hamburgers"   "(hamburgers9" "1hot dog"     "(fries12"     "2sandwich"   
[6] "."           
```
--

```r
str_replace(fast_food, "\\(", "")
```

```
[1] "hamburgers"  "hamburgers9" "1hot dog"    "fries12"     "2sandwich"  
[6] "."          
```

---
## Regular expressions that stand for classes of symbols.
| RegEx Symbol/Sequence | Explanation                                                       |
| --------------------- | ----------------------------------------------------------------- |
| \[ab\]                | lower case a and b                                                |
| \[a-z\]               | all lower case characters from a to z                             |
| \[AB\]                | upper case a and b                                                |
| \[A-Z\]               | all upper case characters from A to Z                             |
| \[12\]                | digits 1 and 2                                                    |
| \[0-9\]               | digits: 0 1 2 3 4 5 6 7 8 9                                       |
| \[:digit:\]           | digits: 0 1 2 3 4 5 6 7 8 9                                       |
| \[:lower:\]           | lower case characters: a–z                                        |
| \[:upper:\]           | upper case characters: A–Z                                        |

---
## Regular expressions that stand for classes of symbols.
Examples

```r
str_extract(fast_food, "[ab].+")
```

```
[1] "amburgers"  "amburgers9" NA           NA           "andwich"   
[6] NA          
```
--

```r
str_extract(fast_food, "[1-2].+")
```

```
[1] NA          NA          "1hot dog"  "12"        "2sandwich" NA         
```
--

```r
str_extract(fast_food, "[:digit:]")
```

```
[1] NA  "9" "1" "1" "2" NA 
```

---

## Regular expressions that stand for structural properties.

| RegEx Symbol/Sequence | Explanation                            |
| --------------------- | -------------------------------------- |
| \\\\w                 | Word characters: \[\[:alnum:\]\_\]     |
| \\\\W                 | No word characters: \[^\[:alnum:\]\_\] |
| \\\\s                 | Space characters: \[\[:blank:\]\]      |
| \\\\S                 | No space characters: \[^\[:blank:\]\]  |
| \\\\d                 | Digits: \[\[:digit:\]\]                |
| \\\\D                 | No digits: \[^\[:digit:\]\]            |
| ^                     | Beginning of a string                  |
| $                     | End of a string                        |

---

## Regular expressions that stand for structural properties.
Examples

```r
str_extract(fast_food, "\\d")
```

```
[1] NA  "9" "1" "1" "2" NA 
```
--

```r
str_extract(fast_food, "\\W")
```

```
[1] NA  "(" " " "(" NA  "."
```
--

```r
str_extract(fast_food, "g$")
```

```
[1] NA  NA  "g" NA  NA  NA 
```

---

## Finally, let's clean that horrible fast_food string

count: false

.panel1-my_food-auto[

```r
*  tibble(fast_food)
```
]
 
.panel2-my_food-auto[

```
# A tibble: 6 × 1
 fast_food 
 <chr> 
1 hamburgers 
2 (hamburgers9
3 1hot dog 
4 (fries12 
5 2sandwich 
6 . 
```
]

---
count: false

.panel1-my_food-auto[

```r
   tibble(fast_food) %>%
*  rename(food = fast_food)
```
]
 
.panel2-my_food-auto[

```
# A tibble: 6 × 1
 food 
 <chr> 
1 hamburgers 
2 (hamburgers9
3 1hot dog 
4 (fries12 
5 2sandwich 
6 . 
```
]

---
count: false

.panel1-my_food-auto[

```r
   tibble(fast_food) %>%
   rename(food = fast_food) %>%
*  mutate(
*    clean = str_remove_all(
*      food, "[:digit:]")
*   )
```
]
 
.panel2-my_food-auto[

```
# A tibble: 6 × 2
 food clean 
 <chr> <chr> 
1 hamburgers hamburgers 
2 (hamburgers9 (hamburgers
3 1hot dog hot dog 
4 (fries12 (fries 
5 2sandwich sandwich 
6 . . 
```
]

---
count: false

.panel1-my_food-auto[

```r
   tibble(fast_food) %>%
   rename(food = fast_food) %>%
   mutate(
     clean = str_remove_all(
       food, "[:digit:]")
    ) %>%
*  mutate(
*    clean = str_remove_all(
*      clean, "\\(")
*   )
```
]
 
.panel2-my_food-auto[

```
# A tibble: 6 × 2
 food clean 
 <chr> <chr> 
1 hamburgers hamburgers
2 (hamburgers9 hamburgers
3 1hot dog hot dog 
4 (fries12 fries 
5 2sandwich sandwich 
6 . . 
```
]

---
count: false

.panel1-my_food-auto[

```r
   tibble(fast_food) %>%
   rename(food = fast_food) %>%
   mutate(
     clean = str_remove_all(
       food, "[:digit:]")
    ) %>%
   mutate(
     clean = str_remove_all(
       clean, "\\(")
    ) %>%
*  mutate(
*    clean = str_replace(
*      clean, "\\.", "Pizza")
*    )
```
]
 
.panel2-my_food-auto[

```
# A tibble: 6 × 2
 food clean 
 <chr> <chr> 
1 hamburgers hamburgers
2 (hamburgers9 hamburgers
3 1hot dog hot dog 
4 (fries12 fries 
5 2sandwich sandwich 
6 . Pizza 
```
]

---
count: false

.panel1-my_food-auto[

```r
   tibble(fast_food) %>%
   rename(food = fast_food) %>%
   mutate(
     clean = str_remove_all(
       food, "[:digit:]")
    ) %>%
   mutate(
     clean = str_remove_all(
       clean, "\\(")
    ) %>%
   mutate(
     clean = str_replace(
       clean, "\\.", "Pizza")
     ) %>%
*  mutate(
*    clean = str_to_sentence(clean)
*    )
```
]
 
.panel2-my_food-auto[

```
# A tibble: 6 × 2
 food clean 
 <chr> <chr> 
1 hamburgers Hamburgers
2 (hamburgers9 Hamburgers
3 1hot dog Hot dog 
4 (fries12 Fries 
5 2sandwich Sandwich 
6 . Pizza 
```
]

---
count: false

.panel1-my_food-auto[

```r
   tibble(fast_food) %>%
   rename(food = fast_food) %>%
   mutate(
     clean = str_remove_all(
       food, "[:digit:]")
    ) %>%
   mutate(
     clean = str_remove_all(
       clean, "\\(")
    ) %>%
   mutate(
     clean = str_replace(
       clean, "\\.", "Pizza")
     ) %>%
   mutate(
     clean = str_to_sentence(clean)
     ) %>%
* select(Food = clean)
```
]
 
.panel2-my_food-auto[

```
# A tibble: 6 × 1
 Food 
 <chr> 
1 Hamburgers
2 Hamburgers
3 Hot dog 
4 Fries 
5 Sandwich 
6 Pizza 
```
]

---

---
## Last valuable tip

Separate()

It is not a str function but it is mostly used to separate strings into columns or rows.
It comes from the `tidyr` package (loaded with tidyverse)

Example:

```
     CountryCurrency ValuePer_1USd      Opinion
1 Bolivar- Venezuela     33137.833          OMG
2  Dollar- Australia         1.478           Ok
3     Peso- Colombia      4534.000 J e s u s...
```
--

```r
us_currency %>% 
  separate(CountryCurrency, c("Currency", "Country"), sep = "-") 
```

```
  Currency    Country ValuePer_1USd      Opinion
1  Bolivar  Venezuela     33137.833          OMG
2   Dollar  Australia         1.478           Ok
3     Peso   Colombia      4534.000 J e s u s...
```
---
## Let's remove that annoying space
count: false

.panel1-space-auto[

```r
*us_currency
```
]
 
.panel2-space-auto[

```
     CountryCurrency ValuePer_1USd      Opinion
1 Bolivar- Venezuela     33137.833          OMG
2  Dollar- Australia         1.478           Ok
3     Peso- Colombia      4534.000 J e s u s...
```
]

---
count: false

.panel1-space-auto[

```r
 us_currency %>%
* separate(CountryCurrency,
*          c("Currency", "Country"),
*          sep = "-"
* )
```
]
 
.panel2-space-auto[

```
  Currency    Country ValuePer_1USd      Opinion
1  Bolivar  Venezuela     33137.833          OMG
2   Dollar  Australia         1.478           Ok
3     Peso   Colombia      4534.000 J e s u s...
```
]

---
count: false

.panel1-space-auto[

```r
 us_currency %>%
  separate(CountryCurrency,
           c("Currency", "Country"),
           sep = "-"
  ) %>%
* mutate(
*   Country = str_trim(
*     Country,
*     side = "left")
* )
```
]
 
.panel2-space-auto[

```
  Currency   Country ValuePer_1USd      Opinion
1  Bolivar Venezuela     33137.833          OMG
2   Dollar Australia         1.478           Ok
3     Peso  Colombia      4534.000 J e s u s...
```
]

---
count: false

.panel1-space-auto[

```r
 us_currency %>%
  separate(CountryCurrency,
           c("Currency", "Country"),
           sep = "-"
  ) %>%
  mutate(
    Country = str_trim(
      Country,
      side = "left")
  ) %>%
* dplyr::filter(
*   Country == "Colombia"
* )
```
]
 
.panel2-space-auto[

```
  Currency  Country ValuePer_1USd      Opinion
1     Peso Colombia          4534 J e s u s...
```
]

---

---
## Let's go practice 😢

1. Download the .rmd file sent to your email 
2. install `learnr`
3. hit run

---
## Useful resources

https://stringr.tidyverse.org/articles/regular-expressions.html
https://stringr.tidyverse.org/
https://ladal.edu.au/regex.html
https://nuitrcs.github.io/r-tidyverse/html/stringr.html
https://regexone.com/lesson/excluding_characters?

---
## [\\T^h*e \\ $E^n#d \\]
.center[![:scale 100%](images/final.png)]