Manipulating strings is usually part of the wrangling process

.center[]

---

# `stringr` Package?

+ `stringr` version 1.4.0 has a total of 53 functions.

--

+ 41 of them start with `str_`

```r
stringr::str
```

.center[]

---

# We are not covering all...

.center[]

--

<mark>But we will cover the most often used functions</mark>

---

# The functions

.center[]

```r
str_length()
str_c()
str_sub()
str_split()
str_subset()
str_extract()
str_match()
str_replace()
str_replace_all()
str_to_lower()
str_to_upper()
str_to_title()
```

---

# Examples

--

<mark>**str_length()**</mark>. Counts the number of "code points" in a string

```r
names <- c("Gabriel", "Anny", "Catalina")
```

```r
str_length(names)
```

```
[1] 7 4 8
```

--

<mark>**str_c()**</mark>. Joins two or more vectors into a single character vector

```r
sep_sentence <- c("Hi everyone", "I want this to be", "only one sentence", "HELP!")
sep_sentence
```

```
[1] "Hi everyone"       "I want this to be" "only one sentence"
[4] "HELP!"
```

```r
one_sentence <- str_c(sep_sentence, collapse = ", ")
one_sentence
```

```
[1] "Hi everyone, I want this to be, only one sentence, HELP!"
```

---

# Examples

--

<mark>**str_sub()**</mark>. Extracts substrings from a character vector, from = start, to = end

```r
sense <- ("This makes no sense")
str_sub(sense, 1, 10)
```

```
[1] "This makes"
```

```r
str_sub(sense, 15)
```

```
[1] "sense"
```

--

<mark>**str_subset()**</mark>. Keeps strings matching a pattern

```r
more_sense <- c("This makes no sense", "although", "I am", "understanding")
str_subset(more_sense, "u")
```

```
[1] "although"      "understanding"
```

```r
str_subset(more_sense, "c")
```

```
character(0)
```

---

# Examples

--

<mark>**str_split()**</mark>.
Split up a string into pieces

```r
surnames <- c("Ferdus Tarana", "Rodriguez Christofer", "Salehe Said")
surnames
```

```
[1] "Ferdus Tarana"        "Rodriguez Christofer" "Salehe Said"
```

```r
str_split(surnames, pattern = "," , simplify = TRUE)
```

```
     [,1]
[1,] "Ferdus Tarana"
[2,] "Rodriguez Christofer"
[3,] "Salehe Said"
```

--

<mark>**str_extract()**</mark>. Extracts matching patterns from a string

```r
project2 <- c("12L", "52M", "1B", "nonsense2", "12", "more 2 think")
str_extract(project2, "\\d")
```

```
[1] "1" "5" "1" "2" "1" "2"
```

```r
str_extract(project2, "\\d+")
```

```
[1] "12" "52" "1"  "2"  "12" "2"
```

---

# Examples

--

<mark>str_replace()</mark> Replaces matched patterns in a string

```r
typos <- c("kansas", "kansas", "kansas", "kansas", "k", "hansas", "k")
str_replace(typos, pattern = "kansas", replacement = "k")
```

```
[1] "k"      "k"      "k"      "k"      "k"      "hansas" "k"
```

---

# Examples

--

<mark>str_to_lower()</mark>

```r
capital <- c("MY OBJECT", "ONLY HAS", "CAPITAL LETTERS", "UNTIL", "...")
lower_case <- str_to_lower(capital)
lower_case
```

```
[1] "my object"       "only has"        "capital letters" "until"
[5] "..."
```

--

<mark>str_to_upper()</mark>

```r
str_to_upper(lower_case)
```

```
[1] "MY OBJECT"       "ONLY HAS"        "CAPITAL LETTERS" "UNTIL"
[5] "..."
```

--

<mark>str_to_title()</mark>

```r
str_to_title(capital)
```

```
[1] "My Object"       "Only Has"        "Capital Letters" "Until"
[5] "..."
```

---

class: segue

Regular Expressions

---

# Regexps

.center[]

---

# For real...Regexp

Are a concise and flexible tool for describing patterns in strings.
Examples:

```r
fast_food <- c("hamburgers", "(hamburgers9", "1hot dog", "(fries12", "2sandwich", ".")
```

--

```r
str_extract(fast_food, "h")
```

```
[1] "h" "h" "h" NA  "h" NA
```

--

```r
str_extract(fast_food, "^h")
```

```
[1] "h" NA  NA  NA  NA  NA
```

--

```r
str_extract(fast_food, "h$")
```

```
[1] NA  NA  NA  NA  "h" NA
```

--

```r
str_extract(fast_food, "\\d+")
```

```
[1] NA   "9"  "1"  "12" "2"  NA
```

---

# Types of Regular Expressions 😄😿 😠

There are **three** basic types of regular expressions:

--

1. Regular expressions that stand for individual symbols and determine frequencies

--

2. Regular expressions that stand for classes of symbols

--

3. Regular expressions that stand for structural properties

.center[]

---

# Escaping... 🚀

```r
fast_food
```

```
[1] "hamburgers"   "(hamburgers9" "1hot dog"     "(fries12"     "2sandwich"
[6] "."
```

The last element of our vector is `"."`

How do you match a literal `"."` or `"("`?

--

```r
str_extract(fast_food, ".")
```

```
[1] "h" "(" "1" "(" "2" "."
```

---

# Escaping... 🚀

--

```r
str_extract(fast_food, "\\.")
```

```
[1] NA  NA  NA  NA  NA  "."
```

--

```r
str_extract(fast_food, "\\(")
```

```
[1] NA  "(" NA  "(" NA  NA
```

--

```r
str_extract(fast_food, c("\\.|\\(" ))
```

```
[1] NA  "(" NA  "(" NA  "."
```

--

You need an “escape” (`\\` in an R string) to tell the regular expression to match the character literally instead of using its special behavior.

---

## Regular expressions that stand for individual symbols and determine frequencies.

| RegEx Symbol/Sequence | Explanation                                           |
| --------------------- | ----------------------------------------------------- |
| \*                    | The preceding item will be matched zero or more times |
| +                     | The preceding item will be matched one or more times  |
| .                     | The wildcard: matches any single character            |

---

## Regular expressions that stand for individual symbols and determine frequencies.

<mark>Examples</mark>

--

```r
str_extract(fast_food, "ham.+")
```

```
[1] "hamburgers"  "hamburgers9" NA            NA            NA           
[6] NA           
```

--

```r
str_extract(fast_food, ".+")
```

```
[1] "hamburgers"   "(hamburgers9" "1hot dog"     "(fries12"     "2sandwich"   
[6] "."           
```

--

```r
str_replace(fast_food, "\\(", "")
```

```
[1] "hamburgers"  "hamburgers9" "1hot dog"    "fries12"     "2sandwich"  
[6] "."          
```
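
---

The examples above only use `+`. As a minimal sketch of how `*` differs, the same `fast_food` vector can be matched with an `s` followed by zero or more digits versus one or more digits (the object names `zero_or_more` and `one_or_more` are only for illustration):

```r
library(stringr)

fast_food <- c("hamburgers", "(hamburgers9", "1hot dog", "(fries12", "2sandwich", ".")

# "*" allows zero digits after the "s", so a bare "s" still matches
zero_or_more <- str_extract(fast_food, "s\\d*")
zero_or_more
# [1] "s"   "s9"  NA    "s12" "s"   NA

# "+" requires at least one digit after the "s"
one_or_more <- str_extract(fast_food, "s\\d+")
one_or_more
# [1] NA    "s9"  NA    "s12" NA    NA
```

With `*`, strings that contain an `s` but no digit right after it still match; with `+` they return `NA`.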

---

## Regular expressions that stand for classes of symbols.

| RegEx Symbol/Sequence | Explanation                                                       |
| --------------------- | ----------------------------------------------------------------- |
| \[ab\]                | lower case a or b                                                 |
| \[a-z\]               | all lower case characters from a to z                             |
| \[AB\]                | upper case A or B                                                 |
| \[A-Z\]               | all upper case characters from A to Z                             |
| \[12\]                | digit 1 or 2                                                      |
| \[0-9\]               | digits: 0 1 2 3 4 5 6 7 8 9                                       |
| \[:digit:\]           | digits: 0 1 2 3 4 5 6 7 8 9                                       |
| \[:lower:\]           | lower case characters: a–z                                        |
| \[:upper:\]           | upper case characters: A–Z                                        |

---

## Regular expressions that stand for classes of symbols.

<mark>Examples</mark>

```r
str_extract(fast_food, "[ab].+")
```

```
[1] "amburgers"  "amburgers9" NA           NA           "andwich"   
[6] NA          
```

--

```r
str_extract(fast_food, "[1-2].+")
```

```
[1] NA           NA           "1hot dog"   "12"         "2sandwich" 
[6] NA          
```

--

```r
str_extract(fast_food, "[:digit:]")
```

```
[1] NA  "9" "1" "1" "2" NA 
```
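
---

A class can also be combined with `+` to grab a whole run of matching characters. The sketch below is illustrative only (the names `letters_only` and `digits_only` are not part of the original examples):

```r
library(stringr)

fast_food <- c("hamburgers", "(hamburgers9", "1hot dog", "(fries12", "2sandwich", ".")

# First run of lower case letters in each string
letters_only <- str_extract(fast_food, "[a-z]+")
letters_only
# [1] "hamburgers" "hamburgers" "hot"        "fries"      "sandwich"   NA

# First run of digits in each string
digits_only <- str_extract(fast_food, "[0-9]+")
digits_only
# [1] NA   "9"  "1"  "12" "2"  NA
```

Note that `"1hot dog"` gives `"hot"` and not `"hot dog"`, because the space is not part of `[a-z]`.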

---

## Regular expressions that stand for structural properties.

| RegEx Symbol/Sequence | Explanation                                 |
| --------------------- | ------------------------------------------- |
| \\\\w                 | Word characters: \[\[:alnum:\]\_\]          |
| \\\\W                 | Non-word characters: \[^\[:alnum:\]\_\]     |
| \\\\s                 | Whitespace characters: \[\[:space:\]\]      |
| \\\\S                 | Non-whitespace characters: \[^\[:space:\]\] |
| \\\\d                 | Digits: \[\[:digit:\]\]                     |
| \\\\D                 | Non-digits: \[^\[:digit:\]\]                |
| ^                     | Beginning of a string                       |
| $                     | End of a string                             |

---

## Regular expressions that stand for structural properties.

<mark>Examples</mark>

```r
str_extract(fast_food, "\\d")
```

```
[1] NA  "9" "1" "1" "2" NA 
```

--

```r
str_extract(fast_food, "\\W")
```

```
[1] NA  "(" " " "(" NA  "."
```

--

```r
str_extract(fast_food, "g$")
```

```
[1] NA  NA  "g" NA  NA  NA 
```
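
---

Anchors become more useful when combined with a class. As a rough sketch, `str_subset()` from earlier in the deck can keep only the strings that start or end with a digit (the object names are made up for this example):

```r
library(stringr)

fast_food <- c("hamburgers", "(hamburgers9", "1hot dog", "(fries12", "2sandwich", ".")

# "^\\d": a digit at the very beginning of the string
starts_with_digit <- str_subset(fast_food, "^\\d")
starts_with_digit
# [1] "1hot dog"  "2sandwich"

# "\\d$": a digit at the very end of the string
ends_with_digit <- str_subset(fast_food, "\\d$")
ends_with_digit
# [1] "(hamburgers9" "(fries12"
```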

---

## Finally, let's clean that horrible fast_food string

count: false

.panel1-my_food-auto[
```r
* tibble(fast_food)
```
]

.panel2-my_food-auto[
```
# A tibble: 6 × 1
  fast_food    
  <chr>        
1 hamburgers   
2 (hamburgers9 
3 1hot dog     
4 (fries12     
5 2sandwich    
6 .            
```
]

---
count: false

.panel1-my_food-auto[
```r
tibble(fast_food) %>%
* rename(food = fast_food)
```
]

.panel2-my_food-auto[
```
# A tibble: 6 × 1
  food         
  <chr>        
1 hamburgers   
2 (hamburgers9 
3 1hot dog     
4 (fries12     
5 2sandwich    
6 .            
```
]

---
count: false

.panel1-my_food-auto[
```r
tibble(fast_food) %>%
  rename(food = fast_food) %>%
* mutate(
*   clean = str_remove_all(
*     food, "[:digit:]")
* )
```
]

.panel2-my_food-auto[
```
# A tibble: 6 × 2
  food         clean      
  <chr>        <chr>      
1 hamburgers   hamburgers 
2 (hamburgers9 (hamburgers
3 1hot dog     hot dog    
4 (fries12     (fries     
5 2sandwich    sandwich   
6 .            .          
```
]

---
count: false

.panel1-my_food-auto[
```r
tibble(fast_food) %>%
  rename(food = fast_food) %>%
  mutate(
    clean = str_remove_all(
      food, "[:digit:]")
  ) %>%
* mutate(
*   clean = str_remove_all(
*     clean, "\\(")
* )
```
]

.panel2-my_food-auto[
```
# A tibble: 6 × 2
  food         clean     
  <chr>        <chr>     
1 hamburgers   hamburgers
2 (hamburgers9 hamburgers
3 1hot dog     hot dog   
4 (fries12     fries     
5 2sandwich    sandwich  
6 .            .         
```
]

---
count: false

.panel1-my_food-auto[
```r
tibble(fast_food) %>%
  rename(food = fast_food) %>%
  mutate(
    clean = str_remove_all(
      food, "[:digit:]")
  ) %>%
  mutate(
    clean = str_remove_all(
      clean, "\\(")
  ) %>%
* mutate(
*   clean = str_replace(
*     clean, "\\.", "Pizza")
* )
```
]

.panel2-my_food-auto[
```
# A tibble: 6 × 2
  food         clean     
  <chr>        <chr>     
1 hamburgers   hamburgers
2 (hamburgers9 hamburgers
3 1hot dog     hot dog   
4 (fries12     fries     
5 2sandwich    sandwich  
6 .            Pizza
```
]

---
count: false

.panel1-my_food-auto[
```r
tibble(fast_food) %>%
  rename(food = fast_food) %>%
  mutate(
    clean = str_remove_all(
      food, "[:digit:]")
  ) %>%
  mutate(
    clean = str_remove_all(
      clean, "\\(")
  ) %>%
  mutate(
    clean = str_replace(
      clean, "\\.", "Pizza")
  ) %>%
* mutate(
*   clean = str_to_sentence(clean)
* )
```
]

.panel2-my_food-auto[
```
# A tibble: 6 × 2
  food         clean     
  <chr>        <chr>     
1 hamburgers   Hamburgers
2 (hamburgers9 Hamburgers
3 1hot dog     Hot dog   
4 (fries12     Fries     
5 2sandwich    Sandwich  
6 .            Pizza
```
]

---
count: false

.panel1-my_food-auto[
```r
tibble(fast_food) %>%
  rename(food = fast_food) %>%
  mutate(
    clean = str_remove_all(
      food, "[:digit:]")
  ) %>%
  mutate(
    clean = str_remove_all(
      clean, "\\(")
  ) %>%
  mutate(
    clean = str_replace(
      clean, "\\.", "Pizza")
  ) %>%
  mutate(
    clean = str_to_sentence(clean)
  ) %>%
* select(Food = clean)
```
]

.panel2-my_food-auto[
```
# A tibble: 6 × 1
  Food      
  <chr>     
1 Hamburgers
2 Hamburgers
3 Hot dog   
4 Fries     
5 Sandwich  
6 Pizza
```
]
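
---

The steps above can probably be condensed. This is only one possible sketch, not the way the pipeline has to be written: it assumes the two removals can share a single pattern (`"\\d|\\("`) and chains the string functions inside one `mutate()`. The name `food_clean` is hypothetical.

```r
library(dplyr)
library(stringr)

fast_food <- c("hamburgers", "(hamburgers9", "1hot dog", "(fries12", "2sandwich", ".")

food_clean <- tibble(food = fast_food) %>%
  mutate(
    clean = food %>%
      str_remove_all("\\d|\\(") %>%      # drop digits and opening parentheses
      str_replace("\\.", "Pizza") %>%    # turn the literal "." into "Pizza"
      str_to_sentence()                  # capitalize the first letter
  )

food_clean$clean
# [1] "Hamburgers" "Hamburgers" "Hot dog"    "Fries"      "Sandwich"   "Pizza"
```

Keeping one `mutate()` per step, as in the slides, is easier to follow while learning; the condensed form is just shorter.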

---

## Last valuable tip

`separate()`

It is not a `stringr` function, but it is often used to split one character column into several columns. It comes from the `tidyr` package (loaded with `tidyverse`).

Example:

```
     CountryCurrency ValuePer_1USd      Opinion
1 Bolivar- Venezuela     33137.833          OMG
2  Dollar- Australia         1.478           Ok
3     Peso- Colombia      4534.000 J e s u s...
```

--

```r
us_currency %>%
  separate(CountryCurrency, c("Currency", "Country"), sep = "-")
```

```
  Currency   Country ValuePer_1USd      Opinion
1  Bolivar Venezuela      33137.833          OMG
2   Dollar Australia          1.478           Ok
3     Peso  Colombia       4534.000 J e s u s...
```

---

## Let's remove that annoying space

count: false

.panel1-space-auto[
```r
*us_currency
```
]

.panel2-space-auto[
```
     CountryCurrency ValuePer_1USd      Opinion
1 Bolivar- Venezuela     33137.833          OMG
2  Dollar- Australia         1.478           Ok
3     Peso- Colombia      4534.000 J e s u s...
```
]

---
count: false

.panel1-space-auto[
```r
us_currency %>%
* separate(CountryCurrency,
*          c("Currency", "Country"),
*          sep = "-"
*          )
```
]

.panel2-space-auto[
```
  Currency   Country ValuePer_1USd      Opinion
1  Bolivar Venezuela      33137.833          OMG
2   Dollar Australia          1.478           Ok
3     Peso  Colombia       4534.000 J e s u s...
```
]

---
count: false

.panel1-space-auto[
```r
us_currency %>%
  separate(CountryCurrency,
           c("Currency", "Country"),
           sep = "-"
           ) %>%
* mutate(
*   Country = str_trim(
*     Country,
*     side = "left")
* )
```
]

.panel2-space-auto[
```
  Currency   Country ValuePer_1USd      Opinion
1  Bolivar Venezuela      33137.833          OMG
2   Dollar Australia          1.478           Ok
3     Peso  Colombia       4534.000 J e s u s...
```
]

---
count: false

.panel1-space-auto[
```r
us_currency %>%
  separate(CountryCurrency,
           c("Currency", "Country"),
           sep = "-"
           ) %>%
  mutate(
    Country = str_trim(
      Country,
      side = "left")
  ) %>%
* dplyr::filter(
*   Country == "Colombia"
* )
```
]

.panel2-space-auto[
```
  Currency  Country ValuePer_1USd      Opinion
1     Peso Colombia          4534 J e s u s...
```
]
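
---

Another option is to let `separate()` swallow the space itself, since `sep` is treated as a regular expression. The sketch below rebuilds `us_currency` from the printed output above, so the exact column types are an assumption, and `split_currency` is a hypothetical name:

```r
library(tidyr)

# Assumed reconstruction of the data shown in the slides
us_currency <- data.frame(
  CountryCurrency = c("Bolivar- Venezuela", "Dollar- Australia", "Peso- Colombia"),
  ValuePer_1USd   = c(33137.833, 1.478, 4534),
  Opinion         = c("OMG", "Ok", "J e s u s..."),
  stringsAsFactors = FALSE
)

# "-\\s*" matches the dash plus any whitespace after it,
# so no str_trim() step is needed afterwards
split_currency <- us_currency %>%
  separate(CountryCurrency, c("Currency", "Country"), sep = "-\\s*")

split_currency$Country
# [1] "Venezuela" "Australia" "Colombia"
```

`str_trim()` is still the safer choice when the stray spaces are not next to the separator.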

---

## Let's go practice 😢

1. Download the `.Rmd` file sent to your email
2. Install `learnr`
3. Hit run

---

## Useful resources

---

## [\\T^h*e \\ $E^n#d \\]

.center[]