class: center, middle, inverse, title-slide .title[ # Stringr and Regular Expressions ] .subtitle[ ## PHC 6701 - Summer 2022 ] .author[ ### Catalina Cañizares ] .date[ ### Jul 14, 2022 ] --- # Overview ## We will cover the following: + String manipulation. + `str`_functions. + Regular Expressions. --- # Prepare the following packages for this lecture: ```r install.packages("learnr") library(learnr) library(palmerpenguins) library(tidyverse) ``` .pull-left[ .center[![:scale 40%](images/tidyverse.png)] ] .pull-right[ .center[![:scale 40%](images/palmer_penguins.png)] ] .center[![:scale 40%](images/learnr.png)] --- # What are strings ? Words or characters. .center[![:scale 50%](images/stringr_sticker.png)] ```r string <- c( "It is just english", "inside a vector") string ``` ``` [1] "It is just english" "inside a vector" ``` --- # Why should you care? Manipulating strings usually is part of the Wrangling process .center[![:scale 80%](images/process.png)] --- # `stringr` Package? + `stringr` version 1.4.0 has a total of 53 functions. -- + 41 of them start with `str_` ```r stringr::str ``` .center[![:scale 50%](images/str_functions.png)] --- # We are not covering all... .center[![:scale 80%](images/learning_curve.png)] -- <mark>But we will cover the most often used functions</mark> --- # The functions .center[![:scale 50%](images/str_notes.png)] ```r str_legth() str_c() str_sub() str_split() str_subset() str_extract() str_match() str_replace() str_replace_all() str_to_lower() str_to_upper() str_to_title() ``` --- # Examples -- <mark>**str_length()**</mark>. Counts the number of "code points", in a string ```r names <- c("Gabriel", "Anny", "Catalina") ``` ```r str_length(names) ``` ``` [1] 7 4 8 ``` -- <mark>**str_c()**</mark>. Joins two or more vectors into a single character vector, ```r sep_sentence <- c("Hi everyone", "I want this to be", "only one sentence", "HELP!") sep_sentence ``` ``` [1] "Hi everyone" "I want this to be" "only one sentence" [4] "HELP!" ``` ```r one_sentence <- str_c(sep_sentence, collapse = ", ") one_sentence ``` ``` [1] "Hi everyone, I want this to be, only one sentence, HELP!" ``` --- # Examples -- <mark>**str_sub**</mark>. Extract substrings from a character vector, from = start, to = end ```r sense <- ("This makes no sense") str_sub(sense, 1,10) ``` ``` [1] "This makes" ``` ```r str_sub(sense, 15) ``` ``` [1] "sense" ``` -- <mark>**str_subset**()</mark> Keep strings matching a pattern, ```r more_sense <- c("This makes no sense", "although", "I am", "understanding") str_subset(more_sense, "u") ``` ``` [1] "although" "understanding" ``` ```r str_subset(more_sense, "c") ``` ``` character(0) ``` --- # Examples -- <mark>**str_split()**</mark>. Split up a string into pieces ```r surnames <- c("Ferdus Tarana", "Rodriguez Christofer", "Salehe Said") surnames ``` ``` [1] "Ferdus Tarana" "Rodriguez Christofer" "Salehe Said" ``` ```r str_split(surnames, pattern = "," , simplify = TRUE) ``` ``` [,1] [1,] "Ferdus Tarana" [2,] "Rodriguez Christofer" [3,] "Salehe Said" ``` -- <mark>**str_extract**()</mark> Extract matching patterns from a string ```r project2 <- c("12L", "52M", "1B", "nonsense2", "12", "more 2 think") str_extract(project2, "\\d") ``` ``` [1] "1" "5" "1" "2" "1" "2" ``` ```r str_extract(project2, "\\d+") ``` ``` [1] "12" "52" "1" "2" "12" "2" ``` --- # Examples -- <mark>str_replace()</mark> Replace matched patterns in a string ```r typos <- c("kansas", "kansas", "kansas", "kansas", "k", "hansas", "k") str_replace(typos, pattern = "kansas", replacement = "k") ``` ``` [1] "k" "k" "k" "k" "k" "hansas" "k" ``` --- # Examples -- <mark>str_to_lower()</mark> ```r capital <- c("MY OBJECT", "ONLY HAS", "CAPITAL LETTERS", "UNTIL", "...") lower_case <- str_to_lower(capital) lower_case ``` ``` [1] "my object" "only has" "capital letters" "until" [5] "..." ``` -- <mark>str_to_upper()</mark> ```r str_to_upper(lower_case) ``` ``` [1] "MY OBJECT" "ONLY HAS" "CAPITAL LETTERS" "UNTIL" [5] "..." ``` -- <mark>str_to_title()</mark> ```r str_to_title(capital) ``` ``` [1] "My Object" "Only Has" "Capital Letters" "Until" [5] "..." ``` --- class: segue Regular Expressions --- # Rexexps .center[![:scale 100%](images/cat.png)] --- # For real...Regexp Are a concise and flexible tool for describing patterns in strings. Examples: ```r fast_food <- c("hamburgers", "(hamburgers9", "1hot dog", "(fries12", "2sandwich", ".") ``` -- ```r str_extract(fast_food, "h") ``` ``` [1] "h" "h" "h" NA "h" NA ``` -- ```r str_extract(fast_food, "^h") ``` ``` [1] "h" NA NA NA NA NA ``` -- ```r str_extract(fast_food, "h$") ``` ``` [1] NA NA NA NA "h" NA ``` -- ```r str_extract(fast_food, "\\d+") ``` ``` [1] NA "9" "1" "12" "2" NA ``` --- # Types of Regular Expressions 😄😿 😠 There are **three** basic types of regular expressions: -- 1. Regular expressions that stand for individual symbols and determine frequencies -- 2. Regular expressions that stand for classes of symbols -- 3. Regular expressions that stand for structural properties .center[![:scale 60%](images/cat2.png)] --- # Escaping... 🚀 ```r fast_food ``` ``` [1] "hamburgers" "(hamburgers9" "1hot dog" "(fries12" "2sandwich" [6] "." ``` The last character in our string is `"."` How do you match a literal `"."` or `"("`? -- ```r str_extract(fast_food, ".") ``` ``` [1] "h" "(" "1" "(" "2" "." ``` --- # Escaping... 🚀 -- ```r str_extract(fast_food, "\\.") ``` ``` [1] NA NA NA NA NA "." ``` -- ```r str_extract(fast_food, "\\(") ``` ``` [1] NA "(" NA "(" NA NA ``` -- ```r str_extract(fast_food, c("\\.|\\(" )) ``` ``` [1] NA "(" NA "(" NA "." ``` -- You need to use an “escape” to tell the regular expression you want to match it exactly not use its special behavior. To create the regular expression `"\."` you need the string `"\\."` --- ## Regular expressions that stand for individual symbols and determine frequencies. | RegEx Symbol/Sequence | Explanation | | --------------------- | ----------------------------------------------------- | | \* | The preceding item will be matched zero or more times | | + | The preceding item will be matched one or more times | |\. |The Wild card, takes the place of anything --- ## Regular expressions that stand for individual symbols and determine frequencies. <mark>Examples</mark> -- ```r str_extract(fast_food, "ham.+") ``` ``` [1] "hamburgers" "hamburgers9" NA NA NA [6] NA ``` -- ```r str_extract(fast_food, ".+") ``` ``` [1] "hamburgers" "(hamburgers9" "1hot dog" "(fries12" "2sandwich" [6] "." ``` -- ```r str_replace(fast_food, "\\(", "") ``` ``` [1] "hamburgers" "hamburgers9" "1hot dog" "fries12" "2sandwich" [6] "." ``` --- ## Regular expressions that stand for classes of symbols. | RegEx Symbol/Sequence | Explanation | | --------------------- | ----------------------------------------------------------------- | | \[ab\] | lower case a and b | | \[a-z\] | all lower case characters from a to z | | \[AB\] | upper case a and b | | \[A-Z\] | all upper case characters from A to Z | | \[12\] | digits 1 and 2 | | \[0-9\] | digits: 0 1 2 3 4 5 6 7 8 9 | | \[:digit:\] | digits: 0 1 2 3 4 5 6 7 8 9 | | \[:lower:\] | lower case characters: a–z | | \[:upper:\] | upper case characters: A–Z | --- ## Regular expressions that stand for classes of symbols. <mark>Examples</mark> ```r str_extract(fast_food, "[ab].+") ``` ``` [1] "amburgers" "amburgers9" NA NA "andwich" [6] NA ``` -- ```r str_extract(fast_food, "[1-2].+") ``` ``` [1] NA NA "1hot dog" "12" "2sandwich" NA ``` -- ```r str_extract(fast_food, "[:digit:]") ``` ``` [1] NA "9" "1" "1" "2" NA ``` --- ## Regular expressions that stand for structural properties. | RegEx Symbol/Sequence | Explanation | | --------------------- | -------------------------------------- | | \\\\w | Word characters: \[\[:alnum:\]\_\] | | \\\\W | No word characters: \[^\[:alnum:\]\_\] | | \\\\s | Space characters: \[\[:blank:\]\] | | \\\\S | No space characters: \[^\[:blank:\]\] | | \\\\d | Digits: \[\[:digit:\]\] | | \\\\D | No digits: \[^\[:digit:\]\] | | ^ | Beginning of a string | | $ | End of a string | --- ## Regular expressions that stand for structural properties. <mark>Examples</mark> ```r str_extract(fast_food, "\\d") ``` ``` [1] NA "9" "1" "1" "2" NA ``` -- ```r str_extract(fast_food, "\\W") ``` ``` [1] NA "(" " " "(" NA "." ``` -- ```r str_extract(fast_food, "g$") ``` ``` [1] NA NA "g" NA NA NA ``` --- ## Finally, let's clean that horrible fast_food string count: false .panel1-my_food-auto[ ```r * tibble(fast_food) ``` ] .panel2-my_food-auto[ ``` # A tibble: 6 × 1 fast_food <chr> 1 hamburgers 2 (hamburgers9 3 1hot dog 4 (fries12 5 2sandwich 6 . ``` ] --- count: false .panel1-my_food-auto[ ```r tibble(fast_food) %>% * rename(food = fast_food) ``` ] .panel2-my_food-auto[ ``` # A tibble: 6 × 1 food <chr> 1 hamburgers 2 (hamburgers9 3 1hot dog 4 (fries12 5 2sandwich 6 . ``` ] --- count: false .panel1-my_food-auto[ ```r tibble(fast_food) %>% rename(food = fast_food) %>% * mutate( * clean = str_remove_all( * food, "[:digit:]") * ) ``` ] .panel2-my_food-auto[ ``` # A tibble: 6 × 2 food clean <chr> <chr> 1 hamburgers hamburgers 2 (hamburgers9 (hamburgers 3 1hot dog hot dog 4 (fries12 (fries 5 2sandwich sandwich 6 . . ``` ] --- count: false .panel1-my_food-auto[ ```r tibble(fast_food) %>% rename(food = fast_food) %>% mutate( clean = str_remove_all( food, "[:digit:]") ) %>% * mutate( * clean = str_remove_all( * clean, "\\(") * ) ``` ] .panel2-my_food-auto[ ``` # A tibble: 6 × 2 food clean <chr> <chr> 1 hamburgers hamburgers 2 (hamburgers9 hamburgers 3 1hot dog hot dog 4 (fries12 fries 5 2sandwich sandwich 6 . . ``` ] --- count: false .panel1-my_food-auto[ ```r tibble(fast_food) %>% rename(food = fast_food) %>% mutate( clean = str_remove_all( food, "[:digit:]") ) %>% mutate( clean = str_remove_all( clean, "\\(") ) %>% * mutate( * clean = str_replace( * clean, "\\.", "Pizza") * ) ``` ] .panel2-my_food-auto[ ``` # A tibble: 6 × 2 food clean <chr> <chr> 1 hamburgers hamburgers 2 (hamburgers9 hamburgers 3 1hot dog hot dog 4 (fries12 fries 5 2sandwich sandwich 6 . Pizza ``` ] --- count: false .panel1-my_food-auto[ ```r tibble(fast_food) %>% rename(food = fast_food) %>% mutate( clean = str_remove_all( food, "[:digit:]") ) %>% mutate( clean = str_remove_all( clean, "\\(") ) %>% mutate( clean = str_replace( clean, "\\.", "Pizza") ) %>% * mutate( * clean = str_to_sentence(clean) * ) ``` ] .panel2-my_food-auto[ ``` # A tibble: 6 × 2 food clean <chr> <chr> 1 hamburgers Hamburgers 2 (hamburgers9 Hamburgers 3 1hot dog Hot dog 4 (fries12 Fries 5 2sandwich Sandwich 6 . Pizza ``` ] --- count: false .panel1-my_food-auto[ ```r tibble(fast_food) %>% rename(food = fast_food) %>% mutate( clean = str_remove_all( food, "[:digit:]") ) %>% mutate( clean = str_remove_all( clean, "\\(") ) %>% mutate( clean = str_replace( clean, "\\.", "Pizza") ) %>% mutate( clean = str_to_sentence(clean) ) %>% * select(Food = clean) ``` ] .panel2-my_food-auto[ ``` # A tibble: 6 × 1 Food <chr> 1 Hamburgers 2 Hamburgers 3 Hot dog 4 Fries 5 Sandwich 6 Pizza ``` ] <style> .panel1-my_food-auto { color: black; width: 38.6060606060606%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-my_food-auto { color: black; width: 59.3939393939394%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-my_food-auto { color: black; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% } </style> --- --- ## Last valuable tip Separate() It is not a str function but it is mostly used to separate strings into columns or rows. It comes from the `tidyr` package (loaded with tidyverse) Example: ``` CountryCurrency ValuePer_1USd Opinion 1 Bolivar- Venezuela 33137.833 OMG 2 Dollar- Australia 1.478 Ok 3 Peso- Colombia 4534.000 J e s u s... ``` -- ```r us_currency %>% separate(CountryCurrency, c("Currency", "Country"), sep = "-") ``` ``` Currency Country ValuePer_1USd Opinion 1 Bolivar Venezuela 33137.833 OMG 2 Dollar Australia 1.478 Ok 3 Peso Colombia 4534.000 J e s u s... ``` --- ## Let's remove that annoying space count: false .panel1-space-auto[ ```r *us_currency ``` ] .panel2-space-auto[ ``` CountryCurrency ValuePer_1USd Opinion 1 Bolivar- Venezuela 33137.833 OMG 2 Dollar- Australia 1.478 Ok 3 Peso- Colombia 4534.000 J e s u s... ``` ] --- count: false .panel1-space-auto[ ```r us_currency %>% * separate(CountryCurrency, * c("Currency", "Country"), * sep = "-" * ) ``` ] .panel2-space-auto[ ``` Currency Country ValuePer_1USd Opinion 1 Bolivar Venezuela 33137.833 OMG 2 Dollar Australia 1.478 Ok 3 Peso Colombia 4534.000 J e s u s... ``` ] --- count: false .panel1-space-auto[ ```r us_currency %>% separate(CountryCurrency, c("Currency", "Country"), sep = "-" ) %>% * mutate( * Country = str_trim( * Country, * side = "left") * ) ``` ] .panel2-space-auto[ ``` Currency Country ValuePer_1USd Opinion 1 Bolivar Venezuela 33137.833 OMG 2 Dollar Australia 1.478 Ok 3 Peso Colombia 4534.000 J e s u s... ``` ] --- count: false .panel1-space-auto[ ```r us_currency %>% separate(CountryCurrency, c("Currency", "Country"), sep = "-" ) %>% mutate( Country = str_trim( Country, side = "left") ) %>% * dplyr::filter( * Country == "Colombia" * ) ``` ] .panel2-space-auto[ ``` Currency Country ValuePer_1USd Opinion 1 Peso Colombia 4534 J e s u s... ``` ] <style> .panel1-space-auto { color: black; width: 38.6060606060606%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel2-space-auto { color: black; width: 59.3939393939394%; hight: 32%; float: left; padding-left: 1%; font-size: 80% } .panel3-space-auto { color: black; width: NA%; hight: 33%; float: left; padding-left: 1%; font-size: 80% } </style> --- --- ## Let's go practice 😢 1. Download the .rmd file sent to your email 2. install `learnr` 3. hit run --- ## Useful resources https://stringr.tidyverse.org/articles/regular-expressions.html https://stringr.tidyverse.org/ https://ladal.edu.au/regex.html https://nuitrcs.github.io/r-tidyverse/html/stringr.html https://regexone.com/lesson/excluding_characters? --- ## [\\T^h*e \\ $E^n#d \\] .center[![:scale 100%](images/final.png)]