extract all matches of a pattern and concatenate output with mutate

Question

This came up as I answered another question here: How to extract two specific patterns before another specific pattern using R?

I would like to extract all matches of a pattern from a string vector and concatenate the output into a single character vector. In the example, I want to extract all words that precede the word "apple", which can be done with the regex '\\b[[:alpha:]]+\\b(?=\\sapple)'

The df tibble has the input data as "phrase" and the expected output as "output":

data

library(tidyverse)

df <- tibble(phrase = c("I like apples. I love apple pies. An apple a day...", "I hate apples. Specially apple pies"),
             output = c('like; love; An', 'hate; Specially'))
df
# A tibble: 2 x 2
  phrase                                              output         
  <chr>                                               <chr>          
1 I like apples. I love apple pies. An apple a day... like; love; An 
2 I hate apples. Specially apple pies                 hate; Specially

I have done it successfully with this mutate %>% unnest_wider %>% unite approach:

df %>% mutate(output = str_extract_all(phrase, '\\b[[:alpha:]]+\\b(?=\\sapple)')) %>%
        unnest_wider(col = output, names_sep = '_') %>%
        unite(starts_with('output_'), col = 'output', sep = '; ', na.rm = TRUE)

I think this may be too involved. Is there a more straightforward approach? I was thinking of something with tidyr::extract or tidyr::separate, but found no answer to this.

I also tried to use str_extract_all, and then pasted the output, but got a weirdly formatted string:

df %>% mutate(output = str_extract_all(phrase, '\\b[[:alpha:]]+\\b(?=\\sapple)') %>%
                      paste(., collapse = ';'))

# A tibble: 2 x 2
  phrase                                              output                                                   
  <chr>                                               <chr>                                                    
1 I like apples. I love apple pies. An apple a day... "c(\"like\", \"love\", \"An\");c(\"hate\", \"Specially\"…
2 I hate apples. Specially apple pies                 "c(\"like\", \"love\", \"An\");c(\"hate\", \"Specially\"…
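
The odd string is most likely explained by two things: str_extract_all returns a list with one character vector per input string, and paste coerces a list with as.character, which deparses each element into text such as c("like", "love", "An"). The collapse argument then glues all rows into one string, which mutate recycles across the column. A minimal sketch of that behaviour outside of mutate (the variable names `pattern`, `matches` and `pasted` are only illustrative):

```r
library(stringr)

phrase <- c("I like apples. I love apple pies. An apple a day...",
            "I hate apples. Specially apple pies")
pattern <- "\\b[[:alpha:]]+\\b(?=\\sapple)"

# str_extract_all() returns a list: one character vector per input string
matches <- str_extract_all(phrase, pattern)
class(matches)
lengths(matches)

# paste() coerces the whole list with as.character(), deparsing each element,
# and collapse then joins every row into one length-1 string
pasted <- paste(matches, collapse = ";")
length(pasted)
cat(pasted, "\n")
```

So the paste call has to be applied to each list element separately rather than to the list as a whole.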

How about using `map_chr(str_extract_all(phrase, '\\b[[:alpha:]]+\\b(?=\\sapple)'), paste, collapse = "; ")`? — Ritchie Sacramento, Sep 08 '21 at 00:22
It is just embarrassing to see how simple it actually was. Maybe my brain just melted from looking into tidyr too much, but it was right in front of my face, thanks. Please post it as an answer. — GuedesBF, Sep 08 '21 at 00:26
I actually tried to paste the output of `str_extract_all`, as in `str_extract_all(....) %>% paste(...)` but got weird, unformatted text. `map_chr` does the trick. — GuedesBF, Sep 08 '21 at 00:29
Hi @ritchie-sacramento. That's a great answer. What if I only want to combine the distinct matches? For example, I got 2 `like` and 3 `love` after pattern matching and I only want to combine `like; love` (distinct words only). How can I do that? — Roy, Jan 18 '23 at 20:44
@Roy - Please post a new question and if I see it will be happy to try and help. — Ritchie Sacramento, Jan 18 '23 at 22:38
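
For reference, the map_chr suggestion from the comments can be written out in full as below. This is a sketch using the question's data, not an accepted answer. The names `pattern`, `result` and `output_distinct` are illustrative, and the `unique()` variant is only one possible way to address the follow-up about distinct matches.

```r
library(tibble)
library(dplyr)
library(purrr)
library(stringr)

df <- tibble(phrase = c("I like apples. I love apple pies. An apple a day...",
                        "I hate apples. Specially apple pies"))
pattern <- "\\b[[:alpha:]]+\\b(?=\\sapple)"

result <- df %>%
  mutate(
    # paste() is applied to each list element, giving one string per row
    output = map_chr(str_extract_all(phrase, pattern), paste, collapse = "; "),
    # assumption: dropping repeated matches with unique() before pasting
    output_distinct = map_chr(str_extract_all(phrase, pattern),
                              ~ paste(unique(.x), collapse = "; "))
  )

result$output
```

Because map_chr calls paste once per row, each row keeps only its own matches, unlike pasting the whole list at once.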
