This came up as I answered another question here: How to extract two specific patterns before another specific pattern using R?
I would like to extract all matches of a pattern from each element of a string vector and concatenate them into a single string per element, so the result is one character vector.
In the example, I want to extract all words that precede the word "apple", which can be done with the regex '\\b[[:alpha:]]+\\b(?=\\sapple)'.
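To make the lookahead concrete, here is a minimal sketch on a single string. It assumes stringr is installed; the names x and words are only illustrative and are not part of the data below.

```r
library(stringr)

# Illustrative input: the first phrase from the example data
x <- "I like apples. I love apple pies. An apple a day..."

# (?=\\sapple) is a lookahead: the word must be followed by whitespace and
# "apple", but that trailing part is not included in the match.
# str_extract_all() returns a list, so [[1]] pulls out the matches for x.
words <- str_extract_all(x, "\\b[[:alpha:]]+\\b(?=\\sapple)")[[1]]
words
```

Note that "like" is matched too, because the lookahead only requires the next word to start with "apple", so "apples" qualifies.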
The df tibble has the input data as "phrase" and the expected output as "output":
data
library(tidyverse)

# Example input ("phrase") and the expected result ("output")
df <- tibble(phrase = c("I like apples. I love apple pies. An apple a day...", "I hate apples. Specially apple pies"),
             output = c('like; love; An', 'hate; Specially'))
df
# A tibble: 2 x 2
phrase output
<chr> <chr>
1 I like apples. I love apple pies. An apple a day... like; love; An
2 I hate apples. Specially apple pies hate; Specially
I have done it successfully with this mutate %>% unnest_wider %>% unite approach:
df %>% mutate(output=str_extract_all(phrase, '\\b[[:alpha:]]+\\b(?=\\sapple)'))%>%
unnest_wider(col=output, names_sep = '_')%>%
unite(starts_with('output_'), col='output', sep = '; ', na.rm = TRUE)
I think this may be too involved. Is there a more straightforward approach? I was thinking of something with tidyr::extract or tidyr::separate, but found no answer to this.
I also tried to use str_extract_all and then pasted the output, but got a weirdly formatted string:
df %>% mutate(output = str_extract_all(phrase, '\\b[[:alpha:]]+\\b(?=\\sapple)') %>%
paste(., collapse = ';'))
# A tibble: 2 x 2
phrase output
<chr> <chr>
1 I like apples. I love apple pies. An apple a day... "c(\"like\", \"love\", \"An\");c(\"hate\", \"Specially\"…
2 I hate apples. Specially apple pies "c(\"like\", \"love\", \"An\");c(\"hate\", \"Specially\"…
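One possible reading of that result, offered as a sketch rather than a definitive answer: str_extract_all returns a list with one character vector per row, and paste applied to the whole list coerces each element to text such as c("like", "love", "An") before collapsing all rows into one string, which is then repeated in every row. Collapsing each list element separately avoids this. The example below assumes dplyr, stringr and purrr are installed, and the name result is only illustrative.

```r
library(dplyr)
library(stringr)
library(purrr)

# Same input phrases as in the example data
df <- tibble(phrase = c("I like apples. I love apple pies. An apple a day...",
                        "I hate apples. Specially apple pies"))

result <- df %>%
  mutate(
    output = str_extract_all(phrase, "\\b[[:alpha:]]+\\b(?=\\sapple)") %>%
      # Collapse the matches of each row on their own, giving one string per row
      map_chr(function(x) paste(x, collapse = "; "))
  )

result
```

This keeps everything inside a single mutate call, without the unnest_wider and unite steps.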