I am trying to split my text data into tokens using the unnest_tokens function from the tidytext package. Some multi-word expressions appear several times, and I would like to keep each of them as a single token instead of splitting it into separate words.
Normal outcome:
library(dplyr)
library(tidytext)
df <- data.frame(
  Id = c(1, 2),
  Text = c('A first nice text', 'A second nice text')
)
df %>%
  unnest_tokens(word, Text)
  Id   word
1  1      a
2  1  first
3  1   nice
4  1   text
5  2      a
6  2 second
7  2   nice
8  2   text
What I would like (expression = "nice text"):
df <- data.frame(
  Id = c(1, 2),
  Text = c('A first nice text', 'A second nice text')
)
df %>%
  unnest_tokens(word, Text)
  Id      word
1  1         a
2  1     first
3  1 nice text
4  2         a
5  2    second
6  2 nice text
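For reference, one direction that might work is to protect the expression before tokenizing. This is only a minimal sketch, not a confirmed solution: it assumes the default word tokenizer keeps underscore-joined words together, and the names `phrase`, `placeholder` and `result` are made up for illustration.

```r
library(dplyr)
library(tidytext)

df <- data.frame(
  Id = c(1, 2),
  Text = c('A first nice text', 'A second nice text'),
  stringsAsFactors = FALSE
)

phrase <- "nice text"       # expression to keep as one token
placeholder <- "nice_text"  # hypothetical stand-in without a space

result <- df %>%
  # Join the expression with an underscore before tokenizing
  # (assumption: the word tokenizer does not split on "_").
  mutate(Text = gsub(phrase, placeholder, Text, fixed = TRUE)) %>%
  unnest_tokens(word, Text) %>%
  # Put the space back inside the protected expression.
  mutate(word = gsub(placeholder, phrase, word, fixed = TRUE))

print(result)
```

The match is case-sensitive, so a variant such as "Nice text" would need its own replacement, and the placeholder should be a string that does not otherwise occur in the data.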