Tidytext - set expressions as a single token

Question

I am trying to split my text data into tokens using the unnest_tokens function from the tidytext package. The problem is that some expressions appear multiple times, and I would like to keep each of them as a single token instead of splitting it into several tokens.

Normal outcome:

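library(dplyr)
library(tidytext)

df <- data.frame(
  Id = c(1, 2),
  Text = c('A first nice text', 'A second nice text')
)

# Column names are case-sensitive: tokenize Text into a new column Word
df %>% 
  unnest_tokens(Word, Text)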

  Id   Word
1  1      a
2  1  first
3  1   nice
4  1   text
5  2      a
6  2 second
7  2   nice
8  2   text

What I would like (expression = "nice text"):

df <- data.frame(
  Id = c(1, 2),
  Text = c('A first nice text', 'A second nice text')
)

df %>% 
  unnest_tokens(Word, Text)

  Id      Word
1  1         a
2  1     first
3  1 nice text
4  2         a
5  2    second
6  2 nice text

Please consider accepting one of the answers if they solve your problem. — deschen, Dec 12 '21 at 12:27

Chris Ruehlemann · Answer 1 · 2021-12-04T15:10:29.793

Here's a concise solution based on negative lookahead (?!...): separate_rows splits Text on whitespace \\s, except where that whitespace is directly followed by the word text, which in this data only happens after nice (\\b are word boundary anchors, in case you have, say, "nice texts", which you do want to separate).

library(tidyr)
df %>%
  separate_rows(Text, sep = "(?!\\bnice\\b)\\s(?!\\btext\\b)")
# A tibble: 6 × 2
     Id Text     
  <dbl> <chr>    
1     1 A        
2     1 first    
3     1 nice text
4     2 A        
5     2 second   
6     2 nice text
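
One caveat worth sketching: the lookahead placed in front of \\s is evaluated at the whitespace itself, so the pattern above effectively protects any whitespace that precedes the word text, whatever comes before it. The variant below is only a sketch and not part of the original answer. It nests a lookbehind inside the lookahead so that just the exact pair nice text is kept together. It uses base R's strsplit with perl = TRUE so that it runs without extra packages, and the example sentence is made up for illustration.

```r
# Pattern from the answer, and a stricter variant (an assumption, not the
# answer's own regex): split on whitespace unless it sits between
# "nice" and "text".
loose  <- "(?!\\bnice\\b)\\s(?!\\btext\\b)"
strict <- "\\s(?!(?<=\\bnice\\s)text\\b)"

# Made-up sentence that contains "text" without a preceding "nice"
x <- "A nice day and a short text"

# The loose pattern keeps "short text" together
loose_tokens <- strsplit(x, loose, perl = TRUE)[[1]]

# The strict pattern splits every word here ...
strict_tokens <- strsplit(x, strict, perl = TRUE)[[1]]

# ... but still protects the target expression
target_tokens <- strsplit("A first nice text", strict, perl = TRUE)[[1]]
```

Since the answer's patterns already rely on Perl-style regex features in separate_rows, the strict pattern should presumably work as its sep argument too, but treat that as an assumption to check against your tidyr version.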

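A more advanced regex uses the PCRE verbs (*SKIP)(*F): the first alternative matches nice text and then deliberately fails and skips past it, so only the remaining whitespace \\s is left to act as a separator: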

df %>%
  separate_rows(Text, sep = "(\\bnice text\\b)(*SKIP)(*F)|\\s")

For more info, see "How do (*SKIP) or (*F) work on regex?"
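
Because the question mentions several recurring expressions, the same (*SKIP)(*F) idea can be extended to a list of phrases. This is a minimal sketch under stated assumptions: split_keep_phrases is a hypothetical helper (it is not part of tidyr or tidytext), the second phrase and sentence are invented for illustration, and the phrases are assumed to contain only letters, digits and single spaces so that no regex escaping is needed.

```r
# Hypothetical helper (not part of tidyr or tidytext): split on whitespace
# but keep every phrase in `phrases` as a single token.
# Assumption: phrases contain only letters, digits and single spaces,
# so they need no regex escaping.
split_keep_phrases <- function(x, phrases) {
  protected <- paste(phrases, collapse = "|")
  # Match a protected phrase, then discard that match and skip past it;
  # otherwise split on a run of whitespace.
  pattern <- paste0("\\b(?:", protected, ")\\b(*SKIP)(*F)|\\s+")
  strsplit(x, pattern, perl = TRUE)
}

tokens <- split_keep_phrases(
  c("A first nice text", "A second nice text with a single token"),
  phrases = c("nice text", "single token")
)
```

The result is a list with one character vector of tokens per input string. The pattern built inside the helper has the same shape as the one passed to separate_rows in this answer, so it could likely be used as sep there as well.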

That is what I had in mind: a solution with a regex that excludes "nice text" from the split. — deschen, Dec 05 '21 at 09:51

score 1 · Answer 2 · edited Dec 05 '21 at 09:31

A bit verbose, and there might be an option to exclude certain phrases in the unnest_tokens, but it does the trick:

library(tidyverse)
library(tidytext)

# Tokenize into one lowercase word per row
df <- data.frame(Id = c(1, 2),
                 Text = c('A first nice text', 'A second nice text')) %>%
  unnest_tokens(Word, Text)

df %>%
  group_by(Id) %>%
  # Relabel "text" as "nice text" when the previous word is "nice"
  mutate(Word = if_else(lag(Word, default = '') == 'nice' & Word == 'text', 'nice text', Word)) %>%
  mutate(temp_id = row_number()) %>%
  # Drop the now redundant "nice" row just before the merged token
  filter(temp_id != temp_id[Word == 'nice text'] - 1) %>%
  ungroup() %>%
  select(-temp_id)

which gives:

# A tibble: 6 x 2
     Id Word     
  <dbl> <chr>    
1     1 a        
2     1 first    
3     1 nice text
4     2 a        
5     2 second   
6     2 nice text
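
The pipeline above appears to assume exactly one nice text per Id, because the filter() step compares against the position of a single merged token. As a hedged sketch of the same merge-after-tokenizing idea that also copes with zero or several occurrences, here is a base R version. merge_pair is a hypothetical helper introduced for illustration, it assumes the two words of the expression are different, and the tokens data frame is typed in by hand to mirror the output of unnest_tokens.

```r
# Hypothetical helper: merge each adjacent pair (first, second) in a token
# vector into one token. Assumes first and second are different words.
merge_pair <- function(words, first, second) {
  n <- length(words)
  if (n < 2) return(words)
  # hit[i] is TRUE when words[i] == first and words[i + 1] == second
  hit <- c(words[-n] == first & words[-1] == second, FALSE)
  words[hit] <- paste(first, second)
  # Drop the token that follows each hit; it is now part of the merged token
  words[!c(FALSE, hit[-n])]
}

# One token per row, mirroring what unnest_tokens() returns for the example
tokens <- data.frame(
  Id   = c(1, 1, 1, 1, 2, 2, 2, 2),
  Word = c("a", "first", "nice", "text", "a", "second", "nice", "text"),
  stringsAsFactors = FALSE
)

# Apply the merge within each Id, then rebuild the long data frame
merged <- lapply(split(tokens$Word, tokens$Id), merge_pair,
                 first = "nice", second = "text")
result <- data.frame(
  Id   = rep(as.numeric(names(merged)), lengths(merged)),
  Word = unlist(merged, use.names = FALSE),
  stringsAsFactors = FALSE
)
```

Keeping the merge in a small vector function makes it easy to test on its own before wiring it into a grouped dplyr pipeline.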
