
UPDATE: Thanks for the input so far. I have rewritten the question and added a better example to highlight the implicit requirements that were not covered by my first example.

Question

I am looking for a general tidy solution for removing ngrams that contain stopwords. In short, ngrams are strings of words separated by a space: a unigram contains 1 word, a bigram 2 words, and so on. My goal is to apply this to a data frame after using unnest_tokens(). The solution should work on a data frame containing a mix of ngrams of any length (uni, bi, tri, ...), or at least bigrams, trigrams and above.
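
For context, such a data frame with mixed ngram lengths can come straight out of unnest_tokens(); the call below is only an illustration and assumes the n and n_min arguments of the underlying ngram tokenizer:

library(tidytext)
library(tibble)

# illustration only: n_min = 1 together with n = 3 yields uni-, bi- and
# trigrams in a single data frame
tibble(Document = 1, text = "the basis of culture is ground water treatment") %>%
  unnest_tokens(ngram, text, token = "ngrams", n = 3, n_min = 1)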

New example data

ngram_df <- tibble::tribble(
  ~Document,                   ~ngram,
          1,                    "the",
          1,              "the basis",
          1,                  "basis",
          1,       "basis of culture",
          1,                "culture",
          1,        "is ground water",
          1,           "ground water",
          1, "ground water treatment"
  )
stopword_df <- tibble::tribble(
  ~word, ~lexicon,
  "the", "custom",
   "of", "custom",
   "is", "custom"
  )
desired_output <- tibble::tribble(
  ~Document,                   ~ngram,
          1,                  "basis",
          1,                "culture",
          1,           "ground water",
          1, "ground water treatment"
  )

Created on 2019-03-21 by the reprex package (v0.2.1)

Desired behaviour

  • the ngram_df should be transformed into the desired_output, using the stopwords from the word column in the stopword_df.
  • every row containing a stopword should be removed
  • word boundaries should be respected (i.e. looking for is should not remove basis); a sketch of one possible approach follows below

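For reference, here is a sketch of the kind of solution I have in mind, assuming tidyr's separate_rows() is an acceptable way to split the ngrams into words (an illustration, not a tested general answer):

library(dplyr)
library(tidyr)

ngram_df %>%
  mutate(id = row_number()) %>%                        # remember the original row
  separate_rows(ngram, sep = " ") %>%                  # one row per word
  group_by(id, Document) %>%
  filter(!any(ngram %in% stopword_df$word)) %>%        # drop ngrams that contain a stopword
  summarise(ngram = paste(ngram, collapse = " ")) %>%  # reassemble the ngram
  ungroup() %>%
  select(Document, ngram)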

My first attempt at a reprex is below:

example data

library(tidyverse)
library(tidytext)
df <- "Groundwater remediation is the process that is used to treat polluted groundwater by removing the pollutants or converting them into harmless products." %>% 
  enframe() %>% 
  unnest_tokens(ngrams, value, "ngrams", n = 2)
#apply magic here

df
#> # A tibble: 21 x 2
#>     name ngrams                 
#>    <int> <chr>                  
#>  1     1 groundwater remediation
#>  2     1 remediation is         
#>  3     1 is the                 
#>  4     1 the process            
#>  5     1 process that           
#>  6     1 that is                
#>  7     1 is used                
#>  8     1 used to                
#>  9     1 to treat               
#> 10     1 treat polluted         
#> # ... with 11 more rows

example list of stopwords

stopwords <- c("is", "the", "that", "to")

desired output

#> Source: local data frame [9 x 2]
#> Groups: <by row>
#> 
#> # A tibble: 9 x 2
#>    name ngrams                 
#>   <int> <chr>                  
#> 1     1 groundwater remediation
#> 2     1 treat polluted         
#> 3     1 polluted groundwater   
#> 4     1 groundwater by         
#> 5     1 by removing            
#> 6     1 pollutants or          
#> 7     1 or converting          
#> 8     1 them into              
#> 9     1 harmless products

Created on 2019-03-20 by the reprex package (v0.2.1)

(example sentence from: https://en.wikipedia.org/wiki/Groundwater_remediation)

Benjamin Schwetz

  • I guess the reason is to avoid getting "false" ngrams. For example, if you have the sentence "Sky is blue" and you remove is before finding bigrams, you would end up finding sky blue, which wouldn't be a true bigram if stopwords were considered. Maybe a workaround could be to replace all stopwords with one and the same unique placeholder string before finding ngrams and afterwards delete all ngrams containing the placeholder string? – TinglTanglBob Mar 20 '19 at 15:29
  • Correct @TinglTanglBob – Benjamin Schwetz Mar 20 '19 at 16:06
  • Check out "Text Mining with R" by Silge and Robinson, specifically this section here: https://www.tidytextmining.com/ngrams.html#counting-and-filtering-n-grams – Marian Minar Mar 20 '19 at 16:24
  • thanks, @MarianMinar. That's a good start. I'd like to see a version though that could handle words and bigrams & trigrams in one go. I realize that this is not obvious from my example above. I will try to update the question today. – Benjamin Schwetz Mar 21 '19 at 08:18
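
For illustration, a rough sketch of the placeholder workaround suggested by @TinglTanglBob above; the placeholder string "zzstopzz" is arbitrary and chosen only for this example:

library(tidyverse)
library(tidytext)

stopwords <- c("is", "the", "that", "to")
txt <- "Groundwater remediation is the process that is used to treat polluted groundwater by removing the pollutants or converting them into harmless products."

# mask every stopword *before* building bigrams, then drop bigrams touching the
# placeholder; word positions stay intact, so no "false" bigrams are created
stop_pattern <- regex(paste0("\\b(", paste(stopwords, collapse = "|"), ")\\b"),
                      ignore_case = TRUE)

txt %>%
  str_replace_all(stop_pattern, "zzstopzz") %>%
  enframe() %>%
  unnest_tokens(ngrams, value, "ngrams", n = 2) %>%
  filter(!str_detect(ngrams, "zzstopzz"))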

1 Answer

Here is another way, using the stopwords_collapsed pattern from the previous answer:

swc <- paste(stopwords, collapse = "|")
df <- df[str_detect(df$ngrams, swc) == FALSE, ] #select rows without stopwords

df
# A tibble: 8 x 2
   name ngrams                 
  <int> <chr>                  
1     1 groundwater remediation
2     1 treat polluted         
3     1 polluted groundwater   
4     1 groundwater by         
5     1 by removing            
6     1 pollutants or          
7     1 or converting          
8     1 harmless products 

Here is a simple benchmark comparing both approaches:

#benchmark (requires the rbenchmark package)
library(rbenchmark)
# txt is the example sentence from the question; stopwords_collapsed is the same collapsed pattern as swc above
txt <- "Groundwater remediation is the process that is used to treat polluted groundwater by removing the pollutants or converting them into harmless products."
stopwords_collapsed <- swc
txtexp <- rep(txt, 1000000)
dfexp <- txtexp %>% 
    enframe() %>% 
    unnest_tokens(ngrams, value, "ngrams", n = 2)

benchmark("mutate+filter (small text)" = {df1 <- df %>%
        mutate(
            has_stop_word = str_detect(ngrams, stopwords_collapsed)
        ) %>%
        filter(!has_stop_word)},
          "[] row selection (small text)" = {df2 <- df[str_detect(df$ngrams, stopwords_collapsed) == FALSE, ]},
        "mutate+filter (large text)" = {df3 <- dfexp %>%
            mutate(
                has_stop_word = str_detect(ngrams, stopwords_collapsed)
            ) %>%
            filter(!has_stop_word)},
        "[] row selection (large text)" = {df4 <- dfexp[str_detect(dfexp$ngrams, stopwords_collapsed) == FALSE, ]},
          replications = 5,
          columns = c("test", "replications", "elapsed")
)

                           test replications elapsed
4 [] row selection (large text)            5   30.03
2 [] row selection (small text)            5    0.00
3    mutate+filter (large text)            5   30.64
1    mutate+filter (small text)            5    0.00
  • If you want to match the entire word and not a substring, you need to add in word boundaries: `swc <- paste(paste0("\\b", stopwords, "\\b"), collapse = "|")`. – Oliver Jul 30 '21 at 05:42
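
Following that comment, a boundary-aware version applied to the updated example from the question (ngram_df / stopword_df); a sketch only, not a tested general solution:

library(stringr)

# word boundaries keep "is" from matching inside "basis"
swc_boundary <- paste(paste0("\\b", stopword_df$word, "\\b"), collapse = "|")
ngram_df[!str_detect(ngram_df$ngram, swc_boundary), ]
# keeps: "basis", "culture", "ground water", "ground water treatment"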