
UPDATE: Thanks for the input so far. I have rewritten the question and added a better example to highlight the implicit requirements that were not covered by my first example.

Question

I am looking for a general tidy solution for removing ngrams that contain stopwords. In short, ngrams are strings of words separated by a space: a unigram contains 1 word, a bigram 2 words, and so on. My goal is to apply this to a data frame after using unnest_tokens(). The solution should work on a data frame containing a mix of ngrams of any length (uni, bi, tri, ...), or at least bigrams, trigrams and above.
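
For context, such a data frame with mixed ngram lengths can come straight out of unnest_tokens(); the call below is only an illustration and assumes the n and n_min arguments of the underlying ngram tokenizer:

library(tidytext)
library(tibble)

# illustration only: n_min = 1 together with n = 3 yields uni-, bi- and
# trigrams in a single data frame
tibble(Document = 1, text = "the basis of culture is ground water treatment") %>%
  unnest_tokens(ngram, text, token = "ngrams", n = 3, n_min = 1)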

New example data

ngram_df <- tibble::tribble(
  ~Document,                   ~ngram,
          1,                    "the",
          1,              "the basis",
          1,                  "basis",
          1,       "basis of culture",
          1,                "culture",
          1,        "is ground water",
          1,           "ground water",
          1, "ground water treatment"
  )
stopword_df <- tibble::tribble(
  ~word, ~lexicon,
  "the", "custom",
   "of", "custom",
   "is", "custom"
  )
desired_output <- tibble::tribble(
  ~Document,                   ~ngram,
          1,                  "basis",
          1,                "culture",
          1,           "ground water",
          1, "ground water treatment"
  )

Created on 2019-03-21 by the reprex package (v0.2.1)

Desired behaviour

  • the ngram_df should be transformed into the desired_output, using the stopwords from the word column in the stopword_df.
  • every row containing a stopword should be removed
  • word boundaries should be respected (i.e. looking for is should not remove basis); a sketch of one possible approach follows below

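For reference, here is a sketch of the kind of solution I have in mind, assuming tidyr's separate_rows() is an acceptable way to split the ngrams into words (an illustration, not a tested general answer):

library(dplyr)
library(tidyr)

ngram_df %>%
  mutate(id = row_number()) %>%                        # remember the original row
  separate_rows(ngram, sep = " ") %>%                  # one row per word
  group_by(id, Document) %>%
  filter(!any(ngram %in% stopword_df$word)) %>%        # drop ngrams that contain a stopword
  summarise(ngram = paste(ngram, collapse = " ")) %>%  # reassemble the ngram
  ungroup() %>%
  select(Document, ngram)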

My first attempt at a reprex is below:

example data

library(tidyverse)
library(tidytext)
df <- "Groundwater remediation is the process that is used to treat polluted groundwater by removing the pollutants or converting them into harmless products." %>% 
  enframe() %>% 
  unnest_tokens(ngrams, value, "ngrams", n = 2)
#apply magic here

df
#> # A tibble: 21 x 2
#>     name ngrams                 
#>    <int> <chr>                  
#>  1     1 groundwater remediation
#>  2     1 remediation is         
#>  3     1 is the                 
#>  4     1 the process            
#>  5     1 process that           
#>  6     1 that is                
#>  7     1 is used                
#>  8     1 used to                
#>  9     1 to treat               
#> 10     1 treat polluted         
#> # ... with 11 more rows

example list of stopwords

stopwords <- c("is", "the", "that", "to")

desired output

#> Source: local data frame [9 x 2]
#> Groups: <by row>
#> 
#> # A tibble: 9 x 2
#>    name ngrams                 
#>   <int> <chr>                  
#> 1     1 groundwater remediation
#> 2     1 treat polluted         
#> 3     1 polluted groundwater   
#> 4     1 groundwater by         
#> 5     1 by removing            
#> 6     1 pollutants or          
#> 7     1 or converting          
#> 8     1 them into              
#> 9     1 harmless products

Created on 2019-03-20 by the reprex package (v0.2.1)

(example sentence from: https://en.wikipedia.org/wiki/Groundwater_remediation)

Benjamin Schwetz

  • I guess the reason is to avoid getting "false" ngrams. For example, if you have the sentence "Sky is blue" and you remove is before finding bigrams, you would end up finding sky blue, which wouldn't be a true bigram if stopwords were considered. Maybe a workaround could be to replace all stopwords with one and the same unique placeholder string before finding ngrams and afterwards delete all ngrams containing the placeholder string? – TinglTanglBob Mar 20 '19 at 15:29
  • Correct @TinglTanglBob – Benjamin Schwetz Mar 20 '19 at 16:06
  • Check out "Text Mining with R" by Silge and Robinson, specifically this section here: https://www.tidytextmining.com/ngrams.html#counting-and-filtering-n-grams – Marian Minar Mar 20 '19 at 16:24
  • thanks, @MarianMinar. That's a good start. I'd like to see a version though that could handle words and bigrams & trigrams in one go. I realize that this is not obvious from my example above. I will try to update the question today. – Benjamin Schwetz Mar 21 '19 at 08:18
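
For illustration, a rough sketch of the placeholder workaround suggested by @TinglTanglBob above; the placeholder string "zzstopzz" is arbitrary and chosen only for this example:

library(tidyverse)
library(tidytext)

stopwords <- c("is", "the", "that", "to")
txt <- "Groundwater remediation is the process that is used to treat polluted groundwater by removing the pollutants or converting them into harmless products."

# mask every stopword *before* building bigrams, then drop bigrams touching the
# placeholder; word positions stay intact, so no "false" bigrams are created
stop_pattern <- regex(paste0("\\b(", paste(stopwords, collapse = "|"), ")\\b"),
                      ignore_case = TRUE)

txt %>%
  str_replace_all(stop_pattern, "zzstopzz") %>%
  enframe() %>%
  unnest_tokens(ngrams, value, "ngrams", n = 2) %>%
  filter(!str_detect(ngrams, "zzstopzz"))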

1 Answer

Here is another way, using the stopwords_collapsed pattern from the previous answer:

swc <- paste(stopwords, collapse = "|")
df <- df[str_detect(df$ngrams, swc) == FALSE, ] #select rows without stopwords

df
# A tibble: 8 x 2
   name ngrams                 
  <int> <chr>                  
1     1 groundwater remediation
2     1 treat polluted         
3     1 polluted groundwater   
4     1 groundwater by         
5     1 by removing            
6     1 pollutants or          
7     1 or converting          
8     1 harmless products 

Here is a simple benchmark comparing both approaches:

#benchmark (requires the rbenchmark package)
library(rbenchmark)
# txt is the example sentence from the question; stopwords_collapsed is the same collapsed pattern as swc above
txt <- "Groundwater remediation is the process that is used to treat polluted groundwater by removing the pollutants or converting them into harmless products."
stopwords_collapsed <- swc
txtexp <- rep(txt, 1000000)
dfexp <- txtexp %>% 
    enframe() %>% 
    unnest_tokens(ngrams, value, "ngrams", n = 2)

benchmark("mutate+filter (small text)" = {df1 <- df %>%
        mutate(
            has_stop_word = str_detect(ngrams, stopwords_collapsed)
        ) %>%
        filter(!has_stop_word)},
          "[] row selection (small text)" = {df2 <- df[str_detect(df$ngrams, stopwords_collapsed) == FALSE, ]},
        "mutate+filter (large text)" = {df3 <- dfexp %>%
            mutate(
                has_stop_word = str_detect(ngrams, stopwords_collapsed)
            ) %>%
            filter(!has_stop_word)},
        "[] row selection (large text)" = {df4 <- dfexp[str_detect(dfexp$ngrams, stopwords_collapsed) == FALSE, ]},
          replications = 5,
          columns = c("test", "replications", "elapsed")
)

                           test replications elapsed
4 [] row selection (large text)            5   30.03
2 [] row selection (small text)            5    0.00
3    mutate+filter (large text)            5   30.64
1    mutate+filter (small text)            5    0.00
  • If you want to match the entire word and not a substring, you need to add in word boundaries: `swc <- paste(paste0("\\b", stopwords, "\\b"), collapse = "|")`. – Oliver Jul 30 '21 at 05:42
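
Following that comment, a boundary-aware version applied to the updated example from the question (ngram_df / stopword_df); a sketch only, not a tested general solution:

library(stringr)

# word boundaries keep "is" from matching inside "basis"
swc_boundary <- paste(paste0("\\b", stopword_df$word, "\\b"), collapse = "|")
ngram_df[!str_detect(ngram_df$ngram, swc_boundary), ]
# keeps: "basis", "culture", "ground water", "ground water treatment"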