UPDATE: Thanks for the input so far. I have rewritten the question and added a better example to highlight the implicit requirements that were not covered in my first example.
Question
I am looking for a general tidy solution to removing ngrams containing stopwords. In short, ngrams are strings of words separated by a space. A unigram contains 1 word, a bigram 2 words, and so on. My goal would be to apply this on a data frame after using unnest_tokens(). The solution should work with a data frame containing a mix of ngrams of any length (uni, bi, tri, ...), or at least bi & tri and above.
- For more information on ngrams, see wiki: https://en.wikipedia.org/wiki/N-gram
- I am aware of this question: Remove ngrams with leading and trailing stopwords. However, I am looking for a general solution that does not require the stopword to be leading or trailing and that also scales reasonably well.
- As pointed out in the comments, there is a solution for bigrams documented here: https://www.tidytextmining.com/ngrams.html#counting-and-filtering-n-grams
New example data
ngram_df <- tibble::tribble(
~Document, ~ngram,
1, "the",
1, "the basis",
1, "basis",
1, "basis of culture",
1, "culture",
1, "is ground water",
1, "ground water",
1, "ground water treatment"
)
stopword_df <- tibble::tribble(
~word, ~lexicon,
"the", "custom",
"of", "custom",
"is", "custom"
)
desired_output <- tibble::tribble(
~Document, ~ngram,
1, "basis",
1, "culture",
1, "ground water",
1, "ground water treatment"
)
Created on 2019-03-21 by the reprex package (v0.2.1)
Desired behaviour
- the ngram_df should be transformed into the desired_output, using the stopwords from the word column in the stopword_df.
- every row containing a stopword should be removed
- word boundaries should be respected (i.e. looking for "is" should not remove "basis")
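To make the three requirements above concrete, here is a minimal sketch of the behaviour I have in mind, not a claim that this is the best or most scalable way to do it. It assumes the stopwords contain no regex metacharacters (they are pasted into a pattern unescaped), and the names stop_pattern and result are just illustrative.

```r
library(dplyr)
library(stringr)

# Same example data as above, repeated so the sketch runs on its own
ngram_df <- tibble::tribble(
  ~Document, ~ngram,
  1, "the",
  1, "the basis",
  1, "basis",
  1, "basis of culture",
  1, "culture",
  1, "is ground water",
  1, "ground water",
  1, "ground water treatment"
)
stopword_df <- tibble::tribble(
  ~word, ~lexicon,
  "the", "custom",
  "of", "custom",
  "is", "custom"
)

# One alternation wrapped in word boundaries, i.e. \b(the|of|is)\b
# ASSUMPTION: stopwords are plain words without regex special characters
stop_pattern <- paste0("\\b(", paste(stopword_df$word, collapse = "|"), ")\\b")

# Drop every row whose ngram contains a stopword as a whole word
result <- ngram_df %>%
  filter(!str_detect(ngram, stop_pattern))
```

On the example data this keeps "basis" (the "is" inside it is not a whole word) and drops "is ground water", which is the word-boundary behaviour described above.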
My first attempt at a reprex is below:
example data
library(tidyverse)
library(tidytext)
df <- "Groundwater remediation is the process that is used to treat polluted groundwater by removing the pollutants or converting them into harmless products." %>%
  enframe() %>%
  unnest_tokens(ngrams, value, token = "ngrams", n = 2)
# apply magic here
df
#> # A tibble: 21 x 2
#> name ngrams
#> <int> <chr>
#> 1 1 groundwater remediation
#> 2 1 remediation is
#> 3 1 is the
#> 4 1 the process
#> 5 1 process that
#> 6 1 that is
#> 7 1 is used
#> 8 1 used to
#> 9 1 to treat
#> 10 1 treat polluted
#> # ... with 11 more rows
example list of stopwords
stopwords <- c("is", "the", "that", "to")
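For comparison, the bigram-only approach from the tidytext book linked above (split each bigram into two columns, then filter both) would look roughly like the sketch below. It is run on a small hand-typed subset of the bigrams rather than the full unnest_tokens() output, and the column names word1 and word2 are just illustrative. It is hard-wired to exactly two words, so it does not cover the mixed-length case.

```r
library(dplyr)
library(tidyr)

# Hand-typed subset of the bigrams shown above (not the full tokenised output)
df_small <- tibble::tibble(
  name = 1L,
  ngrams = c("groundwater remediation", "remediation is", "is the", "treat polluted")
)
stopwords <- c("is", "the", "that", "to")

# Split each bigram into its two words, keep the original column,
# and drop rows where either word is a stopword
bigrams_kept <- df_small %>%
  separate(ngrams, c("word1", "word2"), sep = " ", remove = FALSE) %>%
  filter(!word1 %in% stopwords, !word2 %in% stopwords) %>%
  select(name, ngrams)
```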
desired output
#> Source: local data frame [9 x 2]
#> Groups: <by row>
#>
#> # A tibble: 9 x 2
#> name ngrams
#> <int> <chr>
#> 1 1 groundwater remediation
#> 2 1 treat polluted
#> 3 1 polluted groundwater
#> 4 1 groundwater by
#> 5 1 by removing
#> 6 1 pollutants or
#> 7 1 or converting
#> 8 1 them into
#> 9 1 harmless products
Created on 2019-03-20 by the reprex package (v0.2.1)
(example sentence from: https://en.wikipedia.org/wiki/Groundwater_remediation)