Removing words featured in character vector from string

Question

I have a character vector of stopwords in R:

stopwords = c("a" ,
            "able" ,
            "about" ,
            "above" ,
            "abst" ,
            "accordance" ,
            ...
            "yourself" ,
            "yourselves" ,
            "you've" ,
            "z" ,
            "zero")

Let's say I have the string:

str <- c("I have zero a accordance")

How can I remove my defined stopwords from str?

I think gsub or another grep tool could be a good candidate to pull this off, although other recommendations are welcome.

@akrun better `gsub(paste0("\\b(",paste(stopwords, collapse="|"),")\\b"), "", str)`, otherwise every `a` will be deleted. — nicola, Mar 04 '16 at 08:05
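To illustrate the point of the comment above, here is a minimal sketch comparing a plain alternation pattern with the word-boundary version. The shortened stopword list and the names `naive` and `bounded` are only for illustration.

```r
str <- "I have zero a accordance"
stopwords <- c("a", "able", "about", "above", "abst", "accordance",
               "yourself", "yourselves", "you've", "z", "zero")

# Plain alternation: matches "a" anywhere, even inside other words
naive <- paste(stopwords, collapse = "|")

# Alternation wrapped in \b ... \b: matches whole words only
bounded <- paste0("\\b(", paste(stopwords, collapse = "|"), ")\\b")

naive_result   <- gsub(naive, "", "banana")    # every "a" is stripped
bounded_result <- gsub(bounded, "", "banana")  # left untouched
str_result     <- gsub(bounded, "", str)       # stopwords removed, spaces remain
```

The spaces that surrounded the removed words are left behind, so the result may need trimming afterwards.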

Mikko · Answer 1 · 2016-03-05T19:24:18.380

27

Try this:

str <- c("I have zero a accordance")

stopwords = c("a", "able", "about", "above", "abst", "accordance", "yourself",
"yourselves", "you've", "z", "zero")

x <- unlist(strsplit(str, " "))

x <- x[!x %in% stopwords]

paste(x, collapse = " ")

# [1] "I have"

Addition: Writing a "removeWords" function is simple so it is not necessary to load an external package for this purpose:

removeWords <- function(str, stopwords) {
  x <- unlist(strsplit(str, " "))
  paste(x[!x %in% stopwords], collapse = " ")
}

removeWords(str, stopwords)
# [1] "I have"

edited Mar 05 '16 at 19:24

answered Mar 04 '16 at 07:54

I find this way better than the implemented function in the tm package, because the latter has size limits. I work with a corpus of forum comments and wanted to remove all the usernames from the text (around 70000). I kept getting an error from R, because the regex was too large. Thank you! – Anastasia Pupynina May 20 '17 at 15:29

This solution is way faster than the `tm` package! Thanks for sharing!! – Peter Aug 04 '19 at 15:42

Note that this is case-sensitive, so blacklisted words starting a sentence (for example) will not be removed. It also splits by spaces, so blacklisted words next to any punctuation won't be removed either. Whether this is a big deal or not depends on your purposes. For the first issue, either change line 2 to `x <- tolower(unlist(strsplit(str, " ")))` to make everything lower-case to match the stopword list (if preserving capitalisation is not important), or duplicate the stopword list to have a copy of each word starting with a capital letter (if capitalisation is important). – DuckPyjamas Mar 08 '23 at 19:34
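One possible way to address both caveats without changing the capitalisation of the words that are kept is to build a normalised copy of each token purely for the comparison. This is only a sketch; `remove_words_ci` is a hypothetical name, not part of the answer above or of any package.

```r
stopwords <- c("a", "able", "about", "above", "abst", "accordance",
               "yourself", "yourselves", "you've", "z", "zero")

remove_words_ci <- function(str, stopwords) {
  x <- unlist(strsplit(str, " "))
  # Comparison key: lower-cased token with leading/trailing punctuation stripped.
  # Inner punctuation (the apostrophe in "you've") is kept.
  key <- tolower(gsub("^[[:punct:]]+|[[:punct:]]+$", "", x))
  # Keep the original tokens whose key is not a stopword
  paste(x[!key %in% tolower(stopwords)], collapse = " ")
}

result <- remove_words_ci("A zero, I have accordance.", stopwords)
# [1] "I have"
```

Note that punctuation attached to a removed word is dropped along with it, which may or may not be what you want.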

RHertel · Accepted Answer · 2016-03-04T08:24:47.033

21

You could use the tm package for this:

require("tm")
removeWords(str,stopwords)
#[1] "I have   "
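The result above still contains the spaces that surrounded the removed words. If that matters, a small base R clean-up step could look like the sketch below (the input string is copied from the output above so that the snippet runs without tm; `removed` and `cleaned` are illustrative names).

```r
removed <- "I have   "  # output of removeWords() shown above

# Collapse runs of whitespace to one space, then trim both ends
cleaned <- trimws(gsub("[[:space:]]+", " ", removed))
cleaned
# [1] "I have"
```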

edited Mar 04 '16 at 08:24

answered Mar 04 '16 at 08:06

Harrison Jones · Answer 3 · 2022-12-21T17:16:18.143

Here is another option for a function if you want the code to be vectorized for many sentences, not just one. It borrows content from Mikko's original answer.

remove_words <- function(str, words) {
  purrr::map_chr(
    str,
    function(sentence) {
      sentence_split <- unlist(strsplit(sentence, " "))
      paste(sentence_split[!sentence_split %in% words], collapse = " ")
    }
  )
}

remove_words(c('Hello world', 'This is another sentence', 'Test sentence 3'), c('world', 'sentence'))
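If you would rather not depend on purrr, the same idea can be sketched in base R with `vapply`. `remove_words_base` is a hypothetical name used only for this example.

```r
remove_words_base <- function(str, words) {
  vapply(
    strsplit(str, " ", fixed = TRUE),  # one token vector per sentence
    function(tokens) paste(tokens[!tokens %in% words], collapse = " "),
    character(1)                       # each sentence yields one string
  )
}

out <- remove_words_base(
  c("Hello world", "This is another sentence", "Test sentence 3"),
  c("world", "sentence")
)
# [1] "Hello"           "This is another" "Test 3"
```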
