2

I need to remove all non-English words from a data frame that looks like this:

ID     text
1      they all went to the store bonkobuns and bought chicken
2      if we believe no exomunch standards are in order then we're ok
3      living among the calipodians seems reasonable  
4      given the state of all relimited editions we should be fine

I want to end with a data frame as such:

 ID     text
 1      they all went to the store and bought chicken
 2      if we believe no standards are in order then we're ok
 3      living among the seems reasonable  
 4      given the state of all editions we should be fine

I have a vector, word_vec, containing all English words.

I can remove all the words that are in a vector from a data frame using the tm package:

library(tm)

for(k in 1:nrow(frame)){
    for(i in 1:length(word_vec)){
        frame[k, "text"] <- removeWords(frame[k, "text"], word_vec[i])
    }
}

But I want to do the opposite: keep only the words found in the vector.

Cybernetic

3 Answers

4

Here's a simple way to do it:

txt <- "Hi this is an example"
words <- c("this", "is", "an", "example")
paste(intersect(strsplit(txt, "\\s")[[1]], words), collapse=" ")
[1] "this is an example"

Of course the devil is in the details, so you might need to tweak things a little to take apostrophes and other punctuation marks into account.
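
Applied to the question's data frame, a minimal row-wise sketch could look like the following (assuming the data frame is called df1 and the dictionary is word_vec, and using %in% instead of intersect() so that repeated words are preserved):

# split each row on whitespace, keep only the words found in word_vec, re-join
df1$text <- vapply(
  strsplit(df1$text, "\\s+"),
  function(w) paste(w[w %in% word_vec], collapse = " "),
  character(1)
)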

Dominic Comtois
2

You could try `gsub`:

 word_vec <- paste(c('bonkobuns ', 'exomunch ', 'calipodians ',
                     'relimited '), collapse="|")
 gsub(word_vec, '', df1$text)
 #[1] "they all went to the store and bought chicken"
 #[2] "if we believe no standards are in order then we're ok"
 #[3] "living among the seems reasonable"
 #[4] "given the state of all editions we should be fine"

Suppose instead you already have a word_vec containing the words you want to keep (the opposite of the vector above), for example

  word_vec <- c("among", "editions", "bought", "seems", "fine", 
  "state", "in", 
  "then", "reasonable", "ok", "standards", "store", "order", "should", 
  "and", "be", "to", "they", "are", "no", "living", "all", "if", 
  "we're", "went", "of", "given", "the", "chicken", "believe", 
  "we")


  word_vec2 <-  paste(gsub('^ +| +$', '', gsub(paste(word_vec, 
        collapse="|"), '', df1$text)), collapse= ' |')
  gsub(word_vec2, '', df1$text)
  #[1] "they all went to the store and bought chicken"        
  #[2] "if we believe no standards are in order then we're ok"
  #[3] "living among the seems reasonable"                    
  #[4] "given the state of all  editions we should be fine"  
akrun
0

All I can think of is the following procedure:

  1. For each row in your data frame, split the text into a vector of words on spaces with strsplit().
  2. For each element of that new vector, check whether it is any of the words in your word_vec using regexpr().
  3. If regexpr() returns -1 for a given position (see the regexpr examples), delete that position.
  4. Join the string back together and store it in a new vector.

Maybe it's worth looking at the which() function if you go down this road:

    which(c('a','b','c','d','e') == 'd')
    # [1] 4
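
Putting those steps together, a minimal sketch might look like this (assuming the question's df1 and word_vec, and that the words contain no regex special characters):

    # a sketch of the four steps above
    keep_known <- function(txt) {
      w <- strsplit(txt, " ")[[1]]               # 1. split the row on spaces
      # 2./3. regexpr() returns -1 when a word has no exact match in word_vec
      hit <- vapply(w, function(x) any(regexpr(paste0("^", x, "$"), word_vec) != -1), logical(1))
      paste(w[hit], collapse = " ")              # 4. join the kept words back together
    }
    df1$text <- unname(sapply(df1$text, keep_known))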
ca_san
  • This leaves me with an empty data frame. Note that there should be no reduction in the number of rows from the original data frame; the text within any field will just have anything non-English missing. – Cybernetic Mar 06 '15 at 02:31
  • Have you tried reversing your condition with a not operator? – ca_san Mar 06 '15 at 02:39
  • With the removeWords function? Yes, it isn't allowed. If there were a function that was the opposite of gsub, that would also work. – Cybernetic Mar 06 '15 at 02:46
  • @Cybernetic How about using `grep()` and passing word_vec as the pattern? – ca_san Mar 06 '15 at 03:02