How do I replace an exact set of words?

Question

I have a set of words that I would like to exclude from my analysis. For example,

trash <- c("de", "do", "das", ...., "da")  # the set can have n elements

I also have a data.frame named matc with two variables, V1 and V2, and I would like to replace every word in trash with nothing in both of them.

When I tried to do this using the following code:

for(k in 1:length(pr_us))
 {
   matc$V1<- gsub(pr_us[k],  "" , matc$V1 )
   matc$V2<- gsub(pr_us[k],  "" , matc$V2 )
 }

the replacement isn't exact: the words are also removed from inside longer words. For example, if matc$V1 is "Maria da Graça Madalena", the result is "Maria Graça Malena", but I would like to get "Maria Graça Madalena". I tried something like this:

for(k in 1:length(pr_us))
{
  matc$V1<- gsub( paste0(pr_us[k], "\bb") , "" , matc$V1 )
  matc$V2<- gsub( paste0(pr_us[k], "\bb") , "" , matc$V2 )
}

But this does not work either.

Is there a solution that avoids the loop, for example one using the apply functions?

http://stackoverflow.com/questions/22888646/making-gsub-only-replace-entire-words — Hack-R, Jun 10 '16 at 16:16
Are you doing text mining? The `tm` package has functions ( `removeWords()` in particular ) that make this easy. — Bryan Goggin, Jun 10 '16 at 16:36

score 1 · Answer 1 · answered Jun 10 '16 at 16:36

Since you are matching whole words, it is more reasonable to include the whitespace before and after the trash word in the pattern. For the specific example the OP gives, it can be:

gsub("\\s+da\\s+", " ", "Maria da Graça Madalena")
[1] "Maria Graça Madalena"

Psidom

A word boundary `\\b` would be more appropriate than a space in case there is punctuation or the word is the first or last in the string. – Gregor Thomas Jun 10 '16 at 16:48
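
Following up on the word-boundary suggestion, here is a minimal sketch of how the loop could be avoided. It assumes the trash words contain no regex metacharacters, and it uses a made-up `matc` and a shortened `trash` vector for illustration; `remove_words` is a hypothetical helper name, not something from the question.

```r
# Shortened, hypothetical stand-in for the asker's `trash` vector
trash <- c("de", "do", "das", "da")

# Toy data frame shaped like the asker's `matc` (values are made up)
matc <- data.frame(
  V1 = c("Maria da Graça Madalena", "da Silva"),
  V2 = c("João do Carmo", "Ana de Souza"),
  stringsAsFactors = FALSE
)

# One pattern for the whole set: \b(de|do|das|da)\b
pattern <- paste0("\\b(", paste(trash, collapse = "|"), ")\\b")

remove_words <- function(x) {
  x <- gsub(pattern, "", x, perl = TRUE)  # drop whole-word matches only
  x <- gsub("\\s+", " ", x, perl = TRUE)  # collapse the gaps left behind
  trimws(x)                               # strip leading/trailing spaces
}

# lapply() runs the helper over both columns, so no explicit loop is needed
matc[c("V1", "V2")] <- lapply(matc[c("V1", "V2")], remove_words)
```

Because `gsub()` is already vectorised over its input, the only "apply" step needed is the one over the columns. If the trash words might contain characters such as `.` or `(`, they would need escaping before being pasted into the pattern.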
