I have a set of words that I would like to exclude from my analysis. For example,

trash <- c("de", "do", "das", ...., "da")  # this set can have n elements

I also have a data.frame named matc with two variables, V1 and V2, in which I would like to replace each word in trash with nothing.

When I tried to do this using the following code:

for(k in 1:length(pr_us))
 {
   matc$V1<- gsub(pr_us[k],  "" , matc$V1 )
   matc$V2<- gsub(pr_us[k],  "" , matc$V2 )
 }

the replacement is not limited to whole words. For example, if matc$V1 is "Maria da Graça Madalena", the result is "Maria Graça Malena", whereas I would like "Maria Graça Madalena". I then tried something like this:

for(k in 1:length(pr_us))
{
  matc$V1<- gsub( paste0(pr_us[k], "\bb") , "" , matc$V1 )
  matc$V2<- gsub( paste0(pr_us[k], "\bb") , "" , matc$V2 )
}

But this does not work either.

Is there a solution that avoids the loop, perhaps one using the apply functions?

lmo
MAOC

1 Answer

Since you are matching whole words, it is more reasonable to require whitespace before and after the trash word. For the specific example the OP gives, it can be:

gsub("\\s+da\\s+", " ", "Maria da Graça Madalena")
[1] "Maria Graça Madalena"
Psidom
    A word boundary `\\b` would be more appropriate than a space in case there is punctuation or the word is the first or last in the string. – Gregor Thomas Jun 10 '16 at 16:48
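
Following up on the word-boundary suggestion, here is a minimal sketch of a loop-free version. Note that in an R string literal the boundary has to be written `\\b`; `"\b"` with a single backslash is the backspace character, which is likely why the `"\bb"` attempt in the question did not match. The `trash` vector and the `matc` rows below are hypothetical stand-ins (the full word list is elided in the question), and `remove_words` is a helper name made up for illustration. It uses the default regex engine; how accented letters are treated at a word boundary may depend on the engine and the locale, so test on your own data.

```r
# Hypothetical stand-ins for the question's objects (the real word list is elided there).
trash <- c("de", "do", "das", "da")
matc <- data.frame(
  V1 = c("Maria da Graça Madalena", "da Silva"),
  V2 = c("Pedro de Souza", "Ana das Neves"),
  stringsAsFactors = FALSE
)

# One alternation wrapped in word boundaries: \b(de|do|das|da)\b
pattern <- paste0("\\b(", paste(trash, collapse = "|"), ")\\b")

# Illustrative helper: remove whole-word matches, then tidy the whitespace.
remove_words <- function(x) {
  x <- gsub(pattern, "", x)   # drops "da" as a word, leaves "Madalena" intact
  x <- gsub("\\s+", " ", x)   # collapse the double spaces left behind
  trimws(x)                   # strip leading/trailing blanks
}

# gsub is vectorised over each column, so lapply replaces the explicit loop.
matc[c("V1", "V2")] <- lapply(matc[c("V1", "V2")], remove_words)

matc$V1[1]
# [1] "Maria Graça Madalena"
```

Building a single pattern means each column is scanned once instead of once per trash word, and the boundaries also handle a trash word at the start or end of the string, which the whitespace-only pattern would miss.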