6

I want to remove multiple patterns from multiple character vectors. Currently I am doing this:

a.vector <- gsub("@\\w+", "", a.vector)
a.vector <- gsub("http\\w+", "", a.vector)
a.vector <- gsub("[[:punct:]]", "", a.vector)

and so on.

This is painful. I looked at this question and its answers: "R: gsub, pattern = vector and replacement = vector", but they don't solve my problem.

Neither the mapply nor the mgsub approach works for me. I created these vectors:

remove <- c("@\\w+", "http\\w+", "[[:punct:]]")
substitute <- c("")

Neither mapply(gsub, remove, substitute, a.vector) nor mgsub(remove, substitute, a.vector) worked.

a.vector looks like this:

[4951] "@karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental"
[4952] "@stiphan: you are phenomenal.. #mental #Writing. httptxjwufmfg"

I want:

[4951] "Suicide amongst successful men is becoming rampant Kudos for staing the conversation #mental"
[4952] "you are phenomenal #mental #Writing"
vagabond

4 Answers

13

I know this answer is late, but it stems from my dislike of manually listing the removal patterns inside the grep-family functions (see the other solutions here). My idea is to define the patterns beforehand, keep them as a character vector, and paste them together only when needed, using the regex separator "|":

library(stringr)

remove <- c("@\\w+", "http\\w+", "[[:punct:]]")

a.vector <- str_remove_all(a.vector, paste(remove, collapse = "|"))

Yes, this does effectively do the same as some of the other answers here, but I think my solution allows you to retain the original "character removal vector" remove.
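
For anyone without stringr, the same idea can be sketched in base R. This is an assumed equivalent rather than part of the answer above: combined and cleaned are names introduced for illustration, the sample string is taken from the question, and the trimws() call is an extra step that strips the leftover leading and trailing spaces. Note that [[:punct:]] also matches "#", so the hashtag markers disappear along with the rest of the punctuation.

```r
# Patterns kept as a reusable character vector, as in the answer above
remove <- c("@\\w+", "http\\w+", "[[:punct:]]")

# Join the patterns with the regex alternation operator
combined <- paste(remove, collapse = "|")

# One sample element from the question
a.vector <- c("@stiphan: you are phenomenal.. #mental #Writing. httptxjwufmfg")

# gsub() is vectorized over its input; trimws() is an added cleanup step
cleaned <- trimws(gsub(combined, "", a.vector))
print(cleaned)
# [1] "you are phenomenal mental Writing"
```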

Marian Minar
5

Try combining your subpatterns using |. For example:

> s <- "@karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental"
> gsub("@\\w+|http\\w+|[[:punct:]]", "", s)
[1] " Suicide amongst successful men is becoming rampant Kudos for staing the conversation mental"

But this could become problematic if you have a large number of patterns, or if the result of applying one pattern creates matches to others.
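
One way patterns can interact is through the order in which they are applied. The following is a small sketch on a made-up string (s_demo, handles_first and punct_first are hypothetical names, not from the question): if punctuation is stripped first, the "@" is already gone by the time the handle pattern runs, so the handle text survives.

```r
# Made-up input for illustration only
s_demo <- "@user: hi."

# Handles removed first, then punctuation
handles_first <- gsub("[[:punct:]]", "", gsub("@\\w+", "", s_demo))

# Punctuation removed first: the "@" is gone, so "@\\w+" no longer matches
punct_first <- gsub("@\\w+", "", gsub("[[:punct:]]", "", s_demo))

print(handles_first)
# [1] " hi"
print(punct_first)
# [1] "user hi"
```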

Consider creating your remove vector as you suggested, then applying it in a loop:

> s1 <- s
> remove <- c("@\\w+", "http\\w+", "[[:punct:]]")
> for (p in remove) s1 <- gsub(p, "", s1)
> s1
[1] " Suicide amongst successful men is becoming rampant Kudos for staing the conversation mental"

Because gsub() is vectorized over its input, this loop already works unchanged on a whole character vector. To apply it to several columns of a table, put it into a function that returns the cleaned strings and pass that function to one of the apply variants.
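
A minimal sketch of such a function, assuming the patterns should be applied in order: remove_patterns is a hypothetical helper name, the tweets data frame is made up for illustration, and Reduce() is used here in place of the explicit for loop.

```r
# Hypothetical helper: apply each pattern in turn to a character vector
remove_patterns <- function(x, patterns) {
  # Reduce() feeds the running result (acc) and the next pattern (p) to gsub()
  Reduce(function(acc, p) gsub(p, "", acc), patterns, x)
}

remove <- c("@\\w+", "http\\w+", "[[:punct:]]")

# Made-up table with two text columns
tweets <- data.frame(
  text  = c("@a: hello.", "see httpabc now!"),
  reply = c("thanks @b!", "ok."),
  stringsAsFactors = FALSE
)

# lapply() visits every column; tweets[] keeps the data frame structure
tweets[] <- lapply(tweets, remove_patterns, patterns = remove)
print(tweets)
```

Removed matches leave their surrounding spaces behind, so a follow-up trimws() or whitespace-collapsing step may be wanted.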

kdopen
1

If the patterns you are looking for are fixed and don't change from case to case, consider combining all of them into a single regex pattern.

For the example you provided, you can try:

removePat <- "(@\\w+)|(http\\w+)|([[:punct:]])"

a.vector <- gsub(removePat, "", a.vector)
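
Note that [[:punct:]] also matches "#", while the output requested in the question keeps the hashtags. A possible variant, offered as a sketch under that assumption: keepHashPat is a hypothetical pattern that replaces [[:punct:]] with a negated class removing everything except letters, digits, whitespace and "#", and trimws() is an added step for the leftover outer spaces.

```r
# Sample data from the question
a.vector <- c(
  "@karakamen: Suicide amongst successful men is becoming rampant. Kudos for staing the conversation. #mental",
  "@stiphan: you are phenomenal.. #mental #Writing. httptxjwufmfg"
)

# Hypothetical variant: strip handles, http tokens, and any character
# that is not alphanumeric, whitespace, or "#"
keepHashPat <- "@\\w+|http\\w+|[^[:alnum:][:space:]#]"

cleaned <- trimws(gsub(keepHashPat, "", a.vector))
print(cleaned)
# [1] "Suicide amongst successful men is becoming rampant Kudos for staing the conversation #mental"
# [2] "you are phenomenal #mental #Writing"
```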
Chaos
-1

I had a vector containing the string "my final score" and wanted to keep only the word "final" and remove the rest. This is what worked for me, based on Marian's suggestion:

library(stringr)

str_remove_all("my final score", "my |score")

Note: "my final score" is just an example; I was actually working with a vector.

seakyourpeak
  • This answer would benefit from formatting code as code and loading required libraries so the example can be reproduced. – rempsyc Aug 10 '21 at 21:27