
I have a vector corpus in R and want to remove all email IDs that appear in it. The email IDs can be at any position in the text. For example:

1> "Could you mail me the Company policy amendments at xyz@gmail.com. Thank you." 

2> "Please send me an invoice copy at abcdef@yahoo.co.in. Looking forward to your reply". 

Here I want only the email IDs "xyz@gmail.com" and "abcdef@yahoo.co.in" to be removed from the corpus; the rest of the text should stay.

I have tried using:

corpus <- tm_map(corpus,removeWords,"\w*gmail.com\b")
corpus <- tm_map(corpus,removeWords,"\w*yahoo.co.in\b")
  • Using a regular expression to match email addresses is not as simple as it might look. Check this question and its answers for a long discussion and some examples: http://stackoverflow.com/questions/201323/using-a-regular-expression-to-validate-an-email-address – Molx Nov 30 '15 at 11:55

2 Answers

The code below uses a regex pattern to remove email IDs from a corpus. I got the regex from somewhere and can no longer recall the source; I would have loved to give credit.

# Sample data from which email ids need to be removed

text <- c("Could you mail me the Company policy amendments at xyz@gmail.com. Thank you.",
          "Please send me an invoice copy at abcdef@yahoo.co.in. Looking forward to your reply." )


library(stringr)
library(tm)

# Function containing the regex pattern that removes email IDs
RemoveEmail <- function(x) {
  str_replace_all(x, "[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9-.]+", "")
}

corpus <- Corpus(VectorSource(text)) # Corpus creation
corpus <- tm_map(corpus, content_transformer(RemoveEmail)) # Removing email IDs

#Printing the corpus
corpus[[1]]$content
corpus[[2]]$content
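
A minimal base R sketch of the same idea, for cases where a stringr dependency is not wanted. The helper name `remove_email_base` and the tightened pattern are assumptions for illustration, not part of the original answer. Because `.` is allowed in the last character class, the pattern above can also consume the full stop that ends the sentence; this variant only accepts a dot when it is followed by another domain label, so the sentence punctuation is left in place.

```r
# Hypothetical helper: strips email-like tokens using base R only.
# The pattern is a pragmatic approximation, not a full RFC 5322 matcher.
remove_email_base <- function(x) {
  pattern <- "[A-Za-z0-9_.+-]+@[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)+"
  gsub(pattern, "", x)
}

text <- c("Could you mail me the Company policy amendments at xyz@gmail.com. Thank you.",
          "Please send me an invoice copy at abcdef@yahoo.co.in. Looking forward to your reply.")

cleaned <- remove_email_base(text)
print(cleaned)

# With tm loaded, the helper could be applied to a corpus the same way:
# corpus <- tm_map(corpus, content_transformer(remove_email_base))
```

The address is removed but the space before it remains, so a follow-up whitespace cleanup (for example tm's `stripWhitespace`) may still be useful.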
amitkb3

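To remove all rows of a data frame that have an invalid email in a particular column, keep only the rows where that column matches the pattern. Note that grepl() performs the regex match; a comparison with != would only test the column against the pattern text literally:

DF <- subset(DF, grepl("^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\\.[a-zA-Z0-9.-]+$", Column))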
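
As a worked sketch of filtering rows by pattern match, the snippet below uses `grepl()` on a small made-up data frame (an equality test such as `==` or `!=` would compare against the pattern text literally rather than match it). The data, the column name `Email`, and the variable names are hypothetical, and the pattern is only a rough validity check.

```r
# Hypothetical sample data; the column name `Email` is an assumption.
DF <- data.frame(
  Name  = c("A", "B", "C"),
  Email = c("xyz@gmail.com", "not-an-email", "abcdef@yahoo.co.in"),
  stringsAsFactors = FALSE
)

# Anchored with ^ and $ so the whole cell must look like an address.
email_pattern <- "^[A-Za-z0-9_.+-]+@[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)+$"

# Keep only the rows whose Email column matches the pattern.
valid <- DF[grepl(email_pattern, DF$Email), ]
print(valid)
```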
Ruben Portz