I am trying to remove 'bad' email addresses from a csv. I have a column of emails, some of which look like "abd@no.com", "123@none.com", "@", or "a". The bad entries come in a wide range of formats, so I want to find and remove them all.

My initial idea is to look strictly at the end of the email string, the "@..." part, and also at the number of characters: if the email is only 1 or 2 characters long, it is not valid.

If I have a list of bad emails, I want to generate a new list of emails where the bad ones are replaced with NA.

Below is the code that I have so far, but it does not work: it looks for exact matches on the pattern, not just at the end of the string.

        email_clean <- function(email, invalid = NA)
        {
        email <- trimws(email)               # remove whitespace
        email[nchar(email) %in% c(1,2)] <- invalid
        bad_email <- c("\\@no.com", "\\@none.com","\\@email.com","\\@noemail.com")
        pattern = paste0("(?i)\\b",paste0(bad_email,collapse="\\b|\\b"),"\\b")
        emails <-gsub(pattern,"",sapply(csv_file$Email,as.character))
        email
        }

        Cleaned_Email <- email_clean(csv_file$Email)

Thank you for any help!!!

Maddie
    Why are you escaping `@`? Also, you would do better to escape the dot (for instance `no\\.com`). Keep in mind that the pattern `no.com` matches `no.com` but also `noRcom` or `no com`, because the dot matches any character in a regex. – nicola Jan 07 '16 at 13:58
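
To illustrate the comment above, here is a minimal sketch in base R. The test strings are made up for illustration and are not taken from the original data; it only shows how an unescaped dot differs from an escaped one when the pattern is anchored at the end of the string with `$`.

```r
# Hypothetical test strings, not from the original csv
x <- c("a@no.com", "a@noRcom", "a@no com")

# Unescaped dot: "." matches any single character
unescaped <- grepl("@no.com$", x)

# Escaped dot: "\\." matches only a literal "."
escaped <- grepl("@no\\.com$", x)

print(unescaped)
print(escaped)
```

With the unescaped pattern all three strings match, whereas the escaped pattern matches only the first.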

1 Answer

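Your function is pretty close. Just note a few tweaks: `gsub` now works on the function argument `email` rather than on `csv_file$Email`, its result is assigned back to `email` (not to a separate `emails`), the replacement is `invalid` instead of an empty string, and the result is passed through `unname`: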

        email_clean <- function(email, invalid = NA)
        {
                email <- trimws(email)               # remove whitespace
                email[nchar(email) %in% c(1,2)] <- invalid
                bad_email <- c("\\@no.com", "\\@none.com","\\@email.com","\\@noemail.com")
                pattern = paste0("(?i)\\b",paste0(bad_email,collapse="\\b|\\b"),"\\b")
                email <-gsub(pattern, invalid, sapply(email,as.character))
                unname(email)
        }

        emails <- c("pierre@gmail.com", "hi@none.com", "@", "a")
        email_clean(emails)
        # [1] "pierre@gmail.com" NA                 NA
        # [4] NA
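
As a possible variant (a sketch, not the code from the answer above), the domain check can be done with `grepl` and a pattern anchored at the end of the string, with the dots escaped as the comment on the question suggests. The name `email_clean2` and the `bad_domains` argument are hypothetical, introduced only for this illustration. It also assumes that empty and missing entries should be treated as invalid, which the original does not state.

```r
# Hypothetical variant; email_clean2 and bad_domains are illustrative names.
# bad_domains holds regex fragments, so literal dots are written as "\\.".
email_clean2 <- function(email, invalid = NA_character_,
                         bad_domains = c("no\\.com", "none\\.com",
                                         "email\\.com", "noemail\\.com")) {
  email <- trimws(as.character(email))   # coerce and remove whitespace

  # "@" followed by one of the bad domains, anchored at the end with "$"
  pattern <- paste0("@(", paste(bad_domains, collapse = "|"), ")$")

  # Assumption: missing values and strings of 2 characters or fewer are invalid
  bad <- is.na(email) |
    nchar(email) <= 2 |
    grepl(pattern, email, ignore.case = TRUE)

  email[bad] <- invalid   # replace the whole entry, not just the matched part
  email
}

emails <- c("pierre@gmail.com", "hi@none.com", "HI@NO.COM", "@", "a", "x@noRcom")
result <- email_clean2(emails)
print(result)
```

Here only "pierre@gmail.com" and "x@noRcom" are kept; the latter survives because the escaped dot no longer matches an arbitrary character.
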
Pierre L