I am trying to remove 'bad' email addresses from a CSV. I have a column of emails that contains values like "abd@no.com", "123@none.com", "@", or "a". The bad entries come in a wide range of formats, so I want to find and remove them all.
My initial idea is to look strictly at the end of the email string, the "@..." part, and also at the length of the string: if the email is only 1 or 2 characters long, it is not valid.
Given a list of emails that includes bad ones, I want to generate a new list in which the bad ones are replaced with NA.
Below is the code that I have so far, but it does not work: it looks for exact matches on the pattern, not just at the end of the string.
email_clean <- function(email, invalid = NA)
{
  email <- trimws(email) # remove whitespace
  email[nchar(email) %in% c(1, 2)] <- invalid
  bad_email <- c("\\@no.com", "\\@none.com", "\\@email.com", "\\@noemail.com")
  pattern <- paste0("(?i)\\b", paste0(bad_email, collapse = "\\b|\\b"), "\\b")
  emails <- gsub(pattern, "", sapply(csv_file$Email, as.character))
  email
}
Cleaned_Email <- email_clean(csv_file$Email)
Thank you for any help!!!
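One possible way to get the behaviour described in the question is sketched below. A few things in the posted function seem to get in the way: the result of gsub() is stored in `emails` but the function returns `email`, gsub() only deletes the matched text instead of marking the whole entry as invalid, the function reads `csv_file$Email` rather than its own argument, and `\\b` is a word boundary rather than an end-of-string anchor (that is what `$` does). The version here uses grepl() with a `$`-anchored pattern. The `bad_domains` argument, the `too_short` and `bad_domain` names, and the sample vector are assumptions made up for illustration, not part of the original code.

```r
# Sketch only: marks an address as invalid when it is at most 2 characters
# long or ends in one of the listed placeholder domains.
email_clean <- function(email, invalid = NA_character_,
                        bad_domains = c("@no\\.com", "@none\\.com",
                                        "@email\\.com", "@noemail\\.com")) {
  # as.character() copes with factor columns; trimws() strips whitespace
  email <- trimws(as.character(email))

  # "\\." matches a literal dot; "$" anchors the match to the end of the string
  pattern <- paste0("(", paste(bad_domains, collapse = "|"), ")$")

  # Guard against NA so the logical index never contains NA
  too_short  <- !is.na(email) & nchar(email) <= 2
  bad_domain <- grepl(pattern, email, ignore.case = TRUE)

  email[too_short | bad_domain] <- invalid
  email
}

# Made-up sample data based on the examples in the question
emails  <- c("abd@no.com", "123@None.com ", "@", "a", "jane@example.org", NA)
cleaned <- email_clean(emails)
print(cleaned)
```

With the data frame from the question this would be called the same way as before, e.g. email_clean(csv_file$Email). Because the pattern includes the "@", an address such as "bob@casino.com" is not caught by the "@no.com" entry.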