4

I have a variable a, created by calling readLines on a file that contains some emails. I have already kept only the rows with the @ symbol, and am now struggling to extract the emails. The text in my variable looks like this:

> dput(a[1:5])
c("buenas tardes. excelente. por favor a: Saolonm@hotmail.com", 
"26.leonard@gmail.com ", "Aprecio tu aporte , mi correo es jcdavola31@gmail.com , Muchas Gracias", 
"gracias andrescarnederes@headset.cl", "Me apunto, muchas gracias mi dirección luciana.chavela.ecuador@gmail.com me será de mucha utilidad. "
)

From this question on SO I got a starting point for extracting the emails (@Aaron Haurun's answer). Slightly modified (I added a [\w.] before the @ to handle emails with a . between names), it worked well on regex101.com. However, it fails when I port it to gsub:

> gsub("()(\\w[\\w.]+@[\\w.-]+|\\{(?:\\w+, *)+\\w+\\}@[\\w.-]+)()", 
       "\\2", 
       a[1:5], 
       perl = FALSE) ## It doesn't matter if I use perl = TRUE

[1] "buenas tardes. excelente. por favor a: Saolonm@hotmail.com"           "26.leonard@gmail.com "                                                                          
[3] "Aprecio tu aporte , mi correo es jcdavola31@gmail.com , Muchas Gracias"                           "gracias andrescarnederes@headset.cl"                                                                       
[5] "Me apunto, muchas gracias mi dirección luciana.chavela.ecuador@gmail.com me será de mucha utilidad. "

What am I doing wrong and how can I grab those emails? Thanks!

PavoDive
    Use stringr `str_extract` with something like `"\\S+@[^\\s@.]+\\.\\S+"`. There may be a lot of other email extraction regexps (just search SO) – Wiktor Stribiżew Jun 07 '16 at 13:57

4 Answers

6

We can try str_extract() from the stringr package:

library(stringr)
str_extract(a, "\\S*@\\S*")

[1] "Saolonm@hotmail.com"              
[2] "26.leonard@gmail.com"             
[3] "jcdavola31@gmail.com"             
[4] "andrescarnederes@headset.cl"      
[5] "luciana.chavela.ecuador@gmail.com"

where \\S* matches zero or more non-whitespace characters.
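One caveat worth illustrating: because \\S matches any non-whitespace character, a pattern like the one above also picks up punctuation attached to the address. The following is a minimal sketch, not a full email validator. The input strings and the `tight` pattern are assumptions made up for illustration, and base R is used so that it runs without extra packages.

```r
# Hypothetical inputs where the address is glued to punctuation
x <- c("write to foo.bar@example.com, thanks", "(baz@example.org)")

# The permissive pattern from the answer above
loose <- "\\S*@\\S*"
# An assumed stricter pattern: local part, @, domain, dot, alphabetic TLD
tight <- "[[:alnum:]._%+-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}"

# regexpr() finds the first match in each string;
# regmatches() returns the matched substrings
loose_hits <- regmatches(x, regexpr(loose, x))
tight_hits <- regmatches(x, regexpr(tight, x))

print(loose_hits)  # keeps the trailing comma and the parentheses
print(tight_hits)  # keeps only the address itself
```

For the sample data in the question both patterns should give the same result, since none of those addresses is directly followed by punctuation.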

Psidom
3

Adapting the pattern from the answer linked in your question:

library(stringr)
str_extract(a, '\\S+@\\S+|\\{(?:\\w+, *)+\\w+\\}@[\\w.-]+')
#[1] "Saolonm@hotmail.com"               "26.leonard@gmail.com"              "jcdavola31@gmail.com"              "andrescarnederes@headset.cl"      
#[5] "luciana.chavela.ecuador@gmail.com"
Sotos
2

We can use base R to do this:

unlist(regmatches(a, gregexpr("\\S+@\\S+", a)))
#[1] "Saolonm@hotmail.com"
#[2] "26.leonard@gmail.com"
#[3] "jcdavola31@gmail.com"
#[4] "andrescarnederes@headset.cl"
#[5] "luciana.chavela.ecuador@gmail.com"
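Note that unlist() flattens the result, so the link between each input row and its matches is lost when a row has no address or more than one. The sketch below, using made-up input strings, keeps the list returned by regmatches() and derives one value per row from it. The helper name `first` is only illustrative.

```r
# Hypothetical inputs: two addresses, none, and one
x <- c("a@b.com and c@d.org", "no address here", "e@f.net")

# gregexpr() finds every match per string;
# regmatches() returns a list with one character vector per input element
m <- regmatches(x, gregexpr("\\S+@\\S+", x))

# Number of addresses found in each row
print(lengths(m))

# First address per row, or NA when the row has none
first <- vapply(m, function(v) if (length(v)) v[1] else NA_character_,
                character(1))
print(first)
```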

Or, since the OP's post asks for a solution with gsub/sub:

sub("(.*\\s+|^)(\\S+@\\S+).*", "\\2", a)
#[1] "Saolonm@hotmail.com" 
#[2] "26.leonard@gmail.com" 
#[3] "jcdavola31@gmail.com"             
#[4] "andrescarnederes@headset.cl"  
#[5] "luciana.chavela.ecuador@gmail.com"
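This also suggests one likely reason the gsub() call in the question returns its input unchanged, at least with perl = TRUE: gsub() and sub() replace only the part of the string that the pattern matches, so replacing a matched email with a backreference to itself changes nothing. The pattern has to consume the surrounding text as well. A minimal sketch with a shortened copy of the question's data and a simplified pattern (both chosen here for illustration):

```r
a <- c("buenas tardes. excelente. por favor a: Saolonm@hotmail.com",
       "26.leonard@gmail.com ",
       "Aprecio tu aporte , mi correo es jcdavola31@gmail.com , Muchas Gracias")

# Only the email is matched and it is replaced with itself,
# so every string comes back exactly as it went in
unchanged <- gsub("(\\S+@\\S+)", "\\1", a)
print(identical(unchanged, a))

# Here the pattern spans the whole string, so everything
# except the captured email is dropped
emails <- sub("^.*?(\\S+@\\S+).*$", "\\1", a, perl = TRUE)
print(emails)
```

Like the sub() call above, this keeps one address per string. Strings without any address are returned unchanged, so filtering on @ first, as the question already does, still matters.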
akrun
0

Here is another approach to consider:

extract_Emails <- function(text)
{
  vector_Pattern_Extension <- c("ae", "ai", "app", "ar", "at", "au", "az",        
                                "bd", "be", "bg", "biz", "br", "by", "bz",        
                                "ca", "capital", "care", "cc", "ch", "cl", "club",      
                                "cn", "co", "coach", "com", "cz", "de", "digital",    
                                "dk", "edu", "email", "es", "eu", "exchange", "expert",    
                                "fi", "fr", "fund", "ga", "ge", "gr", "group",      
                                "hk", "hr", "hu", "id", "ie", "il", "in",        
                                "info", "investments", "io", "ir", "it", "is", "jp",        
                                "ke", "kr", "kz", "live", "lk", "lt", "ltd",        
                                "lv", "ma", "md", "me", "mk", "mx", "my",        
                                "net", "ng", "nl", "no", "np", "nz", "online",    
                                "org", "pe", "ph", "pk", "pl", "pro", "pt",        
                                "ro", "rs", "ru", "run", "sa", "se", "sg",        
                                "shop", "site", "sk", "store", "su", "tech", "th",        
                                "tips", "tn", "top", "tr", "trade", "tv", "tw",        
                                "ua", "uk", "us", "uz", "vip", "vn", "xyz",        
                                "za", "\U0440\u0444" )
  
  regex_Extension <- paste0(vector_Pattern_Extension, collapse = "|")
  
  vector_Pattern_Regex_Email <- c('(?<=^|\\s|\\(|;|:|\"|<|“|”)[A-Za-z0-9._%+-]+@[A-Za-z0-9\\.-]+\\.(?i)(',
                                  regex_Extension, ')(?![a-zàâçéèêëîïôûùüÿñæœ](?-i))')
  
  regex_Email <- paste0(vector_Pattern_Regex_Email, collapse = "")
  emails <- stringr::str_extract_all(text, regex_Email)[[1]]
  
  return(emails)
}

vec_Text <- c("buenas tardes. excelente. por favor a: Saolonm@hotmail.com", 
              "26.leonard@gmail.com ", "Aprecio tu aporte , mi correo es jcdavola31@gmail.com , Muchas Gracias", 
              "gracias andrescarnederes@headset.cl", "Me apunto, muchas gracias mi dirección luciana.chavela.ecuador@gmail.com me será de mucha utilidad. ")

for(i in vec_Text) print(extract_Emails(i))

[1] "Saolonm@hotmail.com"
[1] "26.leonard@gmail.com"
[1] "jcdavola31@gmail.com"
[1] "andrescarnederes@headset.cl"
[1] "luciana.chavela.ecuador@gmail.com"
Emmanuel Hamel