I work with gmailr and need to extract email addresses from angle brackets <> (sometimes several in one row), but when there are no brackets (e.g. name@mail.com) I need to keep those elements as they are.

Here is an example:

x2 <- c("John Smith <jsmith@company.ch>  <abrown@company.ch>","no-reply@cdon.com" ,
        "<rikke.hc@hotmail.com>")

I need output like:

[1] "jsmith@company.ch"       "abrown@company.ch"
[2] "no-reply@cdon.com"
[3] "rikke.hc@hotmail.com"

I tried this, intending to merge the two results:

library("qdapRegex")
y1 <- ex_between(x2, "<", ">", extract = FALSE)
y2 <- rm_between(x2, "<", ">", extract = TRUE )

A sample of the code I run on my data:

from <- sapply(msgs_meta, gm_from)
from[sapply(from, is.null)] <- NA
from1 <- rm_bracket(from)
from2 <- ex_bracket(from)

gmail_DK <- gmail_DK %>% 
  mutate(from = unlist(y1)) %>%
  mutate(from = unlist(y2))

But when I apply these functions to my data (only one day of emails) and unlist the result, I get:

Error in mutate(): ! Problem while computing cc = unlist(cc2). x cc must be size 103 or 1, not 104. Run rlang::last_error() to see where the error occurred.

I suppose that with data from more days the difference would be even bigger, so I would prefer not to go this way.

An answer in R is preferred, but if you know how to do it in, for example, PowerQuery, that would be great too.

Ben Bolker
Igniste

2 Answers

We can also use base R: split the strings at the whitespace that follows a > (strsplit with a lookbehind), then capture the substring between < and > with sub. In the replacement, \\1 is the backreference to the captured group, and [^>]+ matches one or more characters that are not a >. Elements without brackets do not match the pattern, so sub returns them unchanged.

sub(".*<([^>]+)>", "\\1",
    unlist(strsplit(x2, "(?<=>)\\s+", perl = TRUE)))
[1] "jsmith@company.ch"    "abrown@company.ch"
[3] "no-reply@cdon.com"    "rikke.hc@hotmail.com"
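
Note that unlist() flattens everything into one vector, so the result can be longer than the number of input rows, which is what the mutate() size error in the question points at. A minimal sketch, assuming the same x2, that keeps one result per input element instead; the helper name extract_addr is made up for illustration:

```r
x2 <- c("John Smith <jsmith@company.ch>  <abrown@company.ch>",
        "no-reply@cdon.com",
        "<rikke.hc@hotmail.com>")

## hypothetical helper: the same strsplit + sub steps, applied per element
extract_addr <- function(x) {
  parts <- strsplit(x, "(?<=>)\\s+", perl = TRUE)
  lapply(parts, function(p) sub(".*<([^>]+)>", "\\1", p))
}

## list with one entry per input string (entries may hold several addresses)
addr_list <- extract_addr(x2)

## optionally collapse each entry to a single string, one per row
addr_chr <- vapply(addr_list, paste, character(1), collapse = "; ")
```

Either addr_list (as a list column) or addr_chr has the same length as x2, so it should fit into a data frame with one row per message.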
akrun
  • It works well with "from", but "to" and "cc" can contain multiple emails, and it didn't work with more than 2 values in cc or from. For example, for `[1] "Company Danmark , Booking , Smith- Fair Cargo Now , John Surname ` " it returns only the last value, 'mail+john@company.dk' – Igniste Mar 14 '22 at 09:44
  • If you have other test cases, you need to show them to us (edit your question to include them) – Ben Bolker Mar 14 '22 at 12:58

Clunky but OK?

library(magrittr)  ## for %>%

(x2 
   ## split into single words/tokens
   %>% strsplit(" ")
   %>% unlist()
   ## find e-mail-like strings, with or without brackets
   %>% stringr::str_extract("<?[\\w-.]+@[\\w-.]+>?") 
   ## drop elements with no e-mail component
   %>% na.omit()   
   ## strip brackets
   %>% stringr::str_remove_all("[<>]")
)
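
If some fields are missing or hold several comma-separated addresses (as can happen with "to" and "cc"), a per-element variant may be easier to fit back into a data frame. This is a rough base R sketch rather than a tested drop-in: the input cc, the helper name extract_emails, and the pattern (which is far from a full e-mail grammar) are all assumptions for illustration.

```r
## made-up input: several addresses, a bare address, and a missing value
cc <- c("A <a.one@company.dk>, B <mail+b@company.dk>, c@company.dk",
        "no-reply@cdon.com",
        NA_character_)

## hypothetical helper; the pattern is a deliberate simplification
extract_emails <- function(x) {
  pattern <- "[[:alnum:]._%+-]+@[[:alnum:].-]+"
  ## start with a placeholder for every element so lengths always line up
  out <- rep(list(NA_character_), length(x))
  ok <- !is.na(x)
  ## extract every e-mail-like substring from the non-missing elements
  out[ok] <- regmatches(x[ok], gregexpr(pattern, x[ok]))
  ## elements with no match also get the NA placeholder
  out[lengths(out) == 0] <- list(NA_character_)
  out
}

res <- extract_emails(cc)
```

The result is a list with one entry per input element, so nothing is dropped the way na.omit() drops rows, and brackets never need stripping because they are not part of the match.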
Ben Bolker
  • I'm not sure what all of the legal elements in an e-mail string are: it's possible that `[^@]+` would be better for capturing the components before and after the `@` ... – Ben Bolker Mar 11 '22 at 15:24
  • Sorry, I'm a total newbie in regex. Do you mean stringr::str_extract("[\\w-.][^@]+@[^@][\\w-.]+>?")? If so, in my case it works the same. – Igniste Mar 14 '22 at 08:52
  • Your function works well with "from", but with "to" and "cc", where values can be null, na.omit() causes some problems – Igniste Mar 14 '22 at 08:58