I have a character vector (myVector) which contains several instances of email addresses scattered through a long string of semi-cleaned HTML stored in a single entry in the vector.

I know the relevant domain name ("@domain.com") and I want to extract each email address associated with that domain name (e.g. "help@domain.com") preceded by white space.

I have tried the following code, but it doesn't deliver the right substring indices:

gregexpr("\\s .+?@domain.com", myVector)

Any thoughts on (a) how I can fix the regular expression, and (b) whether there is a more elegant solution?

ajrwhite
  • Relevant http://stackoverflow.com/questions/24395382/r-code-removing-words-containing/24395558#24395558 – hwnd Dec 14 '15 at 05:12

3 Answers

I tried to replicate your question with a small example by creating a single string that has a few emails included in it.

> foo = "thing1@gmail.com some filler text to use an thing2@gmail.com example for this 
thing3@gmail.com question thing4@gmail.com that OP has has asked"

> strsplit(foo, " ")
[[1]]
 [1] "thing1@gmail.com"       "some"                   "filler"                
 [4] "text"                   "to"                     "use"                   
 [7] "an"                     "thing2@gmail.com"       "example"               
[10] "for"                    "this\nthing3@gmail.com" "question"              
[13] "thing4@gmail.com"       "that"                   "OP"                    
[16] "has"                    "has"                    "asked"

> strsplit(foo, " ")[[1]][grep("@gmail.com", strsplit(foo, " ")[[1]])]

[1] "thing1@gmail.com"       "thing2@gmail.com"       "this\nthing3@gmail.com"
[4] "thing4@gmail.com" 
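
Note that splitting on a single space leaves the newline attached to the third address ("this\nthing3@gmail.com"). A minimal sketch of one way around that, assuming the same `foo` string as above: split on runs of any whitespace instead, and pass `fixed = TRUE` so the dot in the pattern is treated literally. The names `tokens` and `emails` are only illustrative.

```r
foo <- "thing1@gmail.com some filler text to use an thing2@gmail.com example for this \nthing3@gmail.com question thing4@gmail.com that OP has has asked"

# Split on runs of any whitespace (spaces, tabs, newlines), not just a single space
tokens <- strsplit(foo, "\\s+")[[1]]

# fixed = TRUE matches the pattern as a literal string, so "." is not a wildcard
emails <- grep("@gmail.com", tokens, fixed = TRUE, value = TRUE)
emails
# [1] "thing1@gmail.com" "thing2@gmail.com" "thing3@gmail.com" "thing4@gmail.com"
```

Storing the split result once also avoids calling `strsplit()` twice.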
Nancy

Using grep and value = TRUE:

str1 <- "Long text with email addresses help@domain.com and info@domain.com throughout help@other.com"
str1 <- unlist(strsplit(str1, " ")) # split on spaces
grep("@domain.com", str1, value = TRUE)
#[1] "help@domain.com" "info@domain.com"
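
One caveat: in a regular expression the unescaped "." matches any character, and the pattern can match anywhere inside a token. A small sketch of a stricter variant, assuming the tokens carry no trailing punctuation (the `tokens` vector below is made up for illustration): escape the dot and anchor the pattern to the end of the token.

```r
# Hypothetical tokens, chosen to show what the loose pattern lets through
tokens <- c("help@domain.com", "info@domain.company", "odd@domainxcom", "help@other.com")

# Loose pattern: "." matches any character and the match may sit mid-token
loose <- grep("@domain.com", tokens, value = TRUE)

# Strict pattern: literal dot, and the token must end with the domain
strict <- grep("@domain\\.com$", tokens, value = TRUE)

loose
# [1] "help@domain.com"     "info@domain.company" "odd@domainxcom"
strict
# [1] "help@domain.com"
```

The end anchor would also reject a token such as "help@domain.com." at the end of a sentence, so strip punctuation first if that can occur.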
Pierre Lapointe

You want a whitespace character followed by one or more non-whitespace characters, so gregexpr("\\s\\S+@domain.com", myVector) should work (although each match includes the leading whitespace character).

As an alternative, take a look at the stringr package:

library(stringr)
str_extract_all(myVector, "\\s\\S+@domain.com")

Or use str_extract_all(myVector, "\\S+@domain.com"), which also returns addresses at the start of the string (and without the leading space).

Examples:

myVector <- "one@domain.com and two@domain.com and three@domain.com. What about:four@domain.com and five@domain.com"
gregexpr("\\s\\S+@domain.com", myVector)
# [[1]]
# [1] 19 38 61 87
# attr(,"match.length")
# [1] 15 17 22 16
# attr(,"useBytes")
# [1] TRUE

str_extract_all(myVector, "\\s\\S+@domain.com")
# [[1]]
# [1] " two@domain.com"        " three@domain.com"      " about:four@domain.com"
# [4] " five@domain.com"

str_extract_all(myVector, "\\S+@domain.com")
# [[1]]
# [1] "one@domain.com"        "two@domain.com"        "three@domain.com"
# [4] "about:four@domain.com" "five@domain.com"

(about:four is a corner case to think about.)
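
If you would rather stay in base R, gregexpr() only gives the match positions, but regmatches() can turn them into the substrings. A minimal sketch, assuming the same `myVector` as in the examples; the dot is escaped here so it matches a literal ".", and `pattern` and `emails` are just illustrative names.

```r
myVector <- "one@domain.com and two@domain.com and three@domain.com. What about:four@domain.com and five@domain.com"

# Escape the dot so it matches a literal "." rather than any character
pattern <- "\\S+@domain\\.com"

# gregexpr() finds the match positions; regmatches() extracts the substrings
emails <- regmatches(myVector, gregexpr(pattern, myVector))[[1]]
emails
# [1] "one@domain.com"        "two@domain.com"        "three@domain.com"
# [4] "about:four@domain.com" "five@domain.com"
```

This returns the same addresses as the second str_extract_all() call, including the about:four corner case.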

Marek
  • `regmatches(myVector, gregexpr("\\w+@domain.com", myVector))[[1]]` works for that case – rawr Dec 14 '15 at 07:28
  • @rawr `catch.me.if.you.can+nastyalias@domain.com` – Marek Dec 14 '15 at 09:36
  • [my comment was not meant to match and extract every valid email address](https://i.stack.imgur.com/SrUwP.png) – rawr Dec 14 '15 at 14:12
  • @rawr Nice graph. I was just giving a counter-example. It's hard to do this well, but for a quick analysis it's better to search widely and then filter the results. Using a space (and only a space) as the indicator of an email address gives you almost every address in the text. You can then narrow them down to the correct ones with a dedicated regexp. Your comment (a good one) won't catch the not-so-uncommon address with a dot, or the (less common) one with a plus or minus sign. No hard feelings. – Marek Dec 14 '15 at 20:10