R how to detect if a string contains an email address and extract the email address and the previous 5 words?

Question

R regular expression question: I have a data.frame of job titles and job descriptions, and I need to

1) check whether a job description contains an email address (it can end in .org, .edu, .gov, or .com), and

2) extract the email address and the 5 words that precede it.

The dataset can contain web URLs, which can also end in .edu, .com, etc., and it contains line breaks (returns). Basically, I was hoping to identify an email address as anything of the form [letters/numbers]@[letters/numbers] followed by .org, .edu, .gov, .com, or whatever else an email address can end in.

Here is a sample dataset:

    teststr = data.frame(job_title = c(1:8),
                 job_description = c('please send your resumes to adsf@dsf.com apply now!',
                                   'asdfa@asdf.com/adsf asdf',
                                   'visit us at sfds@adfa',
                                   'apply now',
                                   'follow us on @asdf.gov',
                                   'asdfa.gov',
                                   '.com',
                                   ''))



> teststr
  job_title                                     job_description
1         1 please send your resumes to adsf@dsf.com apply now!
2         2                            asdfa@asdf.com/adsf asdf
3         3                               visit us at sfds@adfa
4         4                                           apply now
5         5                              follow us on @asdf.gov
6         6                                           asdfa.gov
7         7                                                .com
8         8

I attempted (1), but got the wrong answer:

    grepl('(*@.+\\.com)|(*@\\S\\.gov)', teststr$job_description)

The correct result for (1) should be

      TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE

related: https://stackoverflow.com/questions/201323/how-to-validate-an-email-address-using-a-regular-expression — acylam, Oct 31 '18 at 19:10
This becomes a question of how specifically you want to match particular email address patterns, but this should work `grepl('(.+@.+\\.com)|(.+@.+\\.gov)', teststr$job_description)` — Mako212, Oct 31 '18 at 19:16
Starting each pattern with `.+` because any email address needs at least one character before `@` — Mako212, Oct 31 '18 at 19:16
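As a quick check of the suggestion in the comments, the sketch below runs it on the sample descriptions from the question. Because `.+` accepts any character, including a space, row 5 (`follow us on @asdf.gov`) also matches. The `strict` pattern is an assumption added for comparison, not something proposed in the thread: it requires at least one letter, digit, dot, underscore, or hyphen directly before the `@`.

```r
# Sample descriptions copied from the question
job_description <- c(
  "please send your resumes to adsf@dsf.com apply now!",
  "asdfa@asdf.com/adsf asdf",
  "visit us at sfds@adfa",
  "apply now",
  "follow us on @asdf.gov",
  "asdfa.gov",
  ".com",
  ""
)

# Pattern from the comment: `.+` matches any characters, spaces included
loose <- "(.+@.+\\.com)|(.+@.+\\.gov)"

# Assumed stricter variant (not from the thread): the character right
# before `@` must be a letter, digit, dot, underscore, or hyphen
strict <- "[[:alnum:]._-]+@[[:alnum:].-]+\\.(com|org|edu|gov)"

loose_hits  <- grepl(loose, job_description)
strict_hits <- grepl(strict, job_description)

print(loose_hits)   # row 5 is TRUE here
print(strict_hits)  # row 5 is FALSE here
```

The stricter variant does not anchor the end of the address, so it would still accept a longer top-level domain that merely starts with `com`, `org`, `edu`, or `gov`.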

score 0 · Answer 1 · answered Oct 31 '18 at 19:11


The following pattern should match most email address formats:

    ([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)

In an R string, each backslash has to be doubled. To capture the five preceding words, split your string on the pattern, split the part before the match again on whitespace, and keep up to the last five elements (up to six, counting the email address itself).
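One way to read the split-based approach is sketched below in base R. It locates the first match with `regexpr()` instead of literally splitting on the pattern, which avoids building a large intermediate structure. The helper name `email_with_context` and its `n_words` argument are hypothetical, and `perl = TRUE` is assumed so that the escaped characters inside the bracket expressions are read the way the pattern intends.

```r
# Pattern from this answer, with backslashes doubled for an R string
email_pattern <- paste0(
  "([a-zA-Z0-9_\\-\\.]+)@",
  "((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.)|(([a-zA-Z0-9\\-]+\\.)+))",
  "([a-zA-Z]{2,4}|[0-9]{1,3})(\\]?)"
)

# Hypothetical helper: returns the first email address in `x` together with
# up to `n_words` whitespace-separated words before it, or NA if none is found
email_with_context <- function(x, n_words = 5) {
  m <- regexpr(email_pattern, x, perl = TRUE)
  start <- as.integer(m)
  if (is.na(start) || start == -1L) return(NA_character_)
  email  <- regmatches(x, m)
  before <- substr(x, 1, start - 1L)
  # Splitting on any whitespace also handles line breaks in the text
  words <- strsplit(trimws(before), "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]
  paste(c(utils::tail(words, n_words), email), collapse = " ")
}

# Sample descriptions copied from the question
job_description <- c(
  "please send your resumes to adsf@dsf.com apply now!",
  "asdfa@asdf.com/adsf asdf",
  "visit us at sfds@adfa",
  "apply now",
  "follow us on @asdf.gov",
  "asdfa.gov",
  ".com",
  ""
)

result <- vapply(job_description, email_with_context, character(1),
                 USE.NAMES = FALSE)
print(result)
```

Only the first address in each description is returned; a description with several addresses would need `gregexpr()` and a loop over the matches.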

answered Oct 31 '18 at 19:11

Tim


This can all be done in one pattern with no splitting involved – emsimpson92 Oct 31 '18 at 19:17
@emsimpson92 how can I do it in one pattern with no splitting involved. Splitting is not very practical because the resulting data.frame would take up a lot of memory. – Amazonian Oct 31 '18 at 19:39
@Amazonian I provided an answer with an example of how it could be done. – emsimpson92 Oct 31 '18 at 20:09

score 0 · Accepted Answer · answered Oct 31 '18 at 19:16


This should work for you:

    (?:\w+ ){0,5}\w+@\w+\.(?:com|gov|edu|org)

Here is a demo

answered Oct 31 '18 at 19:16

emsimpson92


Any specific reason for the downvote? If you click the demo link it meets the specified criteria. – emsimpson92 Oct 31 '18 at 20:08
your demo was very helpful! Can you please explain '?:' does? – Amazonian Nov 01 '18 at 02:39
@Amazonian `(?:...)` is a non capturing group. It simply means that the contents of that group aren't saved as a group for later use. The reason I did this is so I can match words followed by a space 0 to 5 times. – emsimpson92 Nov 05 '18 at 18:23
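A minimal sketch of how this pattern might be applied in base R, covering both the TRUE/FALSE check and the extraction. Two things are assumptions here rather than part of the answer: the backslashes are doubled because the pattern lives in an R string, and `perl = TRUE` is passed so that the `(?:...)` syntax is handled by the PCRE engine.

```r
# Pattern from this answer, with backslashes doubled for an R string
pattern <- "(?:\\w+ ){0,5}\\w+@\\w+\\.(?:com|gov|edu|org)"

# Sample descriptions copied from the question
job_description <- c(
  "please send your resumes to adsf@dsf.com apply now!",
  "asdfa@asdf.com/adsf asdf",
  "visit us at sfds@adfa",
  "apply now",
  "follow us on @asdf.gov",
  "asdfa.gov",
  ".com",
  ""
)

# (1) Does the description contain an email address?
has_email <- grepl(pattern, job_description, perl = TRUE)

# (2) Extract the address plus up to five preceding words, NA where absent
m <- regexpr(pattern, job_description, perl = TRUE)
context <- rep(NA_character_, length(job_description))
context[has_email] <- regmatches(job_description, m)

print(has_email)
print(context)
```

Because the pattern uses a literal space between words, words separated by a line break are not picked up as preceding words. Replacing the space with `\\s+` is one possible adjustment, though that goes beyond what the answer proposes.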

jasbner · Answer 3 · 2018-10-31T20:16:11.260


Here's a stringr example to get the strings. If you just need TRUE/FALSE, you can use grepl.

    library(stringr)
    str_extract(teststr$job_description,"(\\w+ ){0,5}(\\w+)?@\\w+\\.(com|org|edu|gov)")
    # [1] "please send your resumes to adsf@dsf.com" "asdfa@asdf.com"
    # [3] NA                                         NA
    # [5] "follow us on @asdf.gov"                   NA
    # [7] NA                                         NA

    grepl("(\\w+ ){0,5}(\\w+)?@\\w+\\.(com|org|edu|gov)",teststr$job_description)
    # [1]  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE

edited Oct 31 '18 at 20:16

answered Oct 31 '18 at 19:21

jasbner


Keep in mind that this will match email addresses such as `email@emailxcom`. You want to be sure to escape your `.`, but this doesn't really seem any different from my answer. – emsimpson92 Oct 31 '18 at 20:13
ahh yes good catch i think it should be `\\.(com|org|edu|gov)")` – jasbner Oct 31 '18 at 20:15
@jasbner actually, "follow us on @asdf.gov" should return FALSE (I made a mistake in my question) because the character immediately preceding @ is a space and not a character. How should I revise this answer to make sure that the email address need to have a non-space character that precedes @? – Amazonian Nov 01 '18 at 02:20
Just remove the `?` after `(\\w+)`. The question mark indicates optional. – jasbner Nov 01 '18 at 13:08
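The effect of dropping the `?` can be sketched with base R's `grepl()`. The variable names are illustrative only, and the two test strings are taken from the question's sample data.

```r
# Pattern as posted in this answer: the word before `@` is optional
with_optional <- "(\\w+ ){0,5}(\\w+)?@\\w+\\.(com|org|edu|gov)"

# Pattern after removing `?`: at least one word character must precede `@`
without_optional <- "(\\w+ ){0,5}\\w+@\\w+\\.(com|org|edu|gov)"

x <- c("follow us on @asdf.gov",
       "please send your resumes to adsf@dsf.com apply now!")

before_fix <- grepl(with_optional, x)
after_fix  <- grepl(without_optional, x)

print(before_fix)  # the bare "@asdf.gov" handle is accepted
print(after_fix)   # only the real address is accepted
```

The same revised pattern can be passed to `str_extract()` in place of the original one.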
