R regular expression question: I have a data.frame of job titles and job descriptions, and I need to

1) check whether a job description contains an email address (can be .org, .edu, .gov, .com), and

2) extract the email address and the 5 words that precede it.

The dataset can contain web URLs, which can also end in .edu, .com, etc., and it also contains line breaks. Basically, I was hoping to identify an email address as anything of the form [letters/numbers]@[letters/numbers] followed by .org, .edu, .gov, .com, or whatever else an email address can end in.

Here is a sample dataset:

    teststr = data.frame(job_title = c(1:8),
                 job_description = c('please send your resumes to adsf@dsf.com apply now!',
                                   'asdfa@asdf.com/adsf asdf',
                                   'visit us at sfds@adfa',
                                   'apply now',
                                   'follow us on @asdf.gov',
                                   'asdfa.gov',
                                   '.com',
                                   ''))



    > teststr
      job_title                                     job_description
    1         1 please send your resumes to adsf@dsf.com apply now!
    2         2                            asdfa@asdf.com/adsf asdf
    3         3                               visit us at sfds@adfa
    4         4                                           apply now
    5         5                              follow us on @asdf.gov
    6         6                                           asdfa.gov
    7         7                                                .com
    8         8

I attempted (1), but got the wrong answer:

    grepl('(*@.+\\.com)|(*@\\S\\.gov)', teststr$job_description)

The correct result for (1) should be:

      TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
Amazonian
  • related: https://stackoverflow.com/questions/201323/how-to-validate-an-email-address-using-a-regular-expression – acylam Oct 31 '18 at 19:10
  • This becomes a question of how specifically you want to match particular email address patterns, but this should work `grepl('(.+@.+\\.com)|(.+@.+\\.gov)', teststr$job_description)` – Mako212 Oct 31 '18 at 19:16
  • Starting each pattern with `.+` because any email address needs at least one character before `@` – Mako212 Oct 31 '18 at 19:16

3 Answers

The following pattern should match most email address formats (in an R string literal, each backslash must be doubled):

    ([a-zA-Z0-9_\-\.]+)@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([a-zA-Z0-9\-]+\.)+))([a-zA-Z]{2,4}|[0-9]{1,3})(\]?)

To capture the five preceding words, split your string on the pattern, split the piece before the match on whitespace, and keep up to the last five words.
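The split-then-take-the-last-words idea could be sketched in base R roughly as follows. This is only an illustration: `email_re` is a deliberately simplified pattern (an assumption, not the full pattern above), and `email_with_context` is a hypothetical helper name.

```r
# Simplified email pattern (an assumption for this sketch, not the full pattern above)
email_re <- "[A-Za-z0-9._-]+@[A-Za-z0-9-]+(\\.[A-Za-z0-9-]+)*\\.[A-Za-z]{2,4}"

# Hypothetical helper: return the first email address in a single string
# together with up to n_words words that precede it, or NA if none is found
email_with_context <- function(x, n_words = 5) {
  m <- regexpr(email_re, x, perl = TRUE)
  if (m == -1) return(NA_character_)
  email  <- regmatches(x, m)
  before <- substr(x, 1, as.integer(m) - 1)        # text before the match
  words  <- strsplit(trimws(before), "\\s+")[[1]]  # split on whitespace
  words  <- words[nzchar(words)]                   # drop empty pieces
  paste(c(tail(words, n_words), email), collapse = " ")
}

email_with_context("please send your resumes to adsf@dsf.com apply now!")
# [1] "please send your resumes to adsf@dsf.com"
email_with_context("visit us at sfds@adfa")
# [1] NA
```

Because the split is on `\\s+`, words separated by line breaks are handled the same way as words separated by spaces.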

Tim

This should work for you:

    (?:\w+ ){0,5}\w+@\w+\.(?:com|gov|edu|org)

Here is a demo
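To try that pattern from R, the backslashes need to be doubled inside the string literal. The sketch below assumes base R with `perl = TRUE` and reuses a few of the sample strings from the question; the variable names are made up for illustration.

```r
# Pattern from this answer, with backslashes doubled for an R string
pattern <- "(?:\\w+ ){0,5}\\w+@\\w+\\.(?:com|gov|edu|org)"

# A few of the sample descriptions from the question
jobs <- c("please send your resumes to adsf@dsf.com apply now!",
          "asdfa@asdf.com/adsf asdf",
          "visit us at sfds@adfa",
          "follow us on @asdf.gov",
          "")

# (1) Does the description contain an email address?
has_email <- grepl(pattern, jobs, perl = TRUE)

# (2) Extract the address plus up to five preceding words (NA if no match)
m <- regexpr(pattern, jobs, perl = TRUE)
context <- rep(NA_character_, length(jobs))
context[m != -1] <- regmatches(jobs, m)

print(has_email)
# [1]  TRUE  TRUE FALSE FALSE FALSE
print(context)
```

Note that `\\w+ ` only allows a single space between words; since the descriptions may contain line breaks, replacing that space with `\\s+` may be worth trying.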

emsimpson92
  • Any specific reason for the downvote? If you click the demo link it meets the specified criteria. – emsimpson92 Oct 31 '18 at 20:08
  • Your demo was very helpful! Can you please explain what '?:' does? – Amazonian Nov 01 '18 at 02:39
  • @Amazonian `(?:...)` is a non capturing group. It simply means that the contents of that group aren't saved as a group for later use. The reason I did this is so I can match words followed by a space 0 to 5 times. – emsimpson92 Nov 05 '18 at 18:23
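
The difference between capturing and non-capturing groups can be seen with base R's `regexec()`, which returns the full match followed by one entry per capturing group. A small sketch, assuming `perl = TRUE` and a made-up input string:

```r
x <- "send to adsf@dsf.com"

# Same pattern twice: once with capturing groups, once with (?:...)
capturing     <- "(\\w+ ){0,5}\\w+@\\w+\\.(com|gov|edu|org)"
non_capturing <- "(?:\\w+ ){0,5}\\w+@\\w+\\.(?:com|gov|edu|org)"

# regexec() reports the full match first, then each capturing group
with_groups    <- regmatches(x, regexec(capturing, x, perl = TRUE))[[1]]
without_groups <- regmatches(x, regexec(non_capturing, x, perl = TRUE))[[1]]

print(with_groups)     # full match plus two captured groups
print(without_groups)  # full match only
```

Both patterns match the same text; only the saved groups differ.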

Here's a stringr example to get the strings. If you just need TRUE/FALSE, you can use grepl.

    library(stringr)
    str_extract(teststr$job_description,"(\\w+ ){0,5}(\\w+)?@\\w+\\.(com|org|edu|gov)")
    # [1] "please send your resumes to adsf@dsf.com" "asdfa@asdf.com"                          
    # [3] NA                                         NA                                        
    # [5] "follow us on @asdf.gov"                   NA                                        
    # [7] NA                                         NA 

    grepl("(\\w+ ){0,5}(\\w+)?@\\w+\\.(com|org|edu|gov)",teststr$job_description)
    # [1]  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE
jasbner
  • Keep in mind that this will match email addresses such as `email@emailxcom`. You want to be sure to escape your `.`, but this doesn't really seem any different from my answer. – emsimpson92 Oct 31 '18 at 20:13
  • ahh yes good catch i think it should be `\\.(com|org|edu|gov)")` – jasbner Oct 31 '18 at 20:15
  • @jasbner actually, "follow us on @asdf.gov" should return FALSE (I made a mistake in my question) because the character immediately preceding @ is a space rather than a letter or digit. How should I revise this answer so that the email address must have a non-space character before the @? – Amazonian Nov 01 '18 at 02:20
  • Just remove the `?` after `(\\w+)`. The question mark indicates optional. – jasbner Nov 01 '18 at 13:08
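
The effect of dropping the `?` can be checked side by side. This is a small sketch using two of the question's sample strings; the variable names are only for illustration.

```r
# Pattern from the answer (local part optional) and the revised version
# suggested in the comment (local part required)
optional_local <- "(\\w+ ){0,5}(\\w+)?@\\w+\\.(com|org|edu|gov)"
required_local <- "(\\w+ ){0,5}\\w+@\\w+\\.(com|org|edu|gov)"

x <- c("follow us on @asdf.gov",
       "please send your resumes to adsf@dsf.com apply now!")

res_optional <- grepl(optional_local, x)
res_required <- grepl(required_local, x)

print(res_optional)
# [1] TRUE TRUE
print(res_required)
# [1] FALSE  TRUE
```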