R regular expression question: I have a data.frame of job titles and job descriptions, and I need to:
1) check whether a job description contains an email address (it can end in .org, .edu, .gov, or .com), and
2) extract the email address and the 5 words that precede it.
The dataset can contain web URLs, which can end in .edu, .com, etc., and it also contains line breaks (returns). Basically, I was hoping to identify an email address as anything of the form [letters/numbers]@[letters/numbers] followed by .org, .edu, .gov, .com, or whatever else an email address can end in.
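As a rough illustration of that shape (a sketch only, not a full email validator): the pattern below assumes the part before the @ is made of letters, digits, dots, underscores, or hyphens, and that the ending is a dot followed by at least two letters. The name `email_pat` and the test strings are made up for illustration.

```r
# Hypothetical pattern: local part, "@", domain, then "." plus an alphabetic suffix.
# Assumption: any suffix of two or more letters counts (.com, .org, .edu, .gov, ...).
email_pat <- "[[:alnum:]._-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}"

# Made-up strings: a full address, no suffix, no local part, and a bare URL
examples <- c("adsf@dsf.com", "sfds@adfa", "@asdf.gov", "www.asdf.edu")

grepl(email_pat, examples)
# expected: TRUE FALSE FALSE FALSE
```

Requiring at least one character on each side of the @ is what keeps Twitter-style handles and plain URLs from matching.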
Here is a sample dataset:
teststr = data.frame(job_title = c(1:8),
                     job_description = c('please send your resumes to adsf@dsf.com apply now!',
                                         'asdfa@asdf.com/adsf asdf',
                                         'visit us at sfds@adfa',
                                         'apply now',
                                         'follow us on @asdf.gov',
                                         'asdfa.gov',
                                         '.com',
                                         ''))
> teststr
  job_title                                     job_description
1         1 please send your resumes to adsf@dsf.com apply now!
2         2                            asdfa@asdf.com/adsf asdf
3         3                               visit us at sfds@adfa
4         4                                           apply now
5         5                              follow us on @asdf.gov
6         6                                           asdfa.gov
7         7                                                .com
8         8
I attempted (1), but got the wrong answer:
grepl('(*@.+\\.com)|(*@\\S\\.gov)', teststr$job_description)
The correct result for (1) should be:
TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
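For reference, one possible way to approach both parts in base R is sketched below. It rests on several assumptions: an email ends in a dot plus at least two letters, only the first email in each description matters, and a "word" is any whitespace-delimited token. The names `desc`, `email_pat`, `has_email`, `email`, and `prev5` are hypothetical, and `desc` simply repeats the sample descriptions as a plain character vector.

```r
# Sample descriptions as a character vector (avoids factor columns)
desc <- c('please send your resumes to adsf@dsf.com apply now!',
          'asdfa@asdf.com/adsf asdf',
          'visit us at sfds@adfa',
          'apply now',
          'follow us on @asdf.gov',
          'asdfa.gov',
          '.com',
          '')

# Assumed email shape: local part, "@", domain, "." plus two or more letters
email_pat <- "[[:alnum:]._-]+@[[:alnum:].-]+\\.[[:alpha:]]{2,}"

# (1) Locate the first email in each description; -1 means no match
m <- regexpr(email_pat, desc)
start <- as.integer(m)
has_email <- start != -1L

# (2a) Extract the matched email, NA where there is none
email <- rep(NA_character_, length(desc))
email[has_email] <- regmatches(desc, m)

# (2b) Take the text before the email, split on whitespace (including
# line breaks), and keep at most the last five tokens
before <- substr(desc, 1, start - 1L)
words <- strsplit(trimws(before), "\\s+")
prev5 <- vapply(words, function(w) paste(tail(w, 5), collapse = " "), character(1))
prev5[!has_email] <- NA

has_email
# expected: TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
```

On the sample data this should give `email` as "adsf@dsf.com" and "asdfa@asdf.com" for the first two rows, with `prev5` equal to "please send your resumes to" for row 1 and an empty string for row 2, since nothing precedes that address. Finding the email first and then counting words backwards avoids packing the "five preceding words" logic into a single regular expression.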