I'm trying to find an email address in text, but only when it is preceded by the keyword 'email' on the same line.

preg_match_all("/(email<=)(\S+@\S+)/im", $input_lines, $output_array);

My input data is:

here is some text
that does not hi_there@welcome.co
but this email should be captured yes@well.com

So only the email on the 3rd line should be captured.

Theo
Kal
  • 1
    What keyword are you looking for? - there is no obvious reason in your question why line 3 would match but line 2 would not. – Theo Feb 23 '17 at 23:09
  • The word 'email' should exist ahead of the email address on the same line. updated question – Kal Feb 23 '17 at 23:11

1 Answer

The regex: /email.+?\b(\S+@\S+)/i

Working Example

In PHP:

preg_match_all("/email.+?\b(\S+@\S+)/i", $input_lines, $output_array);

$output_array[1] will now contain your email addresses.
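As a quick check, here is a minimal sketch that runs the pattern against the sample input from the question. The variable name $emails is only there for illustration; everything else comes from the question and the call above.

```php
<?php
// Sample input copied from the question, one line per "\n".
$input_lines = "here is some text\n"
    . "that does not hi_there@welcome.co\n"
    . "but this email should be captured yes@well.com";

// Group 1 captures the address that follows the keyword on the same line.
preg_match_all('/email.+?\b(\S+@\S+)/i', $input_lines, $output_array);

// $emails is an illustrative name, not part of the original answer.
$emails = $output_array[1];
print_r($emails); // only yes@well.com is captured; line 2 has no keyword
```

$output_array[0] holds the full matches (keyword through address), which is why the addresses themselves are read from index 1.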

I removed the m flag because it only changes how ^ and $ behave, and neither is used in this pattern.

The breakdown of this is as follows:

  • email matches the literal text "email"; the i after the final / makes the match case-insensitive
  • .+? matches any character other than a newline one or more times, but as few characters as possible (see Regex Laziness)
  • \b matches a word boundary, that is, the position between a word character and a non-word character (see Word boundaries)
  • ( starts a capturing group; because it is the first group, its matches end up in $output_array[1], and a second group's matches would be in $output_array[2]
    • \S+ matches one or more non-whitespace characters
    • @ matches the '@' character
    • \S+ matches one or more non-whitespace characters
  • ) closes the capturing group
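To see why the lazy .+? matters, here is a small sketch comparing it with a greedy .+ on a hyphenated address. The input line is made up for illustration; only the lazy pattern is the one from this answer.

```php
<?php
// Hypothetical input line with a hyphenated local part.
$line = 'email me at hello-world-there@domain.com';

// Greedy: .+ runs to the end of the line and backtracks only as far as
// needed, so the capture starts at the LAST usable word boundary.
preg_match('/email.+\b(\S+@\S+)/i', $line, $greedy);

// Lazy: .+? grows one character at a time, so the capture starts at the
// FIRST word boundary from which \S+@\S+ can match.
preg_match('/email.+?\b(\S+@\S+)/i', $line, $lazy);

echo $greedy[1], "\n"; // there@domain.com
echo $lazy[1], "\n";   // hello-world-there@domain.com
```

The hyphens count as non-word characters, so there is a word boundary before "world" and before "there" as well; the greedy version settles on the last one and drops the start of the address.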

We could start a huge debate over whether \S+@\S+ is the best way to match an email address. I think it's good enough for this purpose, but if you want to go down the rabbit hole, see here: Using a regular expression to validate an email address

Theo
  • Yes sorry, I have now corrected - I will edit further with an explanation of how this works – Theo Feb 23 '17 at 23:20
  • 1
    @Theo great answer. I've never used \b before, now I know how to use it. I visited your working demo link, and replaced `\b` with `\s` which yields the same result, but `\s` does so in 53 steps and `\b` uses 86 steps. This isn't criticism, just an observation. For this small sample, the time/step impact is unnoticeable, but perhaps if Kal is doing an enormous string, this nano-optimization may be valued. Just wanted to mention it. – mickmackusa Feb 23 '17 at 23:43
  • 1
    @mickmackusa - that's a very good point \s is probably more suitable here - I will edit my answer and credit you, but leave my version with \b in place - as I do like to teach people rather than just write their regex for them ;) – Theo Feb 23 '17 at 23:46
  • @mickmackusa - as I was writing the update I realised that `\s` also matches newlines so it would not be good here - I am surprised that regex101's implementation doesn't - A working comparison of the two in php can be seen here: http://sandbox.onlinephpfunctions.com/code/5d0a52705495f8ebd13a40b5d2d376b4b3c6a311 – Theo Feb 23 '17 at 23:57
  • @Theo The sample in the question makes no mention of capturing an email address that is on a separate line from the email keyword. In your new/different case, I agree \s will fail. – mickmackusa Feb 24 '17 at 00:06
  • @mickmackusa I realise they do not want to match email addresses on separate lines, this was to illustrate that `\s` breaks the fact that the regex will only match where keyword is on the same line as the email - if we can guarantee that the word email would only appear on lines that subsequently have an email address then `\s` would work, but if there is any chance that the keyword could appear within a line without an email - then `\s` could make it capture email addresses that do not have the keyword present. – Theo Feb 24 '17 at 00:10
  • @Theo ...rabbit holes abound in regex pattern creation. Every coder needs to decide for themselves when enough is enough. – mickmackusa Feb 24 '17 at 00:17
  • @Theo how would you deal with the case if the email is like `hello-world-there@domain.com`? – Kal Mar 01 '17 at 21:45
  • @Kal I have updated the answer to make it 'lazy' which fixes this issue – Theo Mar 02 '17 at 19:59
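
For anyone following the `\b` versus `\s` discussion in the comments, here is a rough sketch of the difference. It is not the code from the sandbox link; the input is made up, and the `\s` pattern is an assumption about what was tested (the answer's pattern with `\b` swapped for `\s`).

```php
<?php
// Hypothetical input: the keyword is on line 1 and the only address is on
// line 2, so nothing should be captured.
$text = "the word email appears here\nother@example.com";

// \b version: . cannot cross the newline and \S+ cannot start on it,
// so the match stays on the keyword's line and finds nothing.
preg_match_all('/email.+?\b(\S+@\S+)/i', $text, $with_b);

// \s version (assumed variant): \s itself can consume the newline,
// so the address on the next line gets captured.
preg_match_all('/email.+?\s(\S+@\S+)/i', $text, $with_s);

var_dump($with_b[1]); // empty array
var_dump($with_s[1]); // contains other@example.com
```

If the keyword is guaranteed to appear only on lines that also contain an address, both patterns behave the same, which matches what was observed on the question's sample.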