I want to use a regex in Ruby to capture plain-text email addresses but NOT email addresses wrapped in mailto link tags (like <a href="" class="" >a@b.com</a>). I tried source.gsub(/(?!<$)[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}/i), but it does not work.

user94559
shawn
  • Are you parsing HTML? Why not use an HTML parser and simply grab the text content of that element? – Mark Thomas Jun 07 '17 at 02:21
  • Are you aware that you [can't parse HTML with a regexp](https://stackoverflow.com/a/1732454/2483313), because HTML isn't a regular language? – spickermann Jun 07 '17 at 06:32

1 Answer

/(\w+@[\w.-]+|\{(?:\w+, *)+\w+\}@[\w.-])(?!<\/a>)*$/i

So something like source.gsub(/(\w+@[\w.-]+|\{(?:\w+, *)+\w+\}@[\w.-])(?!<\/a>)*$/i)

That's the regex; I took the liberty of using a different email pattern as well.

(?!<\/a>)*$ is meant to say "ignore the match if it ends in </a>". It might be more efficient, though, to just filter out any <a></a> tags first if you're expecting multiple email addresses per line or document.
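To illustrate the "filter out the <a></a> tags first" idea, here is a minimal sketch, not a tested production solution. The names EMAIL, ANCHOR, and plain_text_emails are made up for this example, the email pattern is the one from the question with the TLD length limit relaxed, and it assumes the anchors are well-formed and not nested. For real HTML, a proper parser is the safer route.

```ruby
# Email pattern borrowed from the question (TLD limit relaxed to 2+ letters).
EMAIL = /[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}/i

# Matches a whole <a ...>...</a> element. Assumes well-formed, non-nested anchors.
# \b keeps it from matching other tags that start with "a", such as <abbr>.
ANCHOR = /<a\b[^>]*>.*?<\/a>/im

# Hypothetical helper: drop every anchor element, then collect what is left.
def plain_text_emails(source)
  source.gsub(ANCHOR, " ").scan(EMAIL)
end

source = 'Write to plain@example.com or ' \
         '<a href="mailto:linked@example.com" class="x">linked@example.com</a>.'
p plain_text_emails(source)
# => ["plain@example.com"]
```

Removing the anchors before scanning also gets rid of the address inside the href attribute, which a lookahead on the closing tag alone would not handle.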

OneNeptune
  • Thanks for your answer. Unfortunately, the regex captures neither the email address that ends with a tag nor the plain-text one... – shawn Jun 07 '17 at 01:50