I have this regular expression to match link-like parts of text (http:// and similar):

([A-Za-z]{3,9}):\/\/([-;:&=\+\$,\w]+@{1})?([-A-Za-z0-9\.]+)+:?(\d+)?(\/[-\+~%\/\.\w]+)?\??([-\+=&;%@\.\w]+)?#?([\w]+)?

and I later convert the matches to hyperlinks in code. This works well.

However, an http:// URL can also appear inside an <img> tag:

<img src="http://www.nature.com/images/home_03/main_news_pic2013.02.19.jpg" alt="Pulpit rock" width="304" height="228">

So I have to modify the existing regex so that it does NOT match link-like text preceded by a quotation mark or an apostrophe. How can I avoid matching:

"http

I tried with [^"|']:

[^"|']([A-Za-z]{3,9}):\/\/ ..........

but it does not work.

Andy Lester
Branislav
  • Possible duplicate: http://stackoverflow.com/questions/4775840/regular-expressions-to-pull-links-from-html – Scott Feb 20 '13 at 15:30
  • Try a negative lookbehind: `(?<!["'])`. But a more reliable approach would be to parse the HTML and then process the text nodes only. After all, there might be all sorts of reasons why a URL is preceded by a quotation mark. – Felix Kling Feb 20 '13 at 15:30
  • This is a great example of why you **should not use regular expressions to parse HTML**. If you want to search only text and not tags for URLs, then use a proper HTML parser to give you only the text and ignore the tags. See http://htmlparsing.com/php.html for examples of how to properly parse HTML with PHP modules. – Andy Lester Feb 20 '13 at 15:35
  • I have the text as a string. That text may contain a part like `http://some.domen.com` without an <a> tag, because the user typed it that way. My task is to find those link-like parts of the text and convert them to real hyperlinks (just by adding an <a> tag). So I cannot use the DOM to locate them. Am I right? – Branislav Feb 20 '13 at 17:37

1 Answer

You need to use a negative lookbehind (i.e., "not preceded by"):

(?<!")http://…
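
As a rough sketch of how this could be applied to the generic-scheme pattern from the question: the question does not say which language is used, so Python is assumed here purely for illustration, and `URL_BODY`, `char_class`, `lookbehind` and `linkify` are hypothetical names with a deliberately simplified URL pattern. It also shows one likely reason the `[^"|']` attempt fails: a character class consumes a character, so the engine can take `h` as that character and `ttp` as the scheme. With a variable-length scheme such as `[A-Za-z]{3,9}`, the lookbehind appears to need a `\b` as well, otherwise the match can simply start one letter later.

```python
import re

# Simplified stand-in for the question's pattern (an assumption): a scheme,
# "://", then a run of characters that are not whitespace, quotes or brackets.
URL_BODY = r"""[A-Za-z]{3,9}://[^\s"'<>]+"""

# The attempt from the question: the class consumes one character, so
# "h" can be that character and "ttp" the scheme. "|" is literal here.
char_class = re.compile(r"""[^"|']""" + URL_BODY)

# Negative lookbehind plus word boundary: the match has to start at the
# first letter of the scheme, and that position must not follow a quote.
lookbehind = re.compile(r"""(?<!["'])\b""" + URL_BODY)

html = 'See http://example.com/page and <img src="http://example.com/a.jpg">'

def linkify(text):
    """Wrap every unquoted URL in an <a> tag (hypothetical helper)."""
    return lookbehind.sub(
        lambda m: '<a href="{0}">{0}</a>'.format(m.group(0)), text
    )

print(char_class.findall(html))   # still finds the URL inside src="..."
print(lookbehind.findall(html))   # only the bare URL
print(linkify(html))
```

Note that this only skips URLs directly preceded by a quote; as the comments point out, parsing the HTML and processing text nodes only is the more reliable route when the input may contain arbitrary markup.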
Richard