Regex to match http:// but not http:// with a quotation mark in front

Question

I have this regular expression to match link-like parts of text that start with http://:

([A-Za-z]{3,9}):\/\/([-;:&=\+\$,\w]+@{1})?([-A-Za-z0-9\.]+)+:?(\d+)?(\/[-\+~%\/\.\w]+)?\??([-\+=&;%@\.\w]+)?#?([\w]+)?

and later convert them to hyperlinks with some code. It works really well.

However, http:// text can also appear inside an <img> tag:

<img src="http://www.nature.com/images/home_03/main_news_pic2013.02.19.jpg" alt="Pulpit rock" width="304" height="228">

So I have to modify the existing regex so that it does NOT match link-like text that has a quotation mark or apostrophe in front of it. How do I NOT match:

"http

I tried with [^"|']:

[^"|']([A-Za-z]{3,9}):\/\/ ..........

but it does not work.

possible duplicate. http://stackoverflow.com/questions/4775840/regular-expressions-to-pull-links-from-html — Scott, Feb 20 '13 at 15:30
Try a negative lookbehind: `(?<!["'])`. But a more reliable approach would be to parse the HTML and then process the text nodes only. After all, there might be all sorts of reasons why a URL is preceded by a quotation mark. — Felix Kling, Feb 20 '13 at 15:30
This is a great example of why you **should not use regular expressions to parse HTML**. If you want to only search text and not tags for URLs, then you use a proper HTML parser to give you only the text and ignore the tags. See http://htmlparsing.com/php.html for examples of how to properly parse HTML with PHP modules. — Andy Lester, Feb 20 '13 at 15:35
I have the text as a string. That text may contain an `http://some.domen.com` part without an <a> tag, because the user typed it like that. My task is to search for those link-like parts of the text and convert them to real hyperlinks (just adding an <a> tag). So, I cannot use the DOM to locate them. Am I right? — Branislav, Feb 20 '13 at 17:37

score 2 · Accepted Answer · answered Feb 20 '13 at 15:30


You need to use a negative lookbehind (i.e. "not preceded by"):

(?<!")http://…
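As a rough illustration of the lookbehind idea, here is a minimal Python sketch. It is not the asker's code: the pattern is a simplified stand-in for the full expression in the question, and the names `URL_RE`, `linkify` and `sample` are made up for this example. It rejects both `"` and `'` in front of the match, and it adds `\b` because the scheme part is a generic `[A-Za-z]{3,9}`: without the word boundary, the engine could skip the `h` after a quote and still match `ttp://...`.

```python
import re

# Hypothetical, simplified URL pattern (an assumption, not the asker's full regex).
# (?<!["'])  : the match must not be preceded by a double or single quote
# \b         : the scheme must start at a word boundary, so the engine cannot
#              skip the "h" after a quote and match "ttp://..." instead
URL_RE = re.compile(r"""(?<!["'])\b[A-Za-z]{3,9}://[^\s"'<>]+""")


def linkify(text):
    """Wrap bare URLs in <a> tags; URLs right after a quote are left alone."""
    return URL_RE.sub(lambda m: '<a href="{0}">{0}</a>'.format(m.group(0)), text)


sample = 'See http://example.com/page and <img src="http://example.com/pic.jpg">'
print(linkify(sample))
```

In this sketch only the bare URL gets wrapped; the one inside the `src="..."` attribute is skipped. This still treats HTML as plain text, so the caveats raised in the comments about using a real parser apply.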


Richard
