Regex to match URL / URI except when contained in an img tag

Question

With credit to dfowler's excellent Jabbr project, I am borrowing its code to embed linked content from user posts. The code is from here and uses a regex to extract URLs for additional processing and embedding.

In my case, I run the user posts through a markdown processor first, before attempting this embed. If the user formats the markdown correctly, the markdown processor (MarkdownDeep) transforms any image markdown into a valid HTML img tag. That works great. However, running the embedded content providers afterwards makes the image appear twice: once from the markdown transform, and once more when the URL inside the img tag gets embedded.

So, I believe the solution to my problem lies in changing the regex to not match when the found URL is already contained within a valid img tag.

For ease of answering, the regex so far is:

(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'"".,<>?«»“”‘’]))

I think I want to use negative look-ahead like in this answer to exclude the img, but I'm too poor at regex syntax to implement it myself.

NOTE: I want it to still match image URLs if they just appear in the text. So a bare http://www.example.com/sites/default/files/DellComputer.jpg would match, and the same URL in a hyperlink such as <a href='http://www.example.com/sites/default/files/DellComputer.jpg'> would match, but <img src='http://www.example.com/sites/default/files/DellComputer.jpg'> would not.

Thanks for the help, I know some of you have savant-level regex talents, I just never could do them.

Is an image something with a certain extension, or do you want a binary check? – fotanus, May 03 '13 at 15:52
No binary checking. It matches URLs regardless of whether they are images or not, but excludes URLs if contained in an HTML img tag. – mlutter, May 03 '13 at 15:55
Process and remove the `img` tag, then match the rest as URLs. Doing too many things in one regex will just make it unnecessarily complicated to write, debug and maintain. – nhahtdh, May 03 '13 at 15:55
That is one huge regex. What would that match exactly? None of the URLs you supplied matches at least. – melwil, May 03 '13 at 15:58
@nhahtdh, that is probably the better workaround. I'll mess with that if the regex approach fails me. – mlutter, May 03 '13 at 16:02
@melwil See it in the context of the linked GitHub code: the URLs are extracted from arbitrary user text, and they do match using the C# processor. – mlutter, May 03 '13 at 16:03
@melwil They all match just fine. Perhaps the tool you are using to test doesn't understand `(?i)`? – femtoRgon, May 03 '13 at 16:07
Alright, I failed to observe it was C#. I added a C# tag to the question. – melwil, May 03 '13 at 16:13

femtoRgon · Answer 1 · 2013-05-03T16:10:35.373

1

For the simple approach, just prepend

(?<!img.*)

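to the beginning of your regex. It will match as it already does, but will reject a URL if the text img appears anywhere before it on the same line. This relies on the .NET regex engine, which allows a variable-length pattern such as .* inside a lookbehind. So, the entire regex: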

(?<!img.*)(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'"".,<>?«»“”‘’]))

Again, nothing has changed except for a few characters at the beginning.

If you need it to be smarter about where the img is located before the URL on the line, I would probably recommend using a tool other than regex.

edited May 03 '13 at 16:10

answered May 03 '13 at 16:04

femtoRgon


This excludes `Hey, check out this imgur link: link text`, which should match and perform the embed. I think I'm going to have to use your suggestion of another tool... probably @nhahtdh's suggestion to strip the valid img tags and then process... – mlutter May 03 '13 at 17:15
Yes, that was my meaning. If you require more intelligence, you should use another tool, probably an XML parser, rather than attempting to parse HTML with a regex ([obligatory link](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)) – femtoRgon May 03 '13 at 17:34
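
The strip-then-match approach suggested in the comments can be sketched roughly as follows. This is only an illustration, written in Python rather than the C# of the original project. The names `IMG_TAG`, `URL_PATTERN`, and `extract_embeddable_urls` are hypothetical, and the img-stripping pattern is an assumption that handles simple tags but is not an HTML parser (for example, it would break on an attribute value containing `>`). The URL pattern is the one from the question, assuming the doubled `""` there is C# verbatim-string escaping for a single double quote.

```python
import re

# Assumed pattern for a simple <img ...> tag; not a substitute for an HTML parser.
IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)

# URL pattern from the question. The doubled "" of the original is written as a
# single " here, on the assumption that it was C# verbatim-string escaping.
URL_PATTERN = re.compile(
    r'''(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))'''
)


def extract_embeddable_urls(html):
    """Return URLs found in html, ignoring any that sit inside an img tag."""
    # Step 1: replace every img tag with a space so its src URL is never seen.
    without_images = IMG_TAG.sub(" ", html)
    # Step 2: run the unchanged URL pattern over what is left.
    return [m.group(0) for m in URL_PATTERN.finditer(without_images)]


if __name__ == "__main__":
    post = (
        "Photo http://www.example.com/a.jpg "
        "<img src='http://www.example.com/b.jpg'> "
        "<a href='http://www.example.com/c.jpg'>link</a>"
    )
    print(extract_embeddable_urls(post))
```

Because the word img only counts when it opens a tag, plain text such as an imgur link is still matched. In .NET the same two steps would presumably map onto `Regex.Replace` followed by `Regex.Matches`, with the URL regex left unmodified.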
