With credit to dfowler's excellent JabbR project, I am borrowing its code to embed linked content from user posts. The code is from here and uses a regex to extract URLs for additional processing and embedding.
In my case, I run the user posts through a markdown processor first, before attempting this embed. The markdown processor (MarkdownDeep) will, if the user formats the markdown correctly, transform any image markdown into a valid HTML img tag. That works great. However, the embedded content providers then make the image appear twice: once from the markdown transform, and once more when the same URL is picked up and embedded afterwards.
So, I believe the solution to my problem lies in changing the regex to not match when the found URL is already contained within a valid img tag.
For ease of answering, the regex so far is:
(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'"".,<>?«»“”‘’]))
I think I want to use negative look-ahead like in this answer to exclude the img, but I'm too poor at regex syntax to implement it myself.
NOTE: I still want it to match image URLs that appear in plain text or in a hyperlink. So
http://www.example.com/sites/default/files/DellComputer.jpg
would match, and the URL in a hyperlink
<a href='http://www.example.com/sites/default/files/DellComputer.jpg'>
would match, but the URL in
<img src='http://www.example.com/sites/default/files/DellComputer.jpg'>
would not.
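To make the expected behaviour concrete, here is a rough Python sketch of the three cases above. It is not the JabbR code and it does not use a look-around: it puts an alternative in front of the URL pattern that consumes a whole img tag first, which I am assuming is an acceptable substitute. The names URL_PATTERN, SKIP_IMG and extract_urls are made up for illustration, and I am assuming the doubled "" in my regex is just C# verbatim-string escaping for a single ".

```python
import re

# The URL pattern from the question, minus the leading (?i); case-insensitivity
# is passed as a flag instead. The doubled "" is written here as a single "
# (assumption: it was C# verbatim-string escaping).
URL_PATTERN = (
    r"""\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)"""
    r"""(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+"""
    r"""(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))"""
)

# Hypothetical wrapper: the first alternative swallows an entire <img ...> tag,
# so the URL pattern never gets to look inside it.
SKIP_IMG = re.compile(r"<img\b[^>]*>|" + URL_PATTERN, re.IGNORECASE)


def extract_urls(html):
    """Return the URLs found in html, ignoring anything inside an <img> tag."""
    # Group 1 is the outer capture of the URL pattern; it is None whenever
    # the match was an <img> tag rather than a URL.
    return [m.group(1) for m in SKIP_IMG.finditer(html) if m.group(1)]


url = "http://www.example.com/sites/default/files/DellComputer.jpg"
print(extract_urls("see " + url + " here"))    # plain text
print(extract_urls("<a href='" + url + "'>"))  # hyperlink
print(extract_urls("<img src='" + url + "'>")) # img tag
```

The first two calls should return the URL and the third should return an empty list. The same alternation idea should carry over to a .NET Regex, where the check would be whether group 1 succeeded.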
Thanks for the help. I know some of you have savant-level regex talents; I just never could get the hang of them.