Extracting image paths

Question

I need to extract all image paths from an HTML document, not just from <img> tags but from anywhere, including relative paths. I tried this regex:

([a-z\-_0-9\/\:\.]*\.(jpg|jpeg|png|gif))

...but it fails when it encounters special characters, as in this case, for example.

How do I grab the path so that it starts from either ' (single quote), " (double quote) or /, contains no spaces, and ends with an image extension (jpg|jpeg|png|gif)?

Edit: I use a DOM parser where possible, but I must use regex here to extract paths from just about everywhere, including inline CSS and JS.

You should *never* parse HTML with regex. Use [a PHP DOM parser](http://simplehtmldom.sourceforge.net/) instead. — Jay Blanchard, Dec 19 '16 at 21:01
Thanks for the suggestion, but I do understand that and must use regex. I'm mining data, not parsing. — eozzy, Dec 19 '16 at 21:03
You could exploit lookbehind and try something like `((?<='|")[^'"\s]*\.(jpg|jpeg|png|gif))`, which accepts anything directly after a quotation mark (single or double), contains only not-{whitespace,quotation mark} and ends with one of your extensions. — , Dec 19 '16 at 21:05
I don't understand why this was closed? Just too many overzealous mods now at SO! I'm **mining** data, not parsing HTML, and regex is pretty much what I can and should use! — eozzy, Dec 19 '16 at 21:06
@Dagon Alright so tell me how do you extract image path from this with an HTML parser: `` — eozzy, Dec 19 '16 at 21:08
@JayBlanchard - While there is quite a bit of hatred for RegEx for parsing (mining is parsing as well), it has its uses when the source is any of the following: malformed HTML; not HTML; HTML in an unknown structure where the XPath to the specified elements is unknown; or a page from which you wish to extract data after all jQuery etc. has executed via a random timer once the page has completely loaded. — Kraang Prime, Dec 19 '16 at 21:09
@3zzy - I would vote to reopen, but that's not even an option for me :S . I understand that what you want help with is tuning your regex pattern, but in recent years it seems that RegEx has become a swear word in parsing because "DOM" is better ... "DOM" is faster. DOM has its uses, but is overrated. RegEx is a parsing language; DOM is a RENDERING language. It's like waiting for an MP3 to load before reading the header for metadata. — Kraang Prime, Dec 19 '16 at 21:12
@SamuelJackson Precisely, and I do use DOM parser for the rest of the stuff but RegEx is the only option here. — eozzy, Dec 19 '16 at 21:13
I am aware of that @SamuelJackson and there is no hatred from me on the subject. The OP did not make his intentions clear from the outset and even the assertion of data mining doesn't clarify what the attempt is here. Given what was asked we could never have known if the OP was dealing with the things you listed and therefore he should've included the information in his comment above concerning the JavaScript function. Had he done that I would've never posted the comment and it is possible that CamilStaps could've posted an answer and earned the rep from it. — Jay Blanchard, Dec 19 '16 at 21:30
@JayBlanchard fair enough. The remark about hatred for regex wasn't specific to you, just something I have noticed a trend in - many of those are on the bandwagon for the joy-ride pitchforks and all. I do very little DOM work as most things I work with have a direct API, or are too random to have any long term benefit from coding the parser through DOM. OP has updated their question. — Kraang Prime, Dec 19 '16 at 21:36
I have reopened the question. @CamilStaps you should post your comment as an answer. — Jay Blanchard, Dec 19 '16 at 21:38

score 2 · Accepted Answer · answered Dec 19 '16 at 21:54

You could use lookbehind:

(?<=['"])[^'"\s]*\.(jpg|jpeg|png|gif)

This matches any URL that does not contain quotation marks or whitespace and is preceded by a quotation mark.

The (minor) advantage of using lookbehind over matching the quotation mark as well is that you can use the entire match directly and don't have to strip off the quotation mark in postprocessing. Not all regex libraries support lookbehind, because it is more complex to implement; in this case, however, it is not slower than the alternative.
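
As a rough illustration of the lookbehind approach, here is a minimal sketch in Python. The thread concerns PHP, but Python's `re` module accepts the same lookbehind syntax for this pattern. The sample string and the helper name `extract_image_paths` are made up for the example, and the extension group is made non-capturing (an assumption, not part of the answer) so that only the full match is used.

```python
import re

# Pattern from the answer above; the extension group is made non-capturing
# here (an assumption for this sketch) so that only the full match matters.
IMAGE_PATH = re.compile(r"""(?<=['"])[^'"\s]*\.(?:jpg|jpeg|png|gif)""")

def extract_image_paths(text):
    """Return every path that directly follows a quote and ends with an image extension."""
    # group(0) is the whole match; the lookbehind keeps the quote out of it.
    return [m.group(0) for m in IMAGE_PATH.finditer(text)]

# Hypothetical input mixing an HTML attribute, inline CSS and inline JS.
sample = '''<img src="img/a-b_1.png"> <div style="background:url('/x/y%20z.jpg')"></div> <script>load("notes.txt", 'pic~2.gif');</script>'''

print(extract_image_paths(sample))
# → ['img/a-b_1.png', '/x/y%20z.jpg', 'pic~2.gif']
```

Because the quotation mark is only asserted, not consumed, the matches need no trimming afterwards. Paths that are not preceded by a quotation mark, such as an unquoted `url(/img/bg.png)`, are not found by this pattern.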

Your explanation confuses me a little--depending on how you structure your capture group, you certainly won't have to handle the initial character in post-processing. Or are you saying your solution doesn't require using a capture group? – Nathan Arthur Dec 19 '16 at 21:57
@NathanArthur PHP's library will return the entire match and also matching groups, so in this case it will return an array of two elements: the entire string (without quotation marks) and the extension. Your solution will return an array of three elements: the whole string with quotation mark, the whole string without quotation mark, and the extension. – Dec 19 '16 at 21:58

Nathan Arthur · Answer 2 · 2016-12-19T21:59:55.500

1

This works on your test data:

['"\/]([^\s'"]+?\.(jpg|jpeg|png|gif))

It starts by requiring a single quote, double quote or forward slash, and then captures everything but white space, single quotes, and double quotes, up to the nearest image extension. Matches are stored in your first capture group (often $1).

This solution has the advantage (or perhaps disadvantage) of not requiring lookbehinds.
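
For comparison, a minimal sketch of the capture-group variant in Python (chosen only for illustration; the thread itself is about PHP). The input strings and the helper name `extract_with_group` are hypothetical.

```python
import re

# Pattern from this answer: the leading quote or slash is consumed by the
# match, and the path itself lands in capture group 1.
IMAGE_PATH = re.compile(r"""['"/]([^\s'"]+?\.(jpg|jpeg|png|gif))""")

def extract_with_group(text):
    """Return capture group 1 of every match, i.e. the path without the leading character."""
    return [m.group(1) for m in IMAGE_PATH.finditer(text)]

# Hypothetical quoted attributes.
quoted = '''<a href="pics/cat.jpeg">x</a> <b data-src='a/b.gif'></b>'''
print(extract_with_group(quoted))    # → ['pics/cat.jpeg', 'a/b.gif']

# Hypothetical unquoted CSS value: the match starts at the first slash.
unquoted = "background:url(/img/bg.png)"
print(extract_with_group(unquoted))  # → ['img/bg.png']
```

Note that when a match starts at a forward slash, that slash is consumed along with the quote alternatives, so it does not appear in group 1; a root-relative path therefore loses its leading `/` in the result.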

edited Dec 19 '16 at 21:59

answered Dec 19 '16 at 21:50


Why did you remove the dot before the extension and change the quantifier to `+?`? – Dec 19 '16 at 21:57
@CamilStaps I didn't intend to remove the dot. I built the pattern from scratch. I'll edit my answer. I feel that using lazy matching is safer, and that OP probably doesn't want to match empty URLs. – Nathan Arthur Dec 19 '16 at 21:59
Considering that in the sample data the string is also ended with a quotation mark, the quantifier won't really make a difference. I was just interested, if you built it from scratch I understand. – Dec 19 '16 at 22:00
