
I'm using lxml in Python to parse some HTML, and I want to extract all links to images. The way I do it right now is:

//a[contains(@href,'.jpg') or contains(@href,'.jpeg') or ... (etc)]

There are a couple of problems with this approach:

  • you have to list every possible image extension in every case variant (both "jpg" and "JPG"), which is not elegant
  • in weird situations, the href may contain .jpg somewhere in the middle of the string, not at the end

I wanted to use regexp, but I failed:

//a[regx:match(@href,'.*\.(?:png|jpg|jpeg)')]

This returned all links every time ...

Does anyone know the right, elegant way to do this, or what is wrong with my regex approach?

Nicu Surdu
  • Good question, +1. See my answer for a solution to one of your problems -- finding @href that only ends with a given string. – Dimitre Novatchev Dec 01 '10 at 21:46
  • In addition to the other answers describing substrings, you can use the translate function for crude case-conversion. translate(@href, "EGIJFNP", "egijfnp") (all the characters within png, jpeg, gif). – yonran Dec 02 '10 at 01:53
  • @yonran I don't know if this is such a good idea, because it will also alter the rest of the URL, not just the extension, and I don't want that – Nicu Surdu Dec 02 '10 at 12:07

5 Answers


lxml supports regular expressions via the EXSLT extensions:

from lxml import html

# download & parse web page
doc = html.parse('http://apod.nasa.gov/apod/astropix.html')

# find the first <a href that ends with .png or .jpg or .jpeg ignoring case
ns = {'re': "http://exslt.org/regular-expressions"}
img_url = doc.xpath(r"//a[re:test(@href, '\.(?:png|jpg|jpeg)$', 'i')]/@href",
                    namespaces=ns, smart_strings=False)[0]
print(img_url)
jfs

Use XPath to return all <a> elements, then use a Python list comprehension to filter down to those whose href matches your regex.
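A minimal sketch of that approach, assuming lxml is installed (the HTML string and the extension list here are illustrative):

```python
import re
from lxml import html

# Illustrative HTML; in practice this would come from html.parse(url)
doc = html.fromstring("""
<html><body>
  <a href="photo.JPG">one</a>
  <a href="page.html">two</a>
  <a href="img/pic.jpeg">three</a>
</body></html>
""")

# Let XPath collect every href, then filter with a Python regex
image_re = re.compile(r'\.(?:png|jpe?g|gif)$', re.IGNORECASE)
image_links = [href for href in doc.xpath('//a/@href')
               if image_re.search(href)]
print(image_links)  # ['photo.JPG', 'img/pic.jpeg']
```

This keeps the XPath trivial and moves the matching into Python, where case-insensitivity and end-of-string anchoring are easy.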

Marcelo Cantos

Instead of:

a[contains(@href,'.jpg')]

Use:

a[substring(@href, string-length(@href)-3)='.jpg']

(and the same expression pattern for the other possible endings).

The above expression is the XPath 1.0 equivalent to the following XPath 2.0 expression:

a[ends-with(@href, '.jpg')]
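A quick check of the XPath 1.0 expression from lxml; the sample hrefs are invented for illustration:

```python
from lxml import html

# Two invented sample links: only the first truly ends with ".jpg"
doc = html.fromstring(
    '<div><a href="cat.jpg">yes</a><a href="cat.jpg.html">no</a></div>')

# substring(@href, string-length(@href)-3) yields the last 4 characters,
# so comparing against '.jpg' emulates ends-with() in XPath 1.0
expr = "//a[substring(@href, string-length(@href)-3)='.jpg']/@href"
result = doc.xpath(expr)
print(result)  # ['cat.jpg']
```

Note that the comparison is case-sensitive, so "cat.JPG" would not match without additional handling.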
Dimitre Novatchev

Because there's no guarantee that the link has a file extension at all, or that the extension matches the content (a .jpg URL returning an HTML error page, for example), your options are limited.

The only correct way to gather all images from a site is to take every link and query it with an HTTP HEAD request to find out what Content-Type the server sends for it. If the content type is image/(anything), it's an image; otherwise it's not.
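A sketch of that check using only the standard library; the helper names are mine, and in practice the URL would come from the scraped hrefs:

```python
from urllib.request import Request, urlopen

def is_image_content_type(content_type):
    """True for any image/* media type, ignoring parameters like charset."""
    return content_type.split(';')[0].strip().lower().startswith('image/')

def link_is_image(url, timeout=10):
    """Issue a HEAD request and inspect the Content-Type header."""
    req = Request(url, method='HEAD')
    with urlopen(req, timeout=timeout) as resp:
        return is_image_content_type(resp.headers.get('Content-Type', ''))

print(is_image_content_type('image/jpeg'))              # True
print(is_image_content_type('text/html; charset=utf-8'))  # False
```

Be aware that some servers reject HEAD requests or lie in their headers, so this is slower and still not perfectly reliable.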

Scraping the URLs for common file extensions will probably get you 99.9% of the images, though. It's not elegant, but neither is most HTML. I recommend settling for 99.9% in this case; the extra 0.1% isn't worth the effort.

cecilkorik

Use:

//a[@href[contains('|png|jpg|jpeg|',
                   concat('|',
                          substring-after(substring(.,string-length()-4),'.'),
                          '|'))]]
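A quick check of this expression from lxml, using an invented sample document:

```python
from lxml import html

# Invented sample document for illustration
doc = html.fromstring(
    '<p><a href="a.png">1</a><a href="b.jpeg">2</a><a href="c.html">3</a></p>')

# Take the last 5 characters of @href, keep what follows the '.', and look
# that extension up in the pipe-delimited whitelist
expr = ("//a[@href[contains('|png|jpg|jpeg|', concat('|', "
        "substring-after(substring(., string-length()-4), '.'), '|'))]]/@href")
result = doc.xpath(expr)
print(result)  # ['a.png', 'b.jpeg']
```

Note the limitations: the comparison is case-sensitive, and the substring trick only works for extensions of three or four characters.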