I'm using lxml in Python to parse some HTML and I want to extract all links to images. This is what I do right now:
//a[contains(@href,'.jpg') or contains(@href,'.jpeg') or ... (etc)]
There are a couple of problems with this approach:
- you have to list every possible image extension in every case variant (both "jpg" and "JPG"), which is not elegant
- in odd cases, the href may contain .jpg somewhere in the middle rather than at the end of the string
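To make both problems concrete, here is a minimal sketch that reproduces them. The sample markup and the variable names are made up for illustration, and the extension list is shortened to two entries:

```python
from lxml import html

# Hypothetical sample markup, invented only to show the two problems
SAMPLE = """
<html><body>
  <a href="photo.jpg">lowercase extension</a>
  <a href="PHOTO.JPG">uppercase extension</a>
  <a href="archive.jpg.html">.jpg in the middle</a>
  <a href="about.html">not an image</a>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# Same idea as my current query, with a shortened extension list
contains_hrefs = tree.xpath(
    "//a[contains(@href, '.jpg') or contains(@href, '.jpeg')]/@href"
)

# PHOTO.JPG is missed (case), archive.jpg.html is wrongly included (position)
print(contains_hrefs)  # ['photo.jpg', 'archive.jpg.html']
```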
I wanted to use a regexp instead, but it didn't work:
//a[regx:match(@href,'.*\.(?:png|jpg|jpeg)')]
This returned all the links every time...
Does anyone know the right, elegant way to do this, or what is wrong with my regexp approach?
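For reference, this is roughly the behaviour I'm after, as a hedged sketch rather than a confirmed fix. It assumes lxml's EXSLT regular-expressions support, with the prefix `re` bound to the EXSLT namespace and `re:test` used in place of my `regx:match`. The pattern is anchored with `$` and the `'i'` flag makes it case-insensitive. The sample markup is made up for illustration:

```python
from lxml import html

# EXSLT regular-expressions namespace; the "re" prefix is my own choice
EXSLT_RE = "http://exslt.org/regular-expressions"

# Hypothetical sample markup, invented for this sketch
SAMPLE = """
<html><body>
  <a href="photo.jpg">lowercase extension</a>
  <a href="PHOTO.JPG">uppercase extension</a>
  <a href="archive.jpg.html">.jpg in the middle</a>
  <a href="about.html">not an image</a>
</body></html>
"""

tree = html.fromstring(SAMPLE)

# re:test returns a boolean; '$' anchors the extension to the end of the href
regex_hrefs = tree.xpath(
    r"//a[re:test(@href, '\.(?:png|jpg|jpeg)$', 'i')]/@href",
    namespaces={"re": EXSLT_RE},
)

print(regex_hrefs)  # ['photo.jpg', 'PHOTO.JPG']
```

I haven't checked how this behaves with query strings or fragments after the extension (e.g. `photo.jpg?size=large`), so the pattern would probably need adjusting for real pages.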