Extract path and file name from tag

Question

I have the source code of a web page. I need to find all occurrences of the <img> tag and extract the name and location of each picture. For example, from <img src="../images/test.jpg" /> I need path="../images/" and file="test.jpg". How can I do that with regular expressions?

score 4 · Accepted Answer · answered Jan 29 '11 at 12:49

You should use lxml.html:

>>> from urllib2 import urlopen
>>> from lxml import html
>>> page = urlopen('http://www.amazon.co.uk/')
>>> page_source = html.parse(page)
>>> from pprint import pprint
>>> pprint(page_source.xpath('//img/@src'))
['http://g-ecx.images-amazon.com/images/G/02/gno/images/orangeBlue/navPackedSprites-UK-15._V202471918_.png',
 'http://g-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V192199769_.gif',
 'http://g-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V192199769_.gif',
 'http://g-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V192199769_.gif',
 'http://g-ecx.images-amazon.com/images/G/02/uk-marketing/xmas10/janbargains/uk-january-bargains-loz75._V175451391_.gif',
 'http://g-ecx.images-amazon.com/images/G/02/UK-Shoe/email/7_jan_11-amzn-sale-loz-1._V173375114_.png',
 'http://g-ecx.images-amazon.com/images/G/02/uk-jw/homepage/uk-wtch-police-roto._V185455265_.png',
 'http://g-ecx.images-amazon.com/images/G/02/kindle/shasta/merch/gw/shasta-gw-bestselling-01a-470x265._V173993687_.jpg',
 'http://ecx.images-amazon.com/images/I/412wF8LJ-uL._SL135_.jpg',
 'http://ecx.images-amazon.com/images/I/51YC5H64AuL._SL135_.jpg',
 'http://ecx.images-amazon.com/images/I/41%2BdpTvM1FL._SL135_.jpg',
 'http://z-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V192234675_.gif',
 'http://z-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V192234675_.gif',
 'http://g-ecx.images-amazon.com/images/G/01/x-locale/common/transparent-pixel._V42752373_.gif',
 'http://z-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V192234675_.gif',
 'http://z-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V192234675_.gif',
 'http://g-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V192199769_.gif',
 'http://ecx.images-amazon.com/images/I/51-kiOR0NwL._SL135_.jpg',
 'http://ecx.images-amazon.com/images/I/51DRc-7HuxL._SL135_.jpg',
 'http://ecx.images-amazon.com/images/I/51SK5htD22L._SL135_.jpg',
 'http://z-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V192234675_.gif',
 'http://z-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V192234675_.gif',
 'http://z-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V192234675_.gif',
 'http://z-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V192234675_.gif',
 'http://ecx.images-amazon.com/images/I/31POT%2BzL1tL._SS120_RO10,1,201,225,243,255,255,255,15_.jpg',
 'http://ecx.images-amazon.com/images/I/41hkDkhjrTL._SS120_RO10,1,201,225,243,255,255,255,15_.jpg',
 'http://ecx.images-amazon.com/images/I/41zDYiAWasL._SS120_RO10,1,201,225,243,255,255,255,15_.jpg',
 'http://ecx.images-amazon.com/images/I/31HqB5H8j%2BL._SS120_RO10,1,201,225,243,255,255,255,15_.jpg',
 'http://g-ecx.images-amazon.com/images/G/02/uk-clothing/Lingerie/UK_APP_LingerieStore_50._V171062881_.png',
 'http://g-ecx.images-amazon.com/images/G/02/uk-pets/graphics/B000FVC1HE_50._V198692831_.jpg',
 'http://g-ecx.images-amazon.com/images/G/02/uk-grocery/images/illy_50._V198779066_.gif',
 'http://g-ecx.images-amazon.com/images/G/02/uk-electronics/MI_Store/UK_MIN_MILaunch_50._V191178779_.png',
 'http://g-ecx.images-amazon.com/images/G/02/uk-lighting/graphics/NoveltyLighting_50._V192237013_.jpg',
 'http://g-ecx.images-amazon.com/images/G/02/UK-Shoe/email/7_jan_11-amzn-sale-TCG-1._V173375108_.png',
 'http://g-ecx.images-amazon.com/images/G/02/gno/images/general/navAmazonLogoFooter._V192252709_.gif']
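The xpath call above returns each src value as one string, while the question asks for the directory and the file name separately. A minimal sketch of that last step, using only the standard library; `split_src` is a hypothetical helper name, and it assumes the file name is whatever follows the last "/" in the URL path (query strings and fragments are dropped):

```python
from urllib.parse import urlsplit, urlunsplit


def split_src(src):
    """Split an img src value into (path, file).

    Hypothetical helper: the path keeps its trailing slash, as in the
    question's example, and any query string or fragment is discarded.
    """
    parts = urlsplit(src)
    # rpartition returns ('', '', whole) when there is no slash at all.
    head, sep, tail = parts.path.rpartition("/")
    path = urlunsplit((parts.scheme, parts.netloc, head + sep, "", ""))
    return path, tail


if __name__ == "__main__":
    # Prints ('../images/', 'test.jpg')
    print(split_src("../images/test.jpg"))
```

The same function can be mapped over the list returned by `page_source.xpath('//img/@src')`.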

score 3 · Answer 2 · edited May 23 '17 at 09:58

You shouldn't use regular expressions to parse HTML for the various reasons outlined in this answer. You should use an HTML parser.
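As one possible illustration of the parser route, here is a sketch built on Python's standard-library `html.parser` (chosen here only because it needs no third-party install; any HTML parser would do). The names `ImgSrcCollector` and `img_sources` are made up for this example:

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; self-closing tags
        # such as <img ... /> are routed here by the default handler.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def img_sources(markup):
    """Return the src values of all img tags in an HTML string."""
    parser = ImgSrcCollector()
    parser.feed(markup)
    parser.close()
    return parser.sources


if __name__ == "__main__":
    # Prints ['../images/test.jpg']
    print(img_sources('<p><img src="../images/test.jpg" /></p>'))
```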

edited May 23 '17 at 09:58 by Community · answered Jan 29 '11 at 12:49 by Darin Dimitrov

score 0 · Answer 3 · answered Jan 29 '11 at 12:50

There are a number of ways. You can use capturing groups:

src=("[^"]+")

or lookbehind syntax

(?<=src=)"[^"]+"

There are probably other alternatives too. Either way, as the previous poster mentioned, you should probably use an HTML parser for the job. Still, if you do use a regex, you probably need to extract the img tags first and then run one of the regexes above.
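The two-step approach described here (pull out the img tags, then match the attribute) might look like the following in Python. This is only a sketch under several assumptions: the attribute of interest is src, its value is double-quoted, and the tags contain no ">" inside attribute values. The pattern and function names are invented for the example, and the second pattern uses two named groups so that the path and the file come out separately:

```python
import re

# Step 1: find each <img ...> tag (assumes no '>' inside attribute values).
IMG_TAG = re.compile(r'<img\b[^>]*>', re.IGNORECASE)

# Step 2: inside a tag, capture everything up to the last '/' as the path
# and the remainder as the file name (assumes a double-quoted src value).
SRC_ATTR = re.compile(
    r'\ssrc\s*=\s*"(?P<path>[^"]*/)?(?P<file>[^"/]+)"',
    re.IGNORECASE,
)


def extract_img_paths(markup):
    """Return a list of (path, file) tuples, one per img tag with a src."""
    results = []
    for tag in IMG_TAG.findall(markup):
        match = SRC_ATTR.search(tag)
        if match:
            # The path group is None when src has no directory part.
            results.append((match.group("path") or "", match.group("file")))
    return results


if __name__ == "__main__":
    # Prints [('../images/', 'test.jpg')]
    print(extract_img_paths('<img src="../images/test.jpg" />'))
```

Single-quoted or unquoted attribute values, and src values with a query string, are not handled, which is part of why a parser is the safer choice.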
