1

I have the source code of a web page, and I need to find all occurrences of the <img> tag and extract the name and location of each picture (for example, from <img src="../images/test.jpg" /> I need path="../images/" and file="test.jpg"). How can I do that with regular expressions?

martineau
Damir

3 Answers

4

You should use lxml.html:

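>>> from urllib.request import urlopen  # Python 3; on Python 2 this was urllib2.urlopen
>>> from lxml import html
>>> from pprint import pprint
>>> page = urlopen('http://www.amazon.co.uk/')
>>> page_source = html.parse(page)
>>> # XPath that selects the src attribute of every <img> element
>>> pprint(page_source.xpath('//img/@src'))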
['http://g-ecx.images-amazon.com/images/G/02/gno/images/orangeBlue/navPackedSprites-UK-15._V202471918_.png',
 'http://g-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V192199769_.gif',
 'http://g-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V192199769_.gif',
 'http://g-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V192199769_.gif',
 'http://g-ecx.images-amazon.com/images/G/02/uk-marketing/xmas10/janbargains/uk-january-bargains-loz75._V175451391_.gif',
 'http://g-ecx.images-amazon.com/images/G/02/UK-Shoe/email/7_jan_11-amzn-sale-loz-1._V173375114_.png',
 'http://g-ecx.images-amazon.com/images/G/02/uk-jw/homepage/uk-wtch-police-roto._V185455265_.png',
 'http://g-ecx.images-amazon.com/images/G/02/kindle/shasta/merch/gw/shasta-gw-bestselling-01a-470x265._V173993687_.jpg',
 'http://ecx.images-amazon.com/images/I/412wF8LJ-uL._SL135_.jpg',
 'http://ecx.images-amazon.com/images/I/51YC5H64AuL._SL135_.jpg',
 'http://ecx.images-amazon.com/images/I/41%2BdpTvM1FL._SL135_.jpg',
 'http://z-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V192234675_.gif',
 'http://z-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V192234675_.gif',
 'http://g-ecx.images-amazon.com/images/G/01/x-locale/common/transparent-pixel._V42752373_.gif',
 'http://z-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V192234675_.gif',
 'http://z-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V192234675_.gif',
 'http://g-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V192199769_.gif',
 'http://ecx.images-amazon.com/images/I/51-kiOR0NwL._SL135_.jpg',
 'http://ecx.images-amazon.com/images/I/51DRc-7HuxL._SL135_.jpg',
 'http://ecx.images-amazon.com/images/I/51SK5htD22L._SL135_.jpg',
 'http://z-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V192234675_.gif',
 'http://z-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V192234675_.gif',
 'http://z-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V192234675_.gif',
 'http://z-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V192234675_.gif',
 'http://ecx.images-amazon.com/images/I/31POT%2BzL1tL._SS120_RO10,1,201,225,243,255,255,255,15_.jpg',
 'http://ecx.images-amazon.com/images/I/41hkDkhjrTL._SS120_RO10,1,201,225,243,255,255,255,15_.jpg',
 'http://ecx.images-amazon.com/images/I/41zDYiAWasL._SS120_RO10,1,201,225,243,255,255,255,15_.jpg',
 'http://ecx.images-amazon.com/images/I/31HqB5H8j%2BL._SS120_RO10,1,201,225,243,255,255,255,15_.jpg',
 'http://g-ecx.images-amazon.com/images/G/02/uk-clothing/Lingerie/UK_APP_LingerieStore_50._V171062881_.png',
 'http://g-ecx.images-amazon.com/images/G/02/uk-pets/graphics/B000FVC1HE_50._V198692831_.jpg',
 'http://g-ecx.images-amazon.com/images/G/02/uk-grocery/images/illy_50._V198779066_.gif',
 'http://g-ecx.images-amazon.com/images/G/02/uk-electronics/MI_Store/UK_MIN_MILaunch_50._V191178779_.png',
 'http://g-ecx.images-amazon.com/images/G/02/uk-lighting/graphics/NoveltyLighting_50._V192237013_.jpg',
 'http://g-ecx.images-amazon.com/images/G/02/UK-Shoe/email/7_jan_11-amzn-sale-TCG-1._V173375108_.png',
 'http://g-ecx.images-amazon.com/images/G/02/gno/images/general/navAmazonLogoFooter._V192252709_.gif']
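The question also asks for the location and the file name separately. Once the src values are in hand, one way to split them is at the last slash. This is only a sketch: `split_src` is a hypothetical helper name, and it assumes the src value carries no query string containing a slash.

```python
def split_src(src):
    """Split an img src value into (path, file) at the last '/'.

    The path keeps its trailing slash; it is '' when src has no slash.
    Assumption: src has no query string or fragment containing '/'.
    """
    head, sep, tail = src.rpartition('/')
    return head + sep, tail


print(split_src('../images/test.jpg'))  # → ('../images/', 'test.jpg')
```

The same helper could be mapped over the list returned by the XPath query, which would keep the host as part of the path for absolute URLs.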
virhilo
3

You shouldn't use regular expressions to parse HTML for the various reasons outlined in this answer. You should use an HTML parser.
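If installing a third-party parser is not an option, the standard library's `html.parser` module may be enough for this task. The following is a minimal sketch, not a definitive implementation: the class name `ImgSrcCollector` and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; self-closing tags
        # such as <img ... /> are also routed here by default.
        if tag == 'img':
            for name, value in attrs:
                if name == 'src' and value:
                    self.sources.append(value)


# Hypothetical sample markup, based on the example in the question.
sample = '<p><img src="../images/test.jpg" /><img alt="x" src="a/b.png"></p>'
collector = ImgSrcCollector()
collector.feed(sample)
collector.close()
print(collector.sources)  # → ['../images/test.jpg', 'a/b.png']
```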

Community
Darin Dimitrov
0

There are a number of ways. You can use capturing groups:

path=("[^"]+")

or lookbehind syntax:

(?<=path=)"[^"]+" 

There are probably a bunch of other alternatives too. Either way, as the previous poster mentioned, you should probably use an HTML parser for the job. Still, if you do use regex, you probably need to extract the img tags first and then run one of the regexes above on each tag.
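The two-step idea (img tags first, then the attribute) might look like the sketch below in Python. Note that it targets `src` rather than `path=`, since `src` is the attribute in the question's example markup. The pattern names, the `extract_images` helper, and the sample string are assumptions for illustration, and the patterns only handle double-quoted attribute values.

```python
import re

# Step 1: match each <img ...> tag as a whole.
IMG_TAG = re.compile(r'<img\b[^>]*>', re.IGNORECASE)

# Step 2: inside one tag, capture the directory part (optional, ends
# with '/') and the file name of a double-quoted src attribute.
SRC_ATTR = re.compile(
    r'\ssrc\s*=\s*"(?P<path>[^"]*/)?(?P<file>[^"/]+)"',
    re.IGNORECASE,
)


def extract_images(html_text):
    """Return a list of (path, file) tuples, one per <img> with a src."""
    results = []
    for tag in IMG_TAG.findall(html_text):
        match = SRC_ATTR.search(tag)
        if match:
            results.append((match.group('path') or '', match.group('file')))
    return results


sample = '<div><img src="../images/test.jpg" /><img class="x" src="logo.png"></div>'
print(extract_images(sample))  # → [('../images/', 'test.jpg'), ('', 'logo.png')]
```

Single-quoted or unquoted attributes, comments, and scripts that contain text resembling an img tag would all trip this up, which is part of why a parser is the safer choice.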

Johan Sjöberg