4

How can I use a regular expression to get the src of an image from the following HTML string in Python?

<td width="80" align="center" valign="top"><font style="font-size:85%;font-family:arial,sans-serif"><a href="http://news.google.com/news/url?sa=t&amp;fd=R&amp;usg=AFQjCNFqz8ZCIf6NjgPPiTd2LIrByKYLWA&amp;url=http://www.news.com.au/business/spain-victory-faces-market-test/story-fn7mjon9-1226390697278"><img src="//nt3.ggpht.com/news/tbn/380jt5xHH6l_FM/6.jpg" alt="" border="1" width="80" height="80" /><br /><font size="-2">NEWS.com.au</font></a></font></td>

I tried to use

matches = re.search('@src="([^"]+)"',text)
print(matches[0])

But I got nothing.

Jeff Tratner
Don Li

3 Answers

9

Instead of regex, you could consider using BeautifulSoup:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(text, 'html.parser')  # text holds the HTML string from the question
>>> soup.findAll('img')
[<img src="//nt3.ggpht.com/news/tbn/380jt5xHH6l_FM/6.jpg" alt="" border="1" width="80" height="80" />]
>>> soup.findAll('img')[0]['src']
u'//nt3.ggpht.com/news/tbn/380jt5xHH6l_FM/6.jpg'
clopez
fraxel
  • 1
    Wouldn't Beautiful Soup add a lot of overhead to the solution? `img` tags are relatively easy to parse (and since they don't enclose other text, they are usually formatted correctly). – Jeff Tratner Jun 11 '12 at 15:21
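
On the overhead concern, one possible middle ground is the standard library's html.parser module, which needs no third-party install. The following is only a rough sketch: the class name ImgSrcCollector and the variable fragment are made up for illustration, and the HTML is shortened to the part of the question's string that contains the image.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Hypothetical helper: collects the src of every <img> tag it is fed."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and names are lowercased.
        # Self-closing tags such as <img ... /> also reach this method,
        # because the default handle_startendtag calls handle_starttag.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value is not None:
                    self.sources.append(value)


# Shortened from the HTML string in the question.
fragment = ('<td><img src="//nt3.ggpht.com/news/tbn/380jt5xHH6l_FM/6.jpg" '
            'alt="" border="1" width="80" height="80" /><br /></td>')

collector = ImgSrcCollector()
collector.feed(fragment)
collector.close()
print(collector.sources)
```

Whether this is actually lighter than Beautiful Soup for a given workload would need measuring; the sketch only shows that a real parser is available without extra dependencies.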
6

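Just drop the @ from the regex and it will work. The @ prefix is XPath attribute syntax; in a regex it is a literal character, so the pattern looks for the text @src=", which never appears in the HTML.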
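
A minimal sketch of the corrected call, assuming text holds the HTML from the question (shortened here to just the img tag). It also guards against re.search returning None and reads the captured group rather than the whole match:

```python
import re

# Shortened from the question: only the <img> tag of the HTML string.
text = ('<img src="//nt3.ggpht.com/news/tbn/380jt5xHH6l_FM/6.jpg" '
        'alt="" border="1" width="80" height="80" />')

# The original pattern with the leading "@" finds nothing.
assert re.search(r'@src="([^"]+)"', text) is None

# Without the "@", the pattern matches the literal text src="...".
match = re.search(r'src="([^"]+)"', text)

# re.search returns None when there is no match, so check before reading groups.
# group(0) is the whole match (src="..."); group(1) is just the captured URL.
src = match.group(1) if match else None
print(src)
```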

xpda
buckley
-1

You could simplify your regex a little by using a non-greedy match:

match = re.search(r'src="(.*?)"', text)
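
One caveat with a bare src="..." pattern is that it returns the first src attribute in the text, whatever tag it belongs to. The sketch below illustrates this with a made-up sample string (the script tag is not in the question's HTML) and shows one possible way to restrict the match to img tags:

```python
import re

# Hypothetical sample: a script tag and an image both carry a src attribute.
sample = ('<script src="app.js"></script>'
          '<img alt="" src="//nt3.ggpht.com/news/tbn/380jt5xHH6l_FM/6.jpg" />')

# The simplified pattern picks up the first src it sees, here the script's.
first_src = re.search(r'src="(.*?)"', sample).group(1)

# Anchoring on <img limits matches to image tags; [^>]* cannot leave the tag.
img_srcs = re.findall(r'<img\b[^>]*\bsrc="(.*?)"', sample)

print(first_src)
print(img_srcs)
```

This still assumes double-quoted attribute values, as in the question's HTML.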
Joel Cornett