0

I have a string like this:

 <img src="http://www.askgamblers.com/cache/97299a130feb2e59a08a08817daf2c0e6825991f_begado-casino-logo-review1.jpg" /><br/>
 Begado is the newest online casino in our listings. As the newest
 member of the Affactive group, Begado features NuWorks slots and games
 for both US and international players.
<img src="http://feeds.feedburner.com/~r/AskgamblesCasinoNews/~4/SXhvCskjiYo" height="1" width="1"/>

I need to get the src from the first img tag.

Is there an easy way to do this?

yital9
  • 2
    Anytime I see HTML, my brain immediately goes to BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html). Check here (http://stackoverflow.com/questions/5815747/beautifulsoup-getting-href) for a similar question. – RocketDonkey Oct 31 '12 at 21:16
  • Also http://stackoverflow.com/questions/12937144/image-scraping-program-in-python-not-functioning-as-intended – Vortexfive Oct 31 '12 at 21:16

4 Answers

4

For HTML screen-scraping in Python, I recommend the Beautiful Soup library.

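from bs4 import BeautifulSoup

# html_doc is the string containing the HTML from the question
soup = BeautifulSoup(html_doc, 'html.parser')  # name the parser explicitly
images = soup.find_all('img')  # all <img> tags, in document order
print(images[0]['src'])  # src of the first <img> tag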
Robert Cooper
2

Obligatory "don't parse HTML with regex" warning: https://stackoverflow.com/a/1732454/505154

Evil regex solution:

import re
re.findall(r'<img\s*src="([^"]*)"\s*/>', text)

This returns a list containing the src value of every <img> tag whose only attribute is src. In your sample that is just the first tag, which is the one you said you want.
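If only the first image is needed, `re.search` stops at the first match instead of collecting all of them. The following is a minimal sketch rather than a robust HTML parser: the helper name `first_img_src` and the looser pattern (which tolerates other attributes before `src`) are assumptions made for illustration, and it still assumes double-quoted attribute values.

```python
import re

# Assumed pattern: "<img", any attributes that contain no ">", then src="...".
IMG_SRC = re.compile(r'<img\b[^>]*?\ssrc="([^"]*)"', re.IGNORECASE)

def first_img_src(text):
    """Return the src of the first <img> tag in text, or None if there is none."""
    match = IMG_SRC.search(text)
    return match.group(1) if match else None

# Made-up sample with the same shape as the string in the question.
sample = ('<img src="http://example.com/a.jpg" /><br/> some text '
          '<img src="http://example.com/b.gif" height="1" width="1"/>')
print(first_img_src(sample))
```

Returning `None` when nothing matches avoids the `IndexError` that indexing an empty `findall` result would raise.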

Community
Andrew Clark
0

One way to do this would be to use a regex.

Another way is to split the string by quotes and then take the second element that is returned.

splits = your_string.split('"')
print(splits[1])
Jeff Gortmaker
0

This is a quick and ugly way to do it without any library:

"""
    >>> get_src(data)
    ['http://www.askgamblers.com/cache/97299a130feb2e59a08a08817daf2c0e6825991f_begado-casino-logo-review1.jpg', 'http://feeds.feedburner.com/~r/AskgamblesCasinoNews/~4/SXhvCskjiYo']
"""

data = """<img src="http://www.askgamblers.com/cache/97299a130feb2e59a08a08817daf2c0e6825991f_begado-casino-logo-review1.jpg" /><br/>
 Begado is the newest online casino in our listings. As the newest
 member of the Affactive group, Begado features NuWorks slots and games
 for both US and international players.
<img src="http://feeds.feedburner.com/~r/AskgamblesCasinoNews/~4/SXhvCskjiYo" height="1" width="1"/>"""

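def get_src(text):
    srcs = []
    for line in text.splitlines():  # use the argument, not the global data
        start = line.find('src="')
        if start == -1:
            continue  # no src attribute on this line
        i = start + len('src="')  # index of the first character of the URL
        f = line.find('"', i)  # index of the closing quote
        if f != -1:
            srcs.append(line[i:f])
    return srcs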

However, I would recommend using Beautiful Soup. It's a really nice library designed to deal with the real web (broken HTML and all). Alternatively, you could use ElementTree from the Python standard library if your data is valid XML.
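As a rough sketch of the ElementTree route: the snippet in the question has several top-level nodes, so it is not a single XML document as it stands. The example below assumes it is acceptable to wrap the fragment in a made-up `<root>` element first, and the helper name `first_img_src_xml` is hypothetical too. It only works while the markup stays well-formed (every tag closed, no bare `&`); otherwise `ET.fromstring` raises a `ParseError`.

```python
import xml.etree.ElementTree as ET

def first_img_src_xml(fragment):
    """Return the src of the first <img> in a well-formed fragment, or None."""
    # Wrap in a dummy root because XML allows only one top-level element.
    root = ET.fromstring('<root>' + fragment + '</root>')
    img = root.find('.//img')  # first <img> in document order, at any depth
    return img.get('src') if img is not None else None

# Made-up sample with the same shape as the string in the question.
fragment = ('<img src="http://example.com/a.jpg" /><br/> some text '
            '<img src="http://example.com/b.gif" height="1" width="1"/>')
print(first_img_src_xml(fragment))
```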

Facundo Casco