python : easy substring/parsing

Question

I have a string like this:

 <img src="http://www.askgamblers.com/cache/97299a130feb2e59a08a08817daf2c0e6825991f_begado-casino-logo-review1.jpg" /><br/>
 Begado is the newest online casino in our listings. As the newest
 member of the Affactive group, Begado features NuWorks slots and games
 for both US and international players.
<img src="http://feeds.feedburner.com/~r/AskgamblesCasinoNews/~4/SXhvCskjiYo" height="1" width="1"/>

I need to get the src from the first img tag.

Is there an easy way to do this?

Anytime I see HTML, my brain immediately goes to BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html). Check here (http://stackoverflow.com/questions/5815747/beautifulsoup-getting-href) for a similar question. — RocketDonkey, Oct 31 '12 at 21:16
Also http://stackoverflow.com/questions/12937144/image-scraping-program-in-python-not-functioning-as-intended — Vortexfive, Oct 31 '12 at 21:16

score 4 · Accepted Answer · answered Oct 31 '12 at 21:21

For HTML screen-scraping in Python, I recommend the Beautiful Soup library.

from bs4 import BeautifulSoup

# html_doc is the HTML string from the question
soup = BeautifulSoup(html_doc, 'html.parser')
images = soup.find_all('img')
print(images[0]['src'])

Robert Cooper

score 2 · Answer 2 · edited May 23 '17 at 11:48

Obligatory "don't parse HTML with regex" warning: https://stackoverflow.com/a/1732454/505154

Evil regex solution:

import re
re.findall(r'<img\s*src="([^"]*)"\s*/>', text)

This returns a list with the src value of every <img> tag whose only attribute is src. In your example that is just the first tag, which is the one you said you want.
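If only the first image matters and the tag may carry other attributes, a variation along these lines might help. This is a minimal sketch, not a general HTML parser: the helper name `first_img_src` and the `IMG_SRC` pattern are made up for illustration, and the pattern assumes double-quoted attribute values.

```python
import re

# Illustrative pattern (an assumption, not part of the original answer):
# match an <img ...> tag and capture the double-quoted src value,
# allowing other attributes before src.
IMG_SRC = re.compile(r'<img\b[^>]*?\bsrc="([^"]*)"', re.IGNORECASE)

def first_img_src(text):
    """Return the src of the first <img> tag in text, or None if there is none."""
    match = IMG_SRC.search(text)
    return match.group(1) if match else None
```

`re.search` stops at the first match, so there is no need to build the full list and index into it.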

answered Oct 31 '12 at 21:17

Andrew Clark

score 0 · Answer 3 · answered Oct 31 '12 at 21:16

One way to do this would be to use regex.

Another way is to split the string by quotes and then take the second element that is returned.

splits = your_string.split('"')
print(splits[1])
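Note that splitting on quotes only works when the src value is the first quoted thing in the string. A slightly safer sketch, assuming double-quoted attributes, anchors on the `src="` marker instead. The helper name `first_src` is hypothetical.

```python
def first_src(text):
    """Return the first double-quoted src value in text, or None if absent."""
    # Everything after the first 'src="' marker; marker is empty if not found.
    _, marker, rest = text.partition('src="')
    if not marker:
        return None
    # The value runs up to the next double quote.
    value, quote, _ = rest.partition('"')
    return value if quote else None
```

This is still plain string handling, so it will also pick up a `src="` that appears outside an img tag.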

Jeff Gortmaker

score 0 · Answer 4 · answered Oct 31 '12 at 21:34

This is a quick and ugly way to do it without any library:

"""
    >>> get_src(data)
    ['http://www.askgamblers.com/cache/97299a130feb2e59a08a08817daf2c0e6825991f_begado-casino-logo-review1.jpg', 'http://feeds.feedburner.com/~r/AskgamblesCasinoNews/~4/SXhvCskjiYo']
"""

data = """<img src="http://www.askgamblers.com/cache/97299a130feb2e59a08a08817daf2c0e6825991f_begado-casino-logo-review1.jpg" /><br/>
 Begado is the newest online casino in our listings. As the newest
 member of the Affactive group, Begado features NuWorks slots and games
 for both US and international players.
<img src="http://feeds.feedburner.com/~r/AskgamblesCasinoNews/~4/SXhvCskjiYo" height="1" width="1"/>"""

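def get_src(lines):
    srcs = []
    # Scan the argument line by line (the original looped over the global data)
    for line in lines.splitlines():
        start = line.find('src="')
        if start == -1:
            continue  # no src attribute on this line
        i = start + len('src="')
        f = line.find('"', i)
        if f != -1:
            srcs.append(line[i:f])
    return srcs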

However, I would recommend using Beautiful Soup. It's a really nice library designed to deal with the real web (broken HTML and all). Alternatively, you could use ElementTree from the Python standard library if your data is valid XML.
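For the ElementTree route, a minimal sketch might look like the following. It assumes the fragment is well-formed XML apart from lacking a single top-level element, so it wraps the text in a made-up `<root>` tag. The function name `first_img_src_xml` is hypothetical.

```python
import xml.etree.ElementTree as ET

def first_img_src_xml(fragment):
    """Return the src of the first <img> element, or None if there is none."""
    # Wrap in an invented <root> element so the parser sees one top-level node.
    root = ET.fromstring("<root>" + fragment + "</root>")
    # ".//img" searches all descendants; find() returns the first in document order.
    img = root.find(".//img")
    return img.get("src") if img is not None else None
```

`ET.fromstring` raises `xml.etree.ElementTree.ParseError` on input that is not well-formed (unclosed tags, a bare `&` in a URL), which is where Beautiful Soup is the more forgiving choice.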
