Regexp to parse HTML imgs

Question

I'm crawling through an HTML page and I want to extract the img srcs and the a hrefs.

On the particular site, all of them are encapsulated in double quotes.

I've tried a wide variety of regexps with no success. Assume characters inside the double-quotes will be [-\w/] (printable characters [a-zA-Z\d-_] and / and .)

In python:

re.search(r'img\s+src="(?P<src>[\w-/]+_"', line)

Doesn't return anything, but

re.search(r'img\s+src="(?P<src>[-\w[/]]+)"', line)

Returns way too much (i.e., does not stop at the " ).

I need help creating the right regexp. Thanks in advance!

True, can't parse html with regexes, but you can find certain things inside it, and for quick scripts etc. it may be the right tool. — OlliM, Apr 27 '12 at 15:58
@Daenyth, yes, I know that. I've tutored many people on the pumping lemma for regular and context-free grammars. The regexp I'm trying to find is simply a field inside of a tag, which is most certainly regular. — Bill, Apr 27 '12 at 16:04
@B.VB.: Regardless, *not* using regex is much easier. See my answer. — Daenyth, Apr 27 '12 at 16:08
@B.VB., no, because that `` could be inside a ``. Or a string inside a ` — josh3736, Apr 27 '12 at 16:09
I don't see where I claim that. It is most certainly not true. I needed help composing the proper regexp to parse the quote-encapsulated fields of an img tag. — Bill, Apr 27 '12 at 16:15
@B.VB.: Sorry I misread your comments. OK, can you please explain to us why you need to use a regex based solution instead of using an HTML parser? — Mark Byers, Apr 27 '12 at 16:20
Considering I upvoted your answer, I think I certainly will consider using the HTML parser you suggested if my project gets any more sophisticated. For something as simple as this though, hopefully a robust regexp is all I need. — Bill, Apr 27 '12 at 16:30

score 6 · Answer 1 · edited Sep 30 '12 at 14:17

I need help creating the right regexp.

No, you need help in finding the right tool.

Try BeautifulSoup.
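
BeautifulSoup is a third-party package. As a rough sketch of the same parser-based idea using only Python's standard library, the snippet below swaps in `html.parser.HTMLParser`. The class name `LinkCollector` and the sample markup are made up for illustration and are not from the question.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects img src and a href attribute values."""

    def __init__(self):
        super().__init__()
        self.img_srcs = []
        self.a_hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and names are lowercased.
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self.img_srcs.append(attrs["src"])
        elif tag == "a" and attrs.get("href"):
            self.a_hrefs.append(attrs["href"])


# Assumed sample input, standing in for a crawled page.
collector = LinkCollector()
collector.feed('<a href="/about"><img src="/images/logo.png" alt="logo"></a>')
print(collector.img_srcs)
print(collector.a_hrefs)
```

Because the parser handles quoting and attribute order itself, the code does not need to care whether `src` comes first in the tag or which quote character the page uses.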

(If you insist on using regular expressions - and I'd advise against it - try changing the greedy + to non-greedy +?).
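
To illustrate the greedy versus non-greedy difference, here is a minimal sketch. It assumes a pattern built on `.`, which can itself match a double quote; the sample line is invented for the example.

```python
import re

# Assumed sample line, not taken from the asker's site.
line = '<img src="/images/logo.png" alt="Site logo">'

# Greedy: .+ runs on to the LAST double quote on the line.
greedy = re.search(r'src="(.+)"', line).group(1)

# Non-greedy: .+? stops at the FIRST closing double quote.
lazy = re.search(r'src="(.+?)"', line).group(1)

print(greedy)  # /images/logo.png" alt="Site logo
print(lazy)    # /images/logo.png
```

The quantifier only changes the result when the repeated expression is able to match the closing quote; a character class that excludes `"` stops at the quote either way.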

edited Sep 30 '12 at 14:17

Kate Gregory

answered Apr 27 '12 at 15:53

Mark Byers

score 5 · Answer 2 · answered Apr 27 '12 at 16:06

Here's an example of a better way to do it than with regex, using the excellent lxml library and XPath:

In [1]: import lxml.html
In [2]: doc = lxml.html.parse('http://www.google.com/search?q=kittens&tbm=isch')
In [3]: doc.xpath('//img/@src')
Out[3]: 
['/images/nav_logo_hp2.png',
 'http://t1.gstatic.com/images?q=tbn:ANd9GcQhajNZimPGLw9iTfzrAF_HV5UogY-KGep5WYgw-VHZ15oaAwGquNb5Q2I',
 'http://t2.gstatic.com/images?q=tbn:ANd9GcS1LgVIlDgoIfNzwU4xBz9fL32ZJjZU26aB4aynRsEcz2VuXmjCtvxUonM',
 'http://t1.gstatic.com/images?q=tbn:ANd9GcRgouJt5Moe8uTnDPUFTo4csZOcBtEDA_B7WdRPe8pdZroR5QB2q_-LT59G',
 [...]
]

OlliM · Accepted Answer · 2012-04-27T16:02:18.367

2

A good trick for finding things inside quotes is the pattern "([^"]+)". It matches one or more characters that are not a quote, sitting between two quotes, so the match cannot run past the closing quote.
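
Applied to the question, the trick could look like the sketch below. The sample markup and the names `IMG_SRC` and `A_HREF` are assumptions for illustration, and the patterns rely on the asker's premise that every value is wrapped in double quotes.

```python
import re

# Hypothetical sample markup, not taken from the asker's site.
html = '''
<a href="/gallery/index.html"><img src="/images/cat-1.png" alt="cat"></a>
<a href="/about"><img class="thumb" src="/images/dog_2.jpg"></a>
'''

# [^"]+ matches everything up to, but not including, the next double quote.
# (?:[^>]*?\s)? skips any other attributes that appear before src / href.
IMG_SRC = re.compile(r'<img\s+(?:[^>]*?\s)?src="([^"]+)"')
A_HREF = re.compile(r'<a\s+(?:[^>]*?\s)?href="([^"]+)"')

img_srcs = IMG_SRC.findall(html)
a_hrefs = A_HREF.findall(html)
print(img_srcs)
print(a_hrefs)
```

This is fine for a quick script on markup you control, but it will still be fooled by tags inside comments or scripts, which is the case the parser-based answers guard against.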

For help with creating regular expressions I can strongly recommend Expresso ( http://www.ultrapico.com/Expresso.htm )

edited Apr 27 '12 at 16:02

answered Apr 27 '12 at 15:55

OlliM
