0

I'm crawling through an HTML page and I want to extract the img srcs and the a hrefs.

On the particular site, all of them are encapsulated in double quotes.

I've tried a wide variety of regexps with no success. Assume the characters inside the double quotes match [-\w/.] (that is, letters, digits, underscore, hyphen, / and .)

In python:

re.search(r'img\s+src="(?P<src>[\w-/]+_"', line)

Doesn't return anything, but

re.search(r'img\s+src="(?P<src>[-\w[/]]+)"', line)

Returns way too much (i.e., does not stop at the " ).

I need help creating the right regexp. Thanks in advance!

Bill
  • Obligatory: http://stackoverflow.com/a/1732454/350351 – Daenyth Apr 27 '12 at 15:54
  • True, can't parse html with regexes, but you can find certain things inside it, and for quick scripts etc. it may be the right tool. – OlliM Apr 27 '12 at 15:58
  • @Daenyth, yes, I know that. I've tutored many people on the pumping lemma for regular and context-free grammars. The regexp I'm trying to find is simply a field inside of a tag, which is most certainly regular. – Bill Apr 27 '12 at 16:04
  • @B.VB.: Regardless, *not* using regex is much easier. See my answer. – Daenyth Apr 27 '12 at 16:08
  • @B.VB., no, because that `` could be inside a ``. Or a string inside a ` – josh3736 Apr 27 '12 at 16:09
  • I don't see where I claim that. It is most certainly not true. I needed help composing the proper regexp to parse the quote-encapsulated fields of an img tag. – Bill Apr 27 '12 at 16:15
  • @B.VB.: Sorry I misread your comments. OK, can you please explain to us why you need to use a regex based solution instead of using an HTML parser? – Mark Byers Apr 27 '12 at 16:20
  • Considering I upvoted your answer, I will certainly consider using the HTML parser you suggested if my project gets any more sophisticated. For something as simple as this, though, hopefully a robust regexp is all I need. – Bill Apr 27 '12 at 16:30

3 Answers

6

"I need help creating the right regexp."

No, you need help in finding the right tool.

Try BeautifulSoup.

(If you insist on using regular expressions - and I'd advise against it - try changing the greedy + to non-greedy +?).
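To make the greedy versus non-greedy difference concrete, here is a minimal sketch. The sample `line` is made up for illustration (the real page's markup isn't shown in the question), and `.+` stands in for the asker's character class.

```python
import re

# Hypothetical sample line; not taken from the asker's page.
line = '<img src="/images/cat.png" alt="a cat"><a href="/about">About</a>'

# Greedy: .+ grabs as much as it can, so the match runs to the LAST
# double quote on the line.
greedy = re.search(r'img\s+src="(?P<src>.+)"', line)

# Non-greedy: .+? grabs as little as it can, so the match stops at the
# FIRST double quote after the value starts.
lazy = re.search(r'img\s+src="(?P<src>.+?)"', line)

print(greedy.group('src'))
print(lazy.group('src'))
```

The greedy version swallows the following attributes and tags, which is probably the "returns way too much" symptom; the non-greedy version yields only the src value.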

Kate Gregory
Mark Byers
5

Here's an example of a better way to do it than with regex, using the excellent lxml library and XPath:


In [1]: import lxml.html
In [2]: doc = lxml.html.parse('http://www.google.com/search?q=kittens&tbm=isch')
In [3]: doc.xpath('//img/@src')
Out[3]: 
['/images/nav_logo_hp2.png',
 'http://t1.gstatic.com/images?q=tbn:ANd9GcQhajNZimPGLw9iTfzrAF_HV5UogY-KGep5WYgw-VHZ15oaAwGquNb5Q2I',
 'http://t2.gstatic.com/images?q=tbn:ANd9GcS1LgVIlDgoIfNzwU4xBz9fL32ZJjZU26aB4aynRsEcz2VuXmjCtvxUonM',
 'http://t1.gstatic.com/images?q=tbn:ANd9GcRgouJt5Moe8uTnDPUFTo4csZOcBtEDA_B7WdRPe8pdZroR5QB2q_-LT59G',
 [...]
]
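If lxml isn't installed, the same parser-based idea can be sketched with the standard library's `html.parser` instead. This swaps in a different technique from the XPath query above; the `LinkCollector` class and the sample `page` string are hypothetical names used only for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect img src and a href values (illustrative helper, not an lxml API)."""

    def __init__(self):
        super().__init__()
        self.srcs = []
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        attrs = dict(attrs)
        if tag == 'img' and attrs.get('src'):
            self.srcs.append(attrs['src'])
        elif tag == 'a' and attrs.get('href'):
            self.hrefs.append(attrs['href'])


# Hypothetical page fragment, used here so the example needs no network access.
page = '<p><a href="/kittens">Kittens</a> <img src="/images/nav_logo.png" alt=""></p>'

collector = LinkCollector()
collector.feed(page)
collector.close()

print(collector.srcs)
print(collector.hrefs)
```

Because the parser hands over attribute values already unquoted, the quoting style on the page stops mattering.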
Daenyth
2

A good trick for finding things inside quotes is the pattern "([^"]+)". It matches any run of characters other than a quote that sits between two quotes, so the match cannot run past the closing quote.
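Applied to the question, the negated character class might look like the sketch below. The sample `html` string and the names `IMG_SRC` and `A_HREF` are assumptions made for illustration, and the patterns assume double-quoted attribute values as described in the question.

```python
import re

# Hypothetical fragment; attribute values are double-quoted, as on the asker's site.
html = '<a href="/gallery/index.html"><img src="/img/kitten-01.jpg"></a>'

# [^"]+ matches everything up to, but not including, the closing quote.
# [^>]*? lets other attributes precede src/href without leaving the tag.
IMG_SRC = re.compile(r'<img\b[^>]*?\ssrc="([^"]+)"')
A_HREF = re.compile(r'<a\b[^>]*?\shref="([^"]+)"')

srcs = IMG_SRC.findall(html)
hrefs = A_HREF.findall(html)

print(srcs)
print(hrefs)
```

Unlike a greedy `.+`, `[^"]+` cannot consume the closing quote at all, so no non-greedy modifier is needed for the value itself.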

For help with creating regular expressions I can strongly recommend Expresso (http://www.ultrapico.com/Expresso.htm).

OlliM