1

I'm trying to extract two strings from this string using Regular Expressions -

'<img src="http://images.efollett.com/books/978/047/012/9780470129296.gif" width="80" height="100" alt="Organic Chemistry I (as Second Language)" />'

I want the URL after src and the text after alt (here, Organic Chemistry I (as Second Language)).

I've tried ('<img src=(\w+)" width'), ('<img src="(\w+)"') and ('src="(\w+)"\swidth') for the URL, and all of them return empty.

I've also tried ('alt="(\w+)"') for the name and again, no luck.

Can anyone help?

praks5432

4 Answers

3

Use lxml.

import lxml.html

html_string = '<img src="http://images.efollett.com/books/978/047/012/9780470129296.gif" width="80" height="100" alt="Organic Chemistry I (as Second Language)" />'

img = lxml.html.fromstring(html_string)

print("src:", img.get("src"))
print("alt:", img.get("alt"))

Gives:

src: http://images.efollett.com/books/978/047/012/9780470129296.gif
alt: Organic Chemistry I (as Second Language)
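
If lxml is not available, a similar result can probably be had with the standard library's html.parser. This is only a minimal sketch of that alternative, not part of the lxml approach above; the class name ImgAttrs and its images list are made up for illustration.

```python
from html.parser import HTMLParser

html_string = '<img src="http://images.efollett.com/books/978/047/012/9780470129296.gif" width="80" height="100" alt="Organic Chemistry I (as Second Language)" />'


class ImgAttrs(HTMLParser):
    """Hypothetical helper: collects the attributes of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        # Self-closing tags such as <img ... /> also reach this method.
        if tag == "img":
            self.images.append(dict(attrs))


parser = ImgAttrs()
parser.feed(html_string)

img = parser.images[0]
print("src:", img["src"])
print("alt:", img["alt"])
```
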
Acorn
2

Although you should not be parsing HTML with regexes, I can point out a common regex error here: your use of \w. It matches only letters, digits, and underscores, so it stops at the colon, slashes, and dots in the URL and at the spaces and parentheses in the alt text. To pull data out of attributes, use "([^"]*)" or "(.*?)" instead.
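
To make the difference concrete, here is a small sketch using Python's re module on the string from the question. The variable names are only for illustration.

```python
import re

tag = '<img src="http://images.efollett.com/books/978/047/012/9780470129296.gif" width="80" height="100" alt="Organic Chemistry I (as Second Language)" />'

# \w+ matches "http" and then hits ':', so the closing quote is never reached.
word_only = re.search(r'src="(\w+)"', tag)

# [^"]* accepts any character except a double quote.
src = re.search(r'src="([^"]*)"', tag)

# .*? lazily accepts anything up to the next double quote.
alt = re.search(r'alt="(.*?)"', tag)

print(word_only)     # None
print(src.group(1))  # the full URL
print(alt.group(1))  # Organic Chemistry I (as Second Language)
```
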

Community
Ray Toal
  • Two questions: first, how else would I extract the information that I want (I'm using Beautiful Soup, and the other form of the above is a BeautifulSoup tag)? Second, what regex can I use to get what I want? – praks5432 Sep 12 '11 at 07:02
  • 1
    Oh apologies then, I did not know you were using Beautiful Soup, which *is* an HTML parser! There are hints in [this SO question](http://stackoverflow.com/questions/2612548/extracting-an-attribute-value-with-beautifulsoup). – Ray Toal Sep 12 '11 at 07:05
1

You can try r'<img[^>]*\ssrc="(.*?)"' and r'<img[^>]*\salt="(.*?)"'.

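I don't know if you are dealing with full HTML. [^>]* keeps the match inside the tag's angle brackets. \s requires whitespace before the attribute name, so an attribute such as "xxxsrc" is not matched by mistake, and it also handles an attribute that starts on a new line.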
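
A quick sketch of how the two patterns behave in Python. The tag below is a made-up example (not from the question) with src on its own line and a decoy data-src attribute, to show what \s buys you.

```python
import re

SRC_RE = re.compile(r'<img[^>]*\ssrc="(.*?)"')
ALT_RE = re.compile(r'<img[^>]*\salt="(.*?)"')

# Hypothetical tag: src after a newline, plus a decoy attribute ending in "src".
tag = '<img\n  src="real.gif" data-src="lazy.gif" alt="A (test) image" />'

src = SRC_RE.search(tag).group(1)
alt = ALT_RE.search(tag).group(1)

# Without \s, the greedy [^>]* backtracks to the last 'src="', i.e. the decoy.
loose = re.search(r'<img[^>]*src="(.*?)"', tag).group(1)

print(src)    # real.gif
print(alt)    # A (test) image
print(loose)  # lazy.gif
```
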

eph
0

I don't know Python, but maybe this regular expression helps?

<img.*?src="([^"]*)".*?alt="([^"]*)".*?>
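
For what it's worth, a rough sketch of how this pattern could be used from Python's re module. The swapped tag is a made-up example; it shows that the pattern only matches when src appears before alt.

```python
import re

IMG_RE = re.compile(r'<img.*?src="([^"]*)".*?alt="([^"]*)".*?>')

tag = '<img src="http://images.efollett.com/books/978/047/012/9780470129296.gif" width="80" height="100" alt="Organic Chemistry I (as Second Language)" />'

# Both values come back from a single match, as groups 1 and 2.
src, alt = IMG_RE.search(tag).groups()
print(src)
print(alt)

# Hypothetical tag with alt before src: the pattern finds nothing.
swapped = '<img alt="Cover" src="cover.gif" />'
no_match = IMG_RE.search(swapped)
print(no_match)  # None
```
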
scessor
  • This works provided the src comes before the alt. Also a tip for efficiency: don't use `.*` in the middle of a regex. `.*?` is more appropriate in this case. – Ray Toal Sep 12 '11 at 07:11
  • Thanks, updated. You're right, only if the string is given as described in the question (`alt` after `src` attribute) this regex makes sense. – scessor Sep 12 '11 at 07:20