Python Regex String Extraction

Question

I'm trying to extract two strings from this string using regular expressions:

'<img src="http://images.efollett.com/books/978/047/012/9780470129296.gif" width="80" height="100" alt="Organic Chemistry I (as Second Language)" />'

I want the URL after src and the text after alt (that is, the URL and Organic Chemistry I (as Second Language)).

I've tried ('<img src=(\w+)" width'), ('<img src="(\w+)"') and ('src="(\w+)"\swidth') for the URL, and all of them return empty.

I've also tried ('alt="(\w+)"') for the name and again, no luck.

Can anyone help?

score 3 · Answer 1 · answered Sep 12 '11 at 10:14

Use lxml.

import lxml.html

html_string = '<img src="http://images.efollett.com/books/978/047/012/9780470129296.gif" width="80" height="100" alt="Organic Chemistry I (as Second Language)" />'

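# Parse the fragment; for a single tag, fromstring returns that element.
img = lxml.html.fromstring(html_string)

# Attributes are read with .get(); the print() form runs on Python 3.
print("src:", img.get("src"))
print("alt:", img.get("alt"))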

Gives:

src: http://images.efollett.com/books/978/047/012/9780470129296.gif
alt: Organic Chemistry I (as Second Language)

score 2 · Answer 2 · edited May 23 '17 at 12:31


Although you should not be parsing HTML with regexes, I can point out a common regex error here, which is your use of \w. It matches only letters, digits, and underscores. It does not match slashes, colons, periods, spaces, or parentheses, so it cannot span the URL or the alt text. If you are trying to pull data out of attributes, use "([^"]*)" or "(.*?)"
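To make the difference concrete, here is a minimal sketch using Python's `re` module on the string from the question. The variable names are only for illustration, and it assumes attribute values are double-quoted.

```python
import re

tag = ('<img src="http://images.efollett.com/books/978/047/012/9780470129296.gif" '
       'width="80" height="100" alt="Organic Chemistry I (as Second Language)" />')

# \w+ stops at ':', '/', '.', spaces and parentheses, so neither pattern matches.
print(re.search(r'src="(\w+)"', tag))  # None
print(re.search(r'alt="(\w+)"', tag))  # None

# [^"]* accepts any character except the closing quote.
src = re.search(r'src="([^"]*)"', tag).group(1)
alt = re.search(r'alt="([^"]*)"', tag).group(1)
print(src)
print(alt)
```

The lazy form `"(.*?)"` gives the same result on this input.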

answered Sep 12 '11 at 06:58 · Ray Toal · edited May 23 '17 at 12:31 by Community

Two questions: first, how else would I extract the information that I want (I'm using Beautiful Soup, and the above also exists as a BeautifulSoup tag)? Second, what regex can I use to get what I want? – praks5432 Sep 12 '11 at 07:02

Oh apologies then, I did not know you were using Beautiful Soup, which *is* an HTML parser! There are hints in [this SO question](http://stackoverflow.com/questions/2612548/extracting-an-attribute-value-with-beautifulsoup). – Ray Toal Sep 12 '11 at 07:05
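For readers without Beautiful Soup or lxml installed, the same parser-based idea can be sketched with the standard library's `html.parser`. This is a stand-in, not Beautiful Soup's API, and the class name `ImgAttrs` is made up for the example.

```python
from html.parser import HTMLParser

tag = ('<img src="http://images.efollett.com/books/978/047/012/9780470129296.gif" '
       'width="80" height="100" alt="Organic Chemistry I (as Second Language)" />')


class ImgAttrs(HTMLParser):
    """Collects the attributes of every <img> tag as a dict (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag_name, attrs):
        # attrs is a list of (name, value) pairs; self-closing tags also land here.
        if tag_name == "img":
            self.images.append(dict(attrs))


parser = ImgAttrs()
parser.feed(tag)
parser.close()

first = parser.images[0]
print(first["src"])
print(first["alt"])
```

Because the parser handles quoting and attribute order itself, no pattern needs to be tuned to the exact shape of the tag.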

eph · Accepted Answer · 2011-09-12T07:10:04.430

score 1

You can try r'<img[^>]*\ssrc="(.*?)"' and r'<img[^>]*\salt="(.*?)"'.

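I don't know if you are dealing with HTML. [^>]* keeps the match inside the tag's angle brackets. \s requires whitespace right before the attribute name, which avoids matching names that merely end in src (such as "xxxsrc") and also handles attributes that start on a new line.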
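A small sketch of what the \s buys you. The tag below is a made-up example with a look-alike `data-src` attribute, not the string from the question.

```python
import re

# Hypothetical tag: data-src ends in "src" but is a different attribute.
tag = '<img src="real.gif" data-src="lazy.gif" alt="Cover">'

# Without \s, the greedy [^>]* backtracks to the last 'src="', which is data-src.
loose = re.search(r'<img[^>]*src="(.*?)"', tag).group(1)

# With \s, only a name preceded by whitespace can match.
strict = re.search(r'<img[^>]*\ssrc="(.*?)"', tag).group(1)
alt = re.search(r'<img[^>]*\salt="(.*?)"', tag).group(1)

print(loose)   # lazy.gif
print(strict)  # real.gif
print(alt)     # Cover
```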

answered Sep 12 '11 at 07:03 · eph · edited Sep 12 '11 at 07:10

This works but backtracks. Probably okay for small img tags. +1 for correctness. – Ray Toal Sep 12 '11 at 07:14

scessor · Answer 4 · 2011-09-12T07:26:55.943

score 0

I don't know Python, but maybe this regular expression helps?

<img.*?src="([^"]*)".*?alt="([^"]*)".*?>

answered Sep 12 '11 at 07:02 · scessor · edited Sep 12 '11 at 07:26

This works provided the src comes before the alt. Also a tip for efficiency: don't use `.*` in the middle of a regex. `.*?` is more appropriate in this case. – Ray Toal Sep 12 '11 at 07:11
Thanks, updated. You're right: this regex only makes sense if the string is as given in the question (`alt` attribute after `src`). – scessor Sep 12 '11 at 07:20
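The ordering limitation discussed in these comments can be seen in a short sketch. The `alt_first` string and the `attributes` helper are hypothetical, and the helper assumes every attribute value is double-quoted.

```python
import re

src_first = ('<img src="http://images.efollett.com/books/978/047/012/9780470129296.gif" '
             'width="80" height="100" alt="Organic Chemistry I (as Second Language)" />')
# Hypothetical variant of the same tag with alt placed before src.
alt_first = ('<img alt="Organic Chemistry I (as Second Language)" '
             'src="http://images.efollett.com/books/978/047/012/9780470129296.gif" />')

combined = re.compile(r'<img.*?src="([^"]*)".*?alt="([^"]*)".*?>')
print(combined.search(src_first).groups())  # both values captured
print(combined.search(alt_first))           # None: no alt follows src


def attributes(tag):
    """Return every name="value" pair in the tag as a dict, in any order."""
    return dict(re.findall(r'([\w-]+)="([^"]*)"', tag))


print(attributes(alt_first)["src"])
print(attributes(alt_first)["alt"])
```

Collecting the pairs into a dict sidesteps the ordering problem, though it is still a regex over markup and inherits the usual caveats.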
