2

I have a string:

<a href="/p/123411/"><img src="/p_img/411/123411/639469aa9f_123411_100.jpg" alt="ABCDXYZ" />

What is the regex to find ABCDXYZ in Python?

John

3 Answers

5

Don't use regex to parse HTML. Use BeautifulSoup.

from bs4 import BeautifulSoup as BS
text = '''<a href="/p/123411/"><img src="/p_img/411/123411/639469aa9f_123411_100.jpg" alt="ABCDXYZ" />'''
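soup = BS(text, 'html.parser')  # name the parser explicitly so bs4 does not have to guess one
print(soup.find('img').attrs['alt'])  # print() is a function in Python 3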
jdotjdot
3

If you're looking for the value of that alt attribute, you can do this:

>>> import re
>>> r = r'alt="(.*?)"'

Then:

>>> m = re.search(r, mystring)
>>> m.group(1)
'ABCDXYZ'

And you can use re.findall if you want to find more than one.
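As a rough sketch of the re.findall variant, assuming a made-up input with two img tags (the `html` string and the `ALT_PATTERN` name are hypothetical, not from the question):

```python
import re

# Same pattern as above: lazily capture whatever sits between alt=" and the next ".
ALT_PATTERN = re.compile(r'alt="(.*?)"')

# Hypothetical input with two img tags, invented for illustration.
html = (
    '<a href="/p/1/"><img src="/a.jpg" alt="FIRST" /></a>'
    '<a href="/p/2/"><img src="/b.jpg" alt="SECOND" /></a>'
)

# findall returns the contents of the single capture group for every match.
alts = ALT_PATTERN.findall(html)
print(alts)  # ['FIRST', 'SECOND']
```

With exactly one capture group in the pattern, findall returns a flat list of strings rather than match objects.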

However, this code will be easily fooled by something like this:

<span>Here's some text explaining how to do alt="foo" in an img tag.</span>

On the other hand, it'll also fail to pick up something like this:

<img src='/p_img/411/123411/639469aa9f_123411_100.jpg' alt='ABCDXYZ' />
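Both failure modes are easy to reproduce. A small sketch, reusing the pattern from above (the variable names are only for illustration):

```python
import re

ALT_PATTERN = re.compile(r'alt="(.*?)"')

# Plain prose that merely mentions an alt attribute.
prose = """<span>Here's some text explaining how to do alt="foo" in an img tag.</span>"""

# A real img tag, but with single-quoted attributes.
single_quoted = "<img src='/p_img/411/123411/639469aa9f_123411_100.jpg' alt='ABCDXYZ' />"

false_positive = ALT_PATTERN.findall(prose)   # matches text that is not an attribute
missed = ALT_PATTERN.findall(single_quoted)   # finds nothing in a valid tag

print(false_positive)  # ['foo']
print(missed)          # []
```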

How do you deal with that? The short answer is: You don't. XML and HTML are not regular languages.

It's worth backing up here to point out that Python's re engine is not actually a true regular expression engine (it supports backreferences and lookaround, which go beyond regular languages), and on top of that, it's embedded in a Turing-complete programming language. So it is obviously possible to build an HTML parser around Python and re. This answer shows part of a parser written in Perl, where regexes do most of the heavy lifting. But that doesn't mean you should do it this way. You shouldn't be writing a parser in the first place, given that perfectly good ones already exist, and if you did, you shouldn't force yourself to use regexes even when there's an easier way to do what you want. For quick-and-dirty experimentation, regex is fine. For a production program, it's almost always the wrong answer.

One way to convince your boss to let you use a parser is by crafting a suite of tests that are all obviously valid, and that cannot possibly be handled by any regex-based solution short of a full parser. If you can come up with a test that can be parsed, but only using exponential backtracking, and therefore takes 12 hours with regex vs. 0.1 seconds with bs4, even better, but that's a bit trickier…
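A minimal sketch of what such a test suite could look like. It uses the standard library's html.parser instead of bs4 only to stay dependency-free, and the cases, `AltCollector`, and the helper names are all hypothetical:

```python
import re
from html.parser import HTMLParser

ALT_PATTERN = re.compile(r'alt="(.*?)"')


class AltCollector(HTMLParser):
    """Collects the alt attribute of every img tag it sees."""

    def __init__(self):
        super().__init__()
        self.alts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "img":
            self.alts.extend(value for name, value in attrs if name == "alt")


def alts_by_parser(html):
    collector = AltCollector()
    collector.feed(html)
    collector.close()
    return collector.alts


def alts_by_regex(html):
    return ALT_PATTERN.findall(html)


# (input, expected alt values) pairs that are all obviously valid HTML snippets.
CASES = [
    ('<img src="/a.jpg" alt="ABCDXYZ" />', ["ABCDXYZ"]),
    ("<img src='/a.jpg' alt='ABCDXYZ' />", ["ABCDXYZ"]),
    ('<span>how to do alt="foo" in an img tag</span>', []),
    ('<IMG SRC="/a.jpg" ALT="ABCDXYZ">', ["ABCDXYZ"]),
]

regex_failures = [html for html, want in CASES if alts_by_regex(html) != want]
parser_failures = [html for html, want in CASES if alts_by_parser(html) != want]

print(len(regex_failures), "regex failures;", len(parser_failures), "parser failures")
```

Each failing case is a concrete input to put in front of the boss, which tends to be more persuasive than the abstract argument.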

Of course it's also worth looking for articles online (and SO questions like this and this and the 300 other dups) and picking the best ones to show your boss.

If you really can't convince your boss otherwise, then you're done at this point. Given what's been specified, this works. Given what may or may not actually be intended, nothing short of mind-reading will work. As you find more and more real-life cases that fail, you can hack it up by adding more and more complex alternations and/or context onto the regex itself, or possibly use a series of regexes and post-filters, until finally you get sick of it and find yourself a better job.

abarnert
  • thanks. I actually succeeded with parser, but my boss wants me to use regular expression. – John Jan 07 '13 at 05:14
  • Unless there's some reason I can't see here, your boss is very misguided in asking you to use a regular expression. There's a reason parsers exist. – jdotjdot Jan 07 '13 at 05:15
  • I would advise that you tell your boss that he really doesn't know what he is talking about if he tells you to use regex. See [here](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Amelia Jan 07 '13 at 05:15
  • @John: Too bad you can't fire your boss. :) But maybe you can explain to him why it's impossible. Create some test input that cannot possibly be parsed by any regular expression, explain to him why the code he made you write is wrong, and convince him to let you do it right. (This is one of the many nice things about test-driven development. It's a lot harder to argue with a failing test that's obviously valid than with someone telling you "HTML isn't a regular language".) – abarnert Jan 07 '13 at 05:15
  • As it is already, abarnert, your regex doesn't cover all cases, since maybe the strings are enclosed in single quotes. Really, @John, convince him regex is not the answer. If speed is an issue, use `lxml` instead of BeautifulSoup. – jdotjdot Jan 07 '13 at 05:18
  • @jdotjdot: Given the horribly underspecified problem statement, even ignoring the fact that the problem would likely become impossible if it were properly specified, it seems the best answer. Of course it'll probably turn out to be both too strict in some ways and not strict enough in others… but with just one line as the only input, it handles that one line properly. My answer already explained that, but I'll try to make it more explicit. – abarnert Jan 07 '13 at 05:22
  • @abarnert Sorry, my comment wasn't clear--I totally agree with you. I was trying to point out to John that even your otherwise very good answer right off the bat has major problems, and that's because regex for HTML sucks. Was definitely not criticizing you. – jdotjdot Jan 07 '13 at 05:25
  • @jdotjdot: Well, I didn't think you were criticizing me, just criticizing my answer. And I think that putting your example into my answer made it better; if so, the criticism was appropriate (or useful, or whatever the right measure is). So, no problem. – abarnert Jan 07 '13 at 05:29
0

First, a disclaimer: You shouldn't be using regular expressions to parse HTML. You can use BeautifulSoup for this.

Next, if you are actually serious about using regular expressions and the above is the exact case you want then you could do something like:

<a href="[a-zA-Z0-9/]+"><img src="[a-zA-Z0-9/_.]+" alt="([a-zA-Z0-9/]+)" />

and you could then read the captured text from the match object with group(1) (or the groups() method, which returns all captured groups as a tuple).
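A sketch of applying a pattern of this shape to the string from the question. Note that the src value contains `_` and `.`, so a character class limited to letters, digits and `/` does not match it; the wider class below is an assumption about what the URLs may contain, and the variable names are illustrative:

```python
import re

# The sample string from the question.
mystring = '<a href="/p/123411/"><img src="/p_img/411/123411/639469aa9f_123411_100.jpg" alt="ABCDXYZ" />'

# Narrow src class: letters, digits and "/" only, so "_" and "." stop the match.
narrow = re.compile(
    r'<a href="[a-zA-Z0-9/]+"><img src="[a-zA-Z0-9/]+" alt="([a-zA-Z0-9/]+)" />'
)

# Widened src class that also accepts "_" and "." (an assumption, see above).
wide = re.compile(
    r'<a href="[a-zA-Z0-9/]+"><img src="[a-zA-Z0-9/_.]+" alt="([a-zA-Z0-9/]+)" />'
)

narrow_match = narrow.search(mystring)
wide_match = wide.search(mystring)

# search returns None when nothing matches, so guard before calling group().
alt_text = wide_match.group(1) if wide_match else None

print(narrow_match)  # None
print(alt_text)      # ABCDXYZ
```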

Ian Stapleton Cordasco
  • What reason do we have to believe that being inside an `a` tag—and one with a relative URL, and no other attributes—is at all relevant here? In the absence of a realistic problem statement (which would make the problem impossible), it's probably best to assume the simplest possible interpretation. – abarnert Jan 07 '13 at 05:20
  • He was fairly specific. In being as simple as possible you're also not answering his question as accurately as possible. I doubt he would have pulled the example out of nowhere if it weren't something he were dealing with specifically. If he can guarantee the conditions he provided (which he likely cannot) then the above works. – Ian Stapleton Cordasco Jan 07 '13 at 05:24