5

I'm trying to make a simple Python-based HTML parser using regular expressions. My problem is trying to get my regex search query to find all the possible matches, then store them in a tuple.

Let's say I have a page with the following stored in the variable HTMLtext:

<ul>
<li class="active"><b><a href="/blog/home">Back to the index</a></b></li>
<li><b><a href="/blog/about">About Me!</a></b></li>
<li><b><a href="/blog/music">Audio Production</a></b></li>
<li><b><a href="/blog/photos">Gallery</a></b></li>
<li><b><a href="/blog/stuff">Misc</a></b></li>
<li><b><a href="/blog/contact">Shoot me an email</a></b></li>
</ul>

I want to perform a regex search on this text and return a tuple containing the last URL directory of each link. So, I'd like to return something like this:

pages = ["home", "about", "music", "photos", "stuff", "contact"]

So far, I'm able to use regex to search for one result:

pages = [re.compile('<a href="/blog/(.*)">').search(HTMLtext).group(1)]

Running this expression gives pages = ['home'].

How can I get the regex search to continue for the whole text, appending the matched text to this tuple?

(Note: I know I probably should NOT be using regex to parse HTML. But I want to know how to do this anyway.)

hao_maike

5 Answers

2

Use the findall function of the re module:

import re

pages = re.findall(r'<a href="/blog/([^"]*)">', HTMLtext)
print(pages)

Output:

['home', 'about', 'music', 'photos', 'stuff', 'contact']
ovgolovin
  • @tchrist You are right. I didn't look in the pattern itself. The way the OP wrote it `.*` consumes all the symbols till the end of the line and then backtracks to match the following `"` which slows down the parsing. I'll correct the pattern in my answer. – ovgolovin Mar 24 '12 at 20:57
  • That doesn’t work unless there are newlines in the HTML — which is rare — and there is only one such link per line. See my answer for how to fix. Yes, I like your fix: the negated charclass is going to be not only more efficient, but also *more correct*, than a minimal match. – tchrist Mar 24 '12 at 20:59
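The newline dependence discussed in these comments can be checked with a small sketch. The two-link sample strings below are hypothetical stand-ins for the question's HTML, once with a newline between the links and once without; by default `.` does not match a newline, which is the only thing that keeps the greedy `.*` from running across links.

```python
import re

# Hypothetical sample data: the same two links, with and without a newline.
multi_line = ('<li><a href="/blog/home">Home</a></li>\n'
              '<li><a href="/blog/about">About</a></li>')
single_line = multi_line.replace("\n", "")

greedy = re.compile(r'<a href="/blog/(.*)">')       # pattern from the question
negated = re.compile(r'<a href="/blog/([^"]*)">')   # negated character class

# With one link per line, the greedy pattern happens to work.
greedy_multi = greedy.findall(multi_line)

# On a single line, .* runs to the last "> and swallows both links.
greedy_single = greedy.findall(single_line)

# The negated class cannot cross the closing quote, so layout does not matter.
negated_single = negated.findall(single_line)

print(greedy_multi)
print(greedy_single)
print(negated_single)
```

In this sketch the greedy pattern returns one long, wrong capture for the single-line input, while the negated character class returns one entry per link.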
2

Your pattern won’t work on all inputs, including yours. The .* is too greedy (technically, it finds a maximal match), so it matches from the first href to the last corresponding close. The two simplest ways to fix this are to use either a minimal match or a negated character class.

# minimal match approach
pages = re.findall(r'<a\s+href="/blog/(.+?)">',
                   full_html_text, re.I | re.S)

# negated charclass approach
pages = re.findall(r'<a\s+href="/blog/([^"]+)">',
                   full_html_text, re.I)

Obligatory Warning

For simple and reasonably well-constrained text, regexes are just fine; after all, that’s why we use regex search-and-replace in our text editors when editing HTML! However, it gets more and more complicated the less you know about the input, such as

  • if there’s some other field intervening between the <a and the href, like <a title="foo" href="bar">
  • casing issues like <A HREF='foo'>
  • whitespace issues
  • alternate quotes like href='/foo/bar' instead of href="/foo/bar"
  • embedded HTML comments

That’s not an exhaustive list of concerns; there are others. So using regexes on HTML is possible, but whether it’s expedient depends on too many other factors to judge.
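For input where those concerns do apply, one alternative is the standard library's `html.parser` module. This is only a minimal sketch, not something from the original answer: the class name `BlogLinkParser`, the helper `blog_pages`, and the sample markup are all made up for illustration, and it assumes the links of interest start with `/blog/`.

```python
from html.parser import HTMLParser


class BlogLinkParser(HTMLParser):
    """Collect the last path segment of every <a href="/blog/..."> link."""

    def __init__(self):
        super().__init__()
        self.pages = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names and strips the quotes,
        # so casing, quote style, and extra attributes need no special handling.
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and href.startswith("/blog/"):
            self.pages.append(href.rstrip("/").rsplit("/", 1)[-1])


def blog_pages(html_text):
    """Hypothetical helper: return the page names found in html_text."""
    parser = BlogLinkParser()
    parser.feed(html_text)
    parser.close()
    return parser.pages


# Made-up sample covering several of the listed concerns.
sample = """<A TITLE="x" HREF='/blog/home'>Home</A>
<!-- <a href="/blog/hidden">commented out</a> -->
<a   href="/blog/about">About</a>"""

print(blog_pages(sample))
```

The link inside the HTML comment is skipped because the parser reports comments separately from start tags, which is hard to get right with a single pattern.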

However, from the little example you’ve shown, it looks perfectly ok for your own case. You just have to spiff up your pattern and call the right method.

tchrist
  • From what I've read, negated character class is faster than non-greedy quantifier (because it avoids a lot of backtracking steps). – ovgolovin Mar 24 '12 at 21:01
  • @ovgolovin You are 100% right that the negated charclass is faster. There is also a correctness issue. In general, a pattern like `A.*?B` does not actually stop `B` from occurring in the `.*?` part; for that, you have to include a lookahead negation, like `A(?:(?!B).)*B`. This can happen if you write `A.*?BC` because to make `C` true, it may have to include `B` in the `.*?`. Simplistically such a string is `"AxxxBxxxBC"`. – tchrist Mar 24 '12 at 21:04
  • @tchrist Thanks for this elegant solution (and the informative warning). I'm just learning regex, so the discussion about greedy/non-greedy patterns is very helpful. – hao_maike Mar 24 '12 at 21:08
  • @mr_schlomo If you’re just learning regexes, you’ll want to get into the habit of using **raw strings** for Python patterns, like `r'…'`, to avoid double backslashes. You might look at [my other regex answers](http://stackoverflow.com/search?q=user%3A471272+%5Bregex%5D). It’s true that most (albeit not all) of them are in Perl, but often this doesn’t matter, as the pattern translates directly into Python without any trouble. For the hairier examples that involve ***Unicode properties*** like `\p{Greek}` or `\p{Dash}`, you’d have to use Matthew Barnett’s `regex` library for Python 2 and 3 both. – tchrist Mar 24 '12 at 21:17
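The correctness point about `A.*?BC` raised in the comments above can be tried directly with the `re` module. This is a small sketch using the comment's own example string; the variable names are just for illustration.

```python
import re

text = "AxxxBxxxBC"

# Minimal match: .*? is allowed to grow past the first B so that BC can match.
lazy = re.search(r"A(.*?)BC", text)

# Lookahead negation: no character consumed inside the group may start a B.
tempered = re.search(r"A((?:(?!B).)*)BC", text)

lazy_capture = lazy.group(1)
print(lazy_capture)   # the capture contains a B
print(tempered)       # no match: the first B is not followed by C
```

Here the minimal match succeeds with a capture that includes `B`, while the lookahead version refuses to match at all, which is the difference the comment describes.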
1

Use findall instead of search:

>>> pages = re.compile('<a href="/blog/(.*)">').findall(HTMLtext)
>>> pages
['home', 'about', 'music', 'photos', 'stuff', 'contact']
Simeon Visser
  • @mr_schlomo That won’t work unless there are actually newlines in your HTML, and there is only one such link per line. There are other issues, too; see my answer’s obligatory warning. – tchrist Mar 24 '12 at 20:58
1

The re.findall() function and the re.finditer() function are used to find multiple matches.
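As a rough sketch of the difference between the two: `findall()` returns the captured strings directly, while `finditer()` yields match objects, which is handy when the position of each match is also needed. The two-link `html_text` below is a shortened, assumed version of the question's markup, and the pattern uses a negated character class rather than the question's `.*`.

```python
import re

# Shortened sample based on the question's HTML (assumed for illustration).
html_text = ('<li><b><a href="/blog/home">Back to the index</a></b></li>\n'
             '<li><b><a href="/blog/about">About Me!</a></b></li>')

link_re = re.compile(r'<a href="/blog/([^"]*)">')

# findall() returns a list of the captured group for every match.
pages = link_re.findall(html_text)

# finditer() yields match objects, so the offset of each capture is available.
matches = [(m.group(1), m.start(1)) for m in link_re.finditer(html_text)]

print(pages)
for name, offset in matches:
    print(name, "found at offset", offset)
```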

Raymond Hettinger
1

To find all results, use findall(). You also only need to compile the regex once; after that you can reuse it.

href_re = re.compile('<a href="/blog/(.*)">')  # Compile the regexp once

pages = href_re.findall(HTMLtext)  # Find all matches - ["home", "about",
Mariusz Jamro
  • That won’t work on most HTML pages, because you are assuming newlines to stop the greedy `.*`, and also that there is only one link per line. – tchrist Mar 24 '12 at 20:58
  • @tchrist I think nobody actually looked into the pattern. They just answered the question (about `findall`). I don't think it's good to overlook such mistakes, but that's how things are (nobody cared about anything apart from the actual question). It's very good that you noticed and pointed out the mistake in the pattern. – ovgolovin Mar 24 '12 at 21:07
  • @ovgolovin It’s gotten so such things just jump right out at me. You might say that I’m a native speaker of regexese, as [these hundreds of answers](http://stackoverflow.com/search?q=user%3A471272+%5Bregex%5D) should show. :) BTW, for Python regexes, I recommend Matthew Barnett’s replacement `regex` module; it handles Unicode ***much, much better*** than the `re` module, and does a bunch of other cool stuff, too. – tchrist Mar 24 '12 at 21:11