Python regular expression matches 0 or 1 repetitions

Question

I want to match HTML headers (<h1> to <h6>) with a Python regular expression. Some of the headers have an 'id' attribute, and I want to capture it in a group.

With the following expression I only get the header that has an id attribute:

>>> re.findall(r'<h[1-6].*?(id=\".*?\").*?</h[1-6].*?>','<h1>Header1</h1><h2 id="header2">header2</h2>')
['id="header2"']

A question mark makes the RE match 0 or 1 repetitions of the preceding RE. But if I put a ? after the closing parenthesis, it returns two empty strings:

>>> re.findall(r'<h[1-6].*?(id=\".*?\")?.*?</h[1-6].*?>','<h1>Header1</h1><h2 id="header2">header2</h2>')
['', '']

How can I use one regular expression to get the following result?

['', 'id="header2"']

Try removing the question mark after `(id=\".*?\")` in your second regex. — Jerry, Aug 19 '13 at 12:54
First read: [this](http://stackoverflow.com/a/1732454/2199958) and then use [BeautifulSoup](https://pypi.python.org/pypi/BeautifulSoup) :) — Viktor Kerkez, Aug 19 '13 at 12:57
Then it is the same as the first regex, output is `['id="header2"']` — aaron cheung, Aug 19 '13 at 13:03

score 5 · Accepted Answer · answered Aug 19 '13 at 13:00

You are using the wrong tool. Don't use regular expressions to parse HTML. Use an HTML parser instead.

The BeautifulSoup library makes your task trivial:

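from bs4 import BeautifulSoup

# htmlsource is the HTML document as a string.
# Name the parser explicitly; bs4 warns when it has to guess one.
soup = BeautifulSoup(htmlsource, 'html.parser')

# find_all accepts a list of tag names and returns matches in document order.
headers = soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])
print([h.attrs.get('id', '') for h in headers])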

Demo:

>>> from bs4 import BeautifulSoup
>>> htmlsource = '<h1>Header1</h1><h2 id="header2">header2</h2>'
>>> soup = BeautifulSoup(htmlsource, 'html.parser')
>>> headers = soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])
>>> [h.attrs.get('id', '') for h in headers]
['', 'header2']
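
If adding a third-party dependency is not an option, the same approach can be sketched with the standard library's `html.parser`. The class name `HeaderIdCollector` and its `ids` attribute are made up for this illustration; treat it as a minimal sketch, not a drop-in replacement for BeautifulSoup.

```python
from html.parser import HTMLParser

HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class HeaderIdCollector(HTMLParser):
    """Hypothetical helper: record the id (or '') of every h1-h6 start tag."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs.
        if tag in HEADER_TAGS:
            # A missing id, or an id without a value, becomes ''.
            self.ids.append(dict(attrs).get("id") or "")


collector = HeaderIdCollector()
collector.feed('<h1>Header1</h1><h2 id="header2">header2</h2>')
collector.close()
print(collector.ids)  # ['', 'header2']
```

Like the BeautifulSoup version, this returns the bare attribute value rather than the `id="..."` text.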

DeltaKappa · Answer 2 · 2013-08-19T15:03:39.783

1

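The '.' does match spaces. The problem is that the lazy `.*?` in front of the optional group first matches nothing, so the group is tried directly after `<h2`, where the next character is a space rather than `id`. Because the group is optional, it is skipped and the rest of the pattern matches without it. Including the leading spaces in the group lets it match at that position. One possibility would be: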

>>> re.findall(r'<h[1-6].*?( +id=\".*?\" ?)?.*?</h[1-6].*?>','<h1>Header1</h1><h2 id="header2">header2</h2>')
['', ' id="header2"']
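
To see why the optional group in the question comes back empty, and to get exactly the list the question asks for (without the leading space), the snippet below may help. The names `original`, `opening_tag` and `ids` are illustrative, and the second pattern is an assumption of this sketch rather than something taken from the answers: it matches only the opening tag and moves the lazy scan inside the optional group. It still has the usual limits of regex on HTML; for example, a `>` inside an attribute value or a single-quoted id would defeat it.

```python
import re

html = '<h1>Header1</h1><h2 id="header2">header2</h2>'

# Pattern from the question: the lazy '.*?' before the optional group matches
# zero characters, the group then fails at the space after '<h2' and is
# skipped, and the rest of the pattern matches without it.
original = r'<h[1-6].*?(id=".*?")?.*?</h[1-6].*?>'
print(re.findall(original, html))  # ['', '']

# Sketch of an alternative: the lazy scan sits inside the optional group, so
# the engine searches the opening tag for ' id="..."' before giving up on it.
opening_tag = re.compile(r'<h[1-6](?:[^>]*?\s(id="[^"]*"))?[^>]*>')
ids = opening_tag.findall(html)
print(ids)  # ['', 'id="header2"']
```

Requiring whitespace before `id` keeps attributes such as `data-id` from being picked up by mistake.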

edited Aug 19 '13 at 15:03

answered Aug 19 '13 at 13:26

