-1

I want to match the HTML headers <h1> to <h6> with a Python regular expression. Some of the headers have an 'id' attribute, and I want to capture it in a group.

With the following expression I only get the header that has an id attribute.

>>> re.findall(r'<h[1-6].*?(id=\".*?\").*?</h[1-6].*?>','<h1>Header1</h1><h2 id="header2">header2</h2>')
['id="header2"']

A question mark causes the RE to match 0 or 1 repetitions of the preceding RE. But if I put a ? after the right parenthesis, findall returns two empty strings.

>>> re.findall(r'<h[1-6].*?(id=\".*?\")?.*?</h[1-6].*?>','<h1>Header1</h1><h2 id="header2">header2</h2>')
['', '']

How can I use a single regular expression to get the following result?

['', 'id="header2"']
aaron cheung

2 Answers

5

You are using the wrong tool. Don't use regular expressions to parse HTML. Use an HTML parser instead.

The BeautifulSoup library makes your task trivial:

from bs4 import BeautifulSoup

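# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(htmlsource, 'html.parser')

headers = soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])
# Headers without an id attribute yield an empty string
print([h.attrs.get('id', '') for h in headers])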

Demo:

>>> from bs4 import BeautifulSoup
>>> htmlsource = '<h1>Header1</h1><h2 id="header2">header2</h2>'
>>> soup = BeautifulSoup(htmlsource, 'html.parser')
>>> headers = soup.find_all(['h1', 'h2', 'h3', 'h4', 'h5', 'h6'])
>>> [h.attrs.get('id', '') for h in headers]
['', 'header2']
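
If installing a third-party library is not an option, the same parser-based idea can be sketched with the standard library's html.parser module. This is only a minimal illustration, not a replacement for BeautifulSoup; the class name HeaderIdCollector and its ids attribute are made up for this example.

```python
from html.parser import HTMLParser


class HeaderIdCollector(HTMLParser):
    """Hypothetical helper: records the id of every <h1>..<h6> start tag."""

    HEADER_TAGS = {'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag names and passes attrs as (name, value) pairs.
        if tag in self.HEADER_TAGS:
            # Headers without an id (or with a valueless id) yield ''.
            self.ids.append(dict(attrs).get('id') or '')


collector = HeaderIdCollector()
collector.feed('<h1>Header1</h1><h2 id="header2">header2</h2>')
collector.close()
print(collector.ids)  # ['', 'header2']
```

As with the BeautifulSoup version, this returns the attribute value rather than the literal id="..." text.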
Martijn Pieters
1

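The '.' does match spaces; the problem is the lazy .*? in front of the optional group. It first matches nothing, so the optional group is tried immediately after <h2, where the next character is a space rather than "id". The group is therefore skipped, and the later .*? consumes the attribute instead. Letting the group itself consume the leading spaces fixes this. One possibility would be: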

>>> re.findall(r'<h[1-6].*?( +id=\".*?\" ?)?.*?</h[1-6].*?>','<h1>Header1</h1><h2 id="header2">header2</h2>')
['', ' id="header2"']
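
Note that the captured text above includes the leading space. A possible variant, sketched here under the assumption that attribute values are always double-quoted, keeps the whitespace outside the capturing group and limits the search to the opening tag with [^>]. The names HEADER_RE and header_ids are made up for this example, and the pattern remains fragile compared with a real parser.

```python
import re

# Hypothetical pattern: group 1 is the header level, group 2 the id attribute.
HEADER_RE = re.compile(
    r'<h([1-6])\b'                # opening tag; level kept for the backreference
    r'(?:[^>]*?\s(id="[^"]*"))?'  # optional id attribute anywhere inside the tag
    r'[^>]*>'                     # remainder of the opening tag
    r'.*?</h\1\s*>',              # lazy body up to the matching closing tag
    re.DOTALL | re.IGNORECASE,
)


def header_ids(html):
    """Return the id="..." text of each header, or '' when it has none."""
    # An optional group that did not participate in the match is None.
    return [m.group(2) or '' for m in HEADER_RE.finditer(html)]


print(header_ids('<h1>Header1</h1><h2 id="header2">header2</h2>'))
# ['', 'id="header2"']
```

Because the whitespace before id is required but not captured, an attribute such as data-id is not mistaken for id.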
DeltaKappa