Python Regex and HTML

Question

First, I am painfully aware that parsing HTML with regex is not "good form". However, I am dealing with poorly formed HTML that does not validate when parsed with tools such as lxml.

My goal is to select only the span elements that contain a br element. Below is my attempt:

Set up the sample input:

import re
xx = '<div> <span>123</span> <span>456 <br> 789</span>  </div>'

This identifies the two spans properly, but only when the ? is present. I don't understand why this is the case.

re.findall('<span>.*?</span>', xx)
['<span>123</span>', '<span>456 <br> 789</span>']

I expected the following to select only the span that contains the br tag. Instead, the match runs from the opening tag of the first span to the closing tag of the last span, so only a single entry is returned.

re.findall('<span>.*?<br>.*?</span>', xx)
['<span>123</span> <span>456 <br> 789</span>']

Please explain why I am seeing this behavior.

Use a parser: [H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) — ctwheels, Mar 23 '18 at 20:11
@ctwheels that was my first thought too, but OP said a parser doesn't work because the text does not validate. — pault, Mar 23 '18 at 20:12
There are parsers that can work with invalid markup. Where `lxml` may have failed, another parser may be adequate for the case. Especially given what is being extracted (at least in the given example) IS valid markup. — sytech, Mar 23 '18 at 20:15
@MK. - I just found that package. I am going to try it. Hope it works. — Alex, Mar 23 '18 at 20:19
@pault It's just a matter of seeing numbers at left side of a comment — revo, Mar 23 '18 at 20:23

score 0 · Answer 1 · answered Mar 23 '18 at 20:38

Depending on your other requirements, you could do something like

re.findall('<span>[^<]*<br>.*?</span>', xx)

to match only the span that contains a <br>. In general, though, use a parser, as the comments suggest.
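To see why the pattern in the question spans both elements, it may help to compare the patterns side by side. The regex engine returns the leftmost match: it starts at the first <span>, and the lazy .*? then keeps expanding until a <br> can be found, even if that means crossing a </span>. Laziness only shortens what is consumed once the starting point is fixed. The sketch below reuses xx from the question. The variable names are only illustrative, and the third pattern (a negative lookahead that refuses to cross </span>) is an addition of mine rather than part of the answer above; it may be useful if other tags can appear before the <br>.

```python
import re

xx = '<div> <span>123</span> <span>456 <br> 789</span>  </div>'

# Pattern from the question: the match starts at the FIRST <span>, and the
# lazy .*? grows until it reaches a <br>, crossing the first </span>.
lazy_matches = re.findall(r'<span>.*?<br>.*?</span>', xx)
print(lazy_matches)
# ['<span>123</span> <span>456 <br> 789</span>']

# Pattern from the answer: [^<]* cannot cross any tag, so the attempt that
# starts at the first <span> fails and the engine moves on to the second.
narrow_matches = re.findall(r'<span>[^<]*<br>.*?</span>', xx)
print(narrow_matches)
# ['<span>456 <br> 789</span>']

# Assumed alternative (not from the answer): allow any character before the
# <br>, as long as it does not begin a closing </span>.
tempered_matches = re.findall(r'<span>(?:(?!</span>).)*?<br>.*?</span>', xx)
print(tempered_matches)
# ['<span>456 <br> 789</span>']
```

None of these patterns handle nested spans or attributes such as <span class="...">, which is one more reason a tolerant parser is likely the safer route.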

matli
