First, I am painfully aware that parsing HTML with regex is not "good form". However, I am dealing with poorly formed HTML that does not validate when parsed with tools such as lxml.
My goal is to select only the span elements that contain a br element. Below is my attempt:
Set up the sample input:
import re
xx = '<div> <span>123</span> <span>456 <br> 789</span> </div>'
This identifies the two spans properly, but only when the ? is present. I don't understand why this is the case.
re.findall('<span>.*?</span>', xx)
['<span>123</span>', '<span>456 <br> 789</span>']
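For comparison, here is a minimal sketch of the same pattern with and without the ?, run against the same sample input. The names greedy and lazy are only illustrative. Without the ?, .* is greedy: it consumes as much as it can and then backtracks to the last </span> in the string. With the ?, .*? is lazy: it consumes as little as it can and stops at the first </span> it reaches.

```python
import re

xx = '<div> <span>123</span> <span>456 <br> 789</span> </div>'

# Greedy: .* grabs as much as possible, then backtracks to the LAST </span>.
greedy = re.findall('<span>.*</span>', xx)

# Lazy: .*? grabs as little as possible, so it stops at the FIRST </span>.
lazy = re.findall('<span>.*?</span>', xx)

print(greedy)  # ['<span>123</span> <span>456 <br> 789</span>']
print(lazy)    # ['<span>123</span>', '<span>456 <br> 789</span>']
```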
I would have thought the following would select only the span containing the br tag. Instead, it matches from the opening tag of the first span to the closing tag of the last span, so only one entry is returned.
re.findall('<span>.*?<br>.*?</span>', xx)
['<span>123</span> <span>456 <br> 789</span>']
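One way to read this result: the engine scans left to right and keeps the first starting position from which the whole pattern can succeed. Laziness only keeps the match short from that starting point, and . is free to match the characters of </span>, so the first .*? can run across the end of the first span to reach the <br>. The sketch below is one possible workaround, not necessarily the best one, and it assumes spans are not nested. It uses a negative lookahead so that the wildcard cannot cross a span tag; the name pattern is only illustrative.

```python
import re

xx = '<div> <span>123</span> <span>456 <br> 789</span> </div>'

# (?:(?!</?span>).)*? matches one character at a time, but only while the
# text at that position does not start a <span> or </span> tag.
# Assumption: spans are not nested in the input.
pattern = re.compile(r'<span>(?:(?!</?span>).)*?<br>(?:(?!</?span>).)*?</span>')

matches = pattern.findall(xx)
print(matches)  # ['<span>456 <br> 789</span>']
```

With this pattern, the attempt starting at the first <span> fails because the wildcard has to stop before </span> without having found a <br>, so the engine moves on to the second span.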
Please explain why I am seeing this behavior.