First, I am painfully aware that parsing HTML with regex is not "good form". However, I am dealing with poorly formed HTML that does not validate when parsed with tools such as lxml.

My goal is to select only the span elements that contain a br element. Below is my attempt:

Set up the sample input:

import re
xx = '<div> <span>123</span> <span>456 <br> 789</span>  </div>'

This identifies the two spans properly, but only when the ? is present. I don't understand why this is the case.

re.findall('<span>.*?</span>', xx)
['<span>123</span>', '<span>456 <br> 789</span>']    
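
For comparison, here is a minimal sketch of the same call with and without the ?, using the sample string from above. The variable names greedy and lazy are only illustrative. Without the ?, the .* is greedy: it runs to the end of the string and backs off only as far as the last </span>.

```python
import re

xx = '<div> <span>123</span> <span>456 <br> 789</span>  </div>'

# Greedy: .* grabs as much as it can, then backs off only as far as the
# last </span>, so both spans end up inside a single match.
greedy = re.findall('<span>.*</span>', xx)

# Lazy: .*? grows one character at a time and stops at the first </span>.
lazy = re.findall('<span>.*?</span>', xx)

print(greedy)  # ['<span>123</span> <span>456 <br> 789</span>']
print(lazy)    # ['<span>123</span>', '<span>456 <br> 789</span>']
```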

I would have thought this would select only the span containing the br tag. Instead, it matches from the opening tag of the first span to the closing tag of the last span, so only one entry is returned.

re.findall('<span>.*?<br>.*?</span>', xx)
['<span>123</span> <span>456 <br> 789</span>']

Please explain why I am seeing this behavior.

pault
Alex
  • Use a parser: [H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – ctwheels Mar 23 '18 at 20:11
  • @ctwheels that was my first thought too, but OP said a parser doesn't work because the text does not validate. – pault Mar 23 '18 at 20:12
  • There are parsers that can work with invalid markup. Where `lxml` may have failed, another parser may be adequate for the case. Especially given what is being extracted (at least in the given example) IS valid markup. – sytech Mar 23 '18 at 20:15
  • Use beautifulsoup – MK. Mar 23 '18 at 20:17
  • @MK. - I just found that package. I am going to try it. Hope it works. – Alex Mar 23 '18 at 20:19
  • @pault It's just a matter of seeing numbers at left side of a comment – revo Mar 23 '18 at 20:23
  • @Alex oh it will. – MK. Mar 23 '18 at 20:30
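
As a rough illustration of the "use a lenient parser" suggestion in the comments, here is a minimal sketch that uses Python's standard-library html.parser instead of BeautifulSoup, purely to avoid a third-party dependency. The class name SpansWithBr and its attributes are hypothetical, and the sketch assumes spans are not nested inside one another.

```python
from html.parser import HTMLParser


class SpansWithBr(HTMLParser):
    """Collect the text of each <span> that contains a <br> (no nested spans)."""

    def __init__(self):
        super().__init__()
        self.in_span = False   # currently inside a <span>?
        self.has_br = False    # has the current span contained a <br>?
        self.text = []         # text pieces of the current span
        self.found = []        # text of every span that had a <br>

    def handle_starttag(self, tag, attrs):
        if tag == 'span':
            self.in_span, self.has_br, self.text = True, False, []
        elif tag == 'br' and self.in_span:
            self.has_br = True

    def handle_data(self, data):
        if self.in_span:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == 'span' and self.in_span:
            if self.has_br:
                self.found.append(''.join(self.text))
            self.in_span = False


xx = '<div> <span>123</span> <span>456 <br> 789</span>  </div>'
parser = SpansWithBr()
parser.feed(xx)
parser.close()
print(parser.found)  # ['456  789']
```

html.parser does not validate its input, so it may get further than lxml on malformed markup, though how it copes with any particular broken page is not guaranteed.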

1 Answer

Depending on your other requirements, you could do something like

re.findall('<span>[^<]*<br>.*?</span>', xx)

to match only the span with a <br>: [^<]* cannot cross a tag, so the match has to start at the <span> directly in front of the <br>. In general, though, use a parser, as the comments suggest.
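
To see why the original pattern starts at the first span, it may help to run the patterns side by side. The sketch below uses the sample string from the question. The tempered pattern is an alternative workaround, not part of the suggestion above, and it shares the usual fragility of regex on HTML.

```python
import re

xx = '<div> <span>123</span> <span>456 <br> 789</span>  </div>'

# The engine tries start positions left to right. From the first <span>,
# the lazy .*? keeps expanding (across </span> and the next <span>) until
# it reaches a <br>, so the match begins at the first span.
m = re.search('<span>.*?<br>.*?</span>', xx)
print(m.group())  # '<span>123</span> <span>456 <br> 789</span>'

# The pattern suggested above: [^<]* cannot step over any tag.
print(re.findall('<span>[^<]*<br>.*?</span>', xx))

# A tempered alternative: "." may not step onto a closing </span>,
# so a match cannot leave the span it started in.
tempered = re.compile(r'<span>(?:(?!</span>).)*?<br>.*?</span>')
print(tempered.findall(xx))  # ['<span>456 <br> 789</span>']
```

Lazy quantifiers only prefer the shortest extension from a given starting point; they do not make the engine pick a later starting point when an earlier one can still succeed.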

matli