Split a HTML document at tag - Python

Question

What would be the best way to split an HTML document/string based on the occurrence of the <br> tag? My current solution is below, but it seems quite cumbersome and isn't all that easy to read. I also experimented with regexes, but I'm told I should not use regexes to parse HTML.

for i, br in enumerate(soup.findAll('b')):
    line_value = ''
    line_values = []
    next = br.next
    while (next):
        if next and isinstance(next, Tag) and next.name == 'br':
            line_values.append(line_value)
            line_value = ''
        else:
            stripped_text = ''.join(BeautifulSoup(str(next).strip()).findAll(text=True))
            if stripped_text:
                line_value += stripped_text
        next = next.nextSibling
    print line_values

Here's a sample of the HTML I'm parsing:

<p><font size="1" color="#800000"><b>09:00
  <font> - </font>
  11:00
  <br>
  CE4817
  <font> - </font>LAB <font>- </font>
  2A
  <br>
   B2043 B2042
  <br>

  Wks:1-13
  </b></font>
  </p>

And the current results of my code:

[u'09:00 - 11:00', u'CE4817 - LAB- 2A', u'B2043 B2042']
[u'11:00 - 12:00', u'CE4607 - TUT- 3A', u'A1054']

A clarification: do you need to split the HTML document at a given tag, or just remove all tags from the input? - Gabber, Sep 25 '12 at 14:05
I need to split on the occurrence of the br tag (or another specified tag) - stephenfin, Sep 25 '12 at 14:55

score 0 · Answer 1 · edited May 23 '17 at 11:43

To split with regexes

import re
p = re.compile(r'<br>')
filter(None, p.split(yourString))

Then you can remove the other HTML tags from each of the strings in the returned list.
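As a rough sketch of the two steps above (split on the tag, then strip the remaining tags from each piece), something like the following could work. It is written for Python 3, and the function name `split_on_br`, the slightly broader `<br\s*/?>` pattern, and the crude `<[^>]+>` tag stripper are my own assumptions rather than anything from the question. It is only reasonable for small, predictable fragments like the sample.

```python
import re

# Sample fragment copied from the question.
SAMPLE = '''<p><font size="1" color="#800000"><b>09:00
  <font> - </font>
  11:00
  <br>
  CE4817
  <font> - </font>LAB <font>- </font>
  2A
  <br>
   B2043 B2042
  <br>

  Wks:1-13
  </b></font>
  </p>'''


def split_on_br(html):
    """Split html at <br> tags and return the non-empty text of each piece."""
    # Assumption: also accept <br/> and <br />, in any letter case.
    parts = re.split(r'<br\s*/?>', html, flags=re.IGNORECASE)
    lines = []
    for part in parts:
        # Crude tag removal; this is not a real HTML parser.
        text = re.sub(r'<[^>]+>', '', part)
        # Collapse runs of whitespace (including newlines) to single spaces.
        text = ' '.join(text.split())
        if text:
            lines.append(text)
    return lines


print(split_on_br(SAMPLE))
```

Unlike the loop in the question, this also keeps the text after the last <br> (the "Wks:1-13" line), and it normalises whitespace, so the spacing around the dashes differs slightly from the question's output.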

You can either use an existing function, as in "Strip html from strings in python", or check my answer to the question "Stripping HTML tags without using HtmlAgilityPack".

Check also this answer: "RegEx match open tags except XHTML self-contained tags"

You should really use an HTML parser to accomplish your task.
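For the parser route, here is a minimal sketch using the standard library's `html.parser` (Python 3) instead of BeautifulSoup, so that it has no third-party dependency. The class name `TagSplitter` and the helper `split_at_tag` are hypothetical names of my own. The idea is to collect text until the chosen tag appears, then start a new line.

```python
from html.parser import HTMLParser

# Sample fragment copied from the question.
SAMPLE = '''<p><font size="1" color="#800000"><b>09:00
  <font> - </font>
  11:00
  <br>
  CE4817
  <font> - </font>LAB <font>- </font>
  2A
  <br>
   B2043 B2042
  <br>

  Wks:1-13
  </b></font>
  </p>'''


class TagSplitter(HTMLParser):
    """Collects document text, starting a new line at every split_tag."""

    def __init__(self, split_tag='br'):
        super().__init__()
        self.split_tag = split_tag
        self.lines = []
        self._chunks = []

    def _flush(self):
        # Join the collected text and collapse whitespace to single spaces.
        text = ' '.join(''.join(self._chunks).split())
        if text:
            self.lines.append(text)
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names; <br/> is also routed here.
        if tag == self.split_tag:
            self._flush()

    def handle_data(self, data):
        self._chunks.append(data)

    def close(self):
        super().close()
        self._flush()  # keep whatever follows the last split tag


def split_at_tag(html, split_tag='br'):
    parser = TagSplitter(split_tag)
    parser.feed(html)
    parser.close()
    return parser.lines


print(split_at_tag(SAMPLE))
```

Passing a different `split_tag` covers the "or another specified tag" case from the comments. All other tags are simply ignored, so only their text content ends up in the result.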

Stephan · Answer 2 · 2012-09-24T15:30:19.763

0

Try this:

Regex

<p><font size="1" color="#800000"><b>(\d{2}:\d{2}).*?(\d{2}:\d{2}).*?(\w{2}\d{4}).*?<font> - </font>(\w+)\s*<font>- </font>\s*(\d\w)\s*<br>\s*(\w\d{4}\s*\w\d{4})\s*<br>[\s\S]*?</p>

Mode

m - multiline

This will work as long as the structure of the html code doesn't change.
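For anyone trying the pattern above from Python, here is a small sketch of how it could be applied (Python 3; the names `PATTERN` and `extract_fields` are mine). One caveat, as far as I can tell: in Python's `re` module the multiline flag only changes how `^` and `$` behave, so for `.*?` to cross the line breaks in the sample the pattern seems to need `re.DOTALL` instead.

```python
import re

# Sample fragment copied from the question.
SAMPLE = '''<p><font size="1" color="#800000"><b>09:00
  <font> - </font>
  11:00
  <br>
  CE4817
  <font> - </font>LAB <font>- </font>
  2A
  <br>
   B2043 B2042
  <br>

  Wks:1-13
  </b></font>
  </p>'''

# The regex from the answer, split across lines only for readability.
PATTERN = (
    r'<p><font size="1" color="#800000"><b>(\d{2}:\d{2}).*?(\d{2}:\d{2})'
    r'.*?(\w{2}\d{4}).*?<font> - </font>(\w+)\s*<font>- </font>\s*(\d\w)'
    r'\s*<br>\s*(\w\d{4}\s*\w\d{4})\s*<br>[\s\S]*?</p>'
)


def extract_fields(html):
    """Return the six captured groups as a tuple, or None if nothing matches."""
    # re.DOTALL lets '.' match newlines, which the sample contains.
    match = re.search(PATTERN, html, flags=re.DOTALL)
    return match.groups() if match else None


print(extract_fields(SAMPLE))
```

On the sample this yields the start time, end time, module code, class type, group, and the two rooms as separate fields. An entry with a single room, such as the second result in the question, would not match, because the last group expects two room codes.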

edited Sep 24 '12 at 15:30

answered Sep 24 '12 at 15:22
