What would be the best way to split an HTML document/string based on the occurrence of the <br> tag? I have given my current solution below, but it seems quite cumbersome and isn't all that easy to read. I also experimented with regexes, but I'm told I should not use regexes to parse HTML.

from BeautifulSoup import BeautifulSoup, Tag  # BeautifulSoup 3 on Python 2

# soup is the already-parsed document
for b in soup.findAll('b'):
    line_value = ''
    line_values = []
    node = b.next
    while node:
        if isinstance(node, Tag) and node.name == 'br':
            # a <br> ends the current line
            line_values.append(line_value)
            line_value = ''
        else:
            # keep only the text of this node, dropping any nested tags
            stripped_text = ''.join(BeautifulSoup(str(node).strip()).findAll(text=True))
            if stripped_text:
                line_value += stripped_text
        node = node.nextSibling
    print line_values

Here's a sample of the HTML I'm parsing:

<p><font size="1" color="#800000"><b>09:00
  <font> - </font>
  11:00
  <br>
  CE4817
  <font> - </font>LAB <font>- </font>
  2A
  <br>
   B2043 B2042
  <br>

  Wks:1-13
  </b></font>
  </p>

And the current results of my code:

[u'09:00 - 11:00', u'CE4817 - LAB- 2A', u'B2043 B2042']
[u'11:00 - 12:00', u'CE4607 - TUT- 3A', u'A1054']
stephenfin

2 Answers

To split with a regex:

import re
p = re.compile(r'<br>')
filter(None, p.split(yourString))

Then you can remove the remaining HTML tags from each of the strings in the returned list.

You can either use an existing function, as in "Strip html from strings in python", or check my answer to the question "Stripping HTML tags without using HtmlAgilityPack".
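The split-then-strip idea can be sketched roughly as follows. This is only an illustration, not the code from the linked answers: the names `BR_PATTERN`, `TAG_PATTERN` and `split_on_br` are made up here, the tag-stripping regex is deliberately naive, and the `<br>` pattern is widened (as an assumption about what the input may contain) to also accept `<br/>` and `<BR>`.

```python
import re

# Assumption: the input may also contain <br/> or <BR>, so match those too.
BR_PATTERN = re.compile(r'<br\s*/?>', re.IGNORECASE)
# Naive tag stripper: removes anything that looks like <...>.
TAG_PATTERN = re.compile(r'<[^>]+>')


def split_on_br(html):
    """Split html on <br>, drop the remaining tags, collapse whitespace."""
    lines = []
    for chunk in BR_PATTERN.split(html):
        text = TAG_PATTERN.sub('', chunk)
        text = ' '.join(text.split())  # collapse newlines and runs of spaces
        if text:  # skip chunks that contained only tags or whitespace
            lines.append(text)
    return lines


sample = """<p><font size="1" color="#800000"><b>09:00
  <font> - </font>
  11:00
  <br>
  CE4817
  <font> - </font>LAB <font>- </font>
  2A
  <br>
   B2043 B2042
  <br>

  Wks:1-13
  </b></font>
  </p>"""

print(split_on_br(sample))
# ['09:00 - 11:00', 'CE4817 - LAB - 2A', 'B2043 B2042', 'Wks:1-13']
```

Unlike the code in the question, this also returns the text after the last `<br>` (`Wks:1-13`), and it normalizes whitespace, so `LAB- 2A` comes out as `LAB - 2A`.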

Check also this answer: RegEx match open tags except XHTML self-contained tags

You should really use an HTML parser to accomplish your task.
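For comparison, here is a minimal parser-based sketch. It swaps in the standard library's `html.parser.HTMLParser` rather than BeautifulSoup, purely so that the example has no third-party dependency. The class name `BrSplitter` and the helper `lines_between_br` are hypothetical, and the sketch collects text across the whole input rather than per `<b>` element.

```python
from html.parser import HTMLParser


class BrSplitter(HTMLParser):
    """Collect text content, starting a new line at every <br>."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._parts = []

    def _flush(self):
        # Join the text seen so far and collapse whitespace.
        text = ' '.join(''.join(self._parts).split())
        if text:
            self.lines.append(text)
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names; <br/> also ends up here.
        if tag == 'br':
            self._flush()

    def handle_data(self, data):
        self._parts.append(data)

    def close(self):
        super().close()
        self._flush()  # keep whatever follows the last <br>


def lines_between_br(html):
    parser = BrSplitter()
    parser.feed(html)
    parser.close()
    return parser.lines


print(lines_between_br('<b>09:00 <font> - </font> 11:00<br>CE4817<br/>B2043 B2042</b>'))
# ['09:00 - 11:00', 'CE4817', 'B2043 B2042']
```

The parser decides what counts as a tag, so variations such as `<BR>`, `<br/>` or attributes on the tag do not need special handling in a pattern.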

Gabber

Try this:

Regex

<p><font size="1" color="#800000"><b>(\d{2}:\d{2}).*?(\d{2}:\d{2}).*?(\w{2}\d{4}).*?<font> - </font>(\w+)\s*<font>- </font>\s*(\d\w)\s*<br>\s*(\w\d{4}\s*\w\d{4})\s*<br>[\s\S]*?</p>

Mode

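s - dotall (re.DOTALL in Python), so that . also matches the newlines between the fields. Multiline mode (m) only changes how ^ and $ behave, and this pattern uses neither.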

This will work as long as the structure of the HTML doesn't change.
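In Python, the pattern above might be applied like this. It is a sketch only: `ROW_PATTERN` and the variable names given to the six groups (`start`, `end`, `module`, `kind`, `group`, `rooms`) are guesses at what the fields mean, not something the pattern itself defines. The `re.DOTALL` flag is passed because `.` has to match across the line breaks in the sample.

```python
import re

# The pattern from this answer, unchanged; DOTALL lets "." cross newlines.
ROW_PATTERN = re.compile(
    r'<p><font size="1" color="#800000"><b>(\d{2}:\d{2}).*?(\d{2}:\d{2})'
    r'.*?(\w{2}\d{4}).*?<font> - </font>(\w+)\s*<font>- </font>\s*(\d\w)'
    r'\s*<br>\s*(\w\d{4}\s*\w\d{4})\s*<br>[\s\S]*?</p>',
    re.DOTALL,
)

sample = """<p><font size="1" color="#800000"><b>09:00
  <font> - </font>
  11:00
  <br>
  CE4817
  <font> - </font>LAB <font>- </font>
  2A
  <br>
   B2043 B2042
  <br>

  Wks:1-13
  </b></font>
  </p>"""

match = ROW_PATTERN.search(sample)
if match:
    # Hypothetical field names, chosen only for readability.
    start, end, module, kind, group, rooms = match.groups()
    print(start, end, module, kind, group, rooms)
    # 09:00 11:00 CE4817 LAB 2A B2043 B2042
```

Note that the last capturing group expects exactly two room codes, so an entry with a single room (such as the `A1054` row in the question's output) would not match without loosening that part of the pattern.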

Stephan