I want to take the text below and assemble it into a list of [text, tag] pairs as shown further down. I know this can be done with regex somehow. Please assist.

Starting html text:

peanut butter1
<ul id="ul0002" list-style="none">peanut butter2
    <li id="ul0002-0001" num="0000">2.0 to 6.0 mg of 17&#x3b2;-estradiol and</li>
    <li id="ul0002-0002" num="0000">0.020 mg of ethinylestradiol;</li>
    <br>
    <li id="ul0002-0003" num="0000">0.25 to 0.30 mg of drospirenone and</li>peanut butter3
</ul>peanut butter4

Desired output:

list = [
    ['peanut butter1', 'no tag'],
    ['peanut butter2', 'ul'],
    ['2.0 to 6.0 mg of 17&#x3b2;-estradiol and', 'li'],
    ['0.020 mg of ethinylestradiol;', 'li'],
    ['<br>', 'no tag'],
    ['0.25 to 0.30 mg of drospirenone and', 'li'],
    ['peanut butter3', 'no tag'],
    ['peanut butter4', 'no tag'],
]
user2104778
  • No! Don't use a regex to parse HTML! http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 You should probably use something like `BeautifulSoup`. – anon582847382 Mar 13 '14 at 18:44
  • Please read: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags answer – Ruben Bermudez Mar 13 '14 at 18:44
  • Regular expressions aren't well suited to parse arbitrarily nested structures. Use a parser instead. – Joel Cornett Mar 13 '14 at 18:57

1 Answer

I concur with the previous comments about parsing HTML. However, just for fun, and assuming the input can be parsed line by line, you can try something like the following:

ss="""
peanut butter1
<ul id="ul0002" list-style="none">peanut butter2
    <li id="ul0002-0001" num="0000">2.0 to 6.0 mg of 17&#x3b2;-estradiol and</li>
    <li id="ul0002-0002" num="0000">0.020 mg of ethinylestradiol;</li>
    <br>
    <li id="ul0002-0003" num="0000">0.25 to 0.30 mg of drospirenone and</li>peanut butter3
</ul>peanut butter4
"""
import re

# Full element on one line, e.g. <li ...>text</li>; group 1 = tag name, group 2 = text
tags = re.compile(r".*?<([^/]\w*?) .*?>(.*?)</\1>")
# Opening tag with attributes, e.g. <ul ...>text; a bare tag such as <br> does not match
start = re.compile(r".*?<([^/]\w*?) .*?>(.*)")
# Any closing tag, e.g. </ul>
end = re.compile(r"</.*?>")
r = []
for s in ss.split("\n"):
    if not s.strip():
        continue  # skip blank lines
    st = start.match(s)
    if st:  # an opening tag exists
        m = tags.match(s)
        if m:  # fully terminated element
            r.append(list(reversed(m.groups())))
            extra = s[m.end():].strip()  # text trailing the closing tag
            if extra:
                r.append([extra, "no tag"])
        else:  # opening tag only
            r.append(list(reversed(st.groups())))
    else:  # no opening tag
        s = end.sub("", s)  # remove closing tags
        r.append([s.strip(), "no tag"])
print("\n".join(str(s) for s in r))
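
For comparison, here is a minimal sketch of the parser-based approach the comments recommend, using the standard library's html.parser rather than BeautifulSoup. The names TextTagCollector, html_to_pairs and VOID_TAGS are made up for this illustration, and it assumes a piece of text belongs to a tag only when it directly follows that tag's opening. Note that it decodes &#x3b2; to the actual character, unlike the desired output in the question.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag (a small subset, assumed sufficient here).
VOID_TAGS = {"br", "hr", "img"}


class TextTagCollector(HTMLParser):
    """Collect [text, tag] pairs (hypothetical helper, not from the answer above)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []
        self.pending_tag = None  # tag opened immediately before the next text

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            # Report void tags such as <br> as their own "no tag" row.
            self.rows.append(["<%s>" % tag, "no tag"])
            self.pending_tag = None
        else:
            self.pending_tag = tag

    def handle_endtag(self, tag):
        # Text after a closing tag is not attributed to any tag.
        self.pending_tag = None

    def handle_data(self, data):
        text = data.strip()
        if text:  # ignore whitespace-only runs between tags
            self.rows.append([text, self.pending_tag or "no tag"])
            self.pending_tag = None


def html_to_pairs(html):
    parser = TextTagCollector()
    parser.feed(html)
    parser.close()  # flush any text still buffered after the last tag
    return parser.rows


sample = """
peanut butter1
<ul id="ul0002" list-style="none">peanut butter2
    <li id="ul0002-0001" num="0000">2.0 to 6.0 mg of 17&#x3b2;-estradiol and</li>
    <li id="ul0002-0002" num="0000">0.020 mg of ethinylestradiol;</li>
    <br>
    <li id="ul0002-0003" num="0000">0.25 to 0.30 mg of drospirenone and</li>peanut butter3
</ul>peanut butter4
"""

pairs = html_to_pairs(sample)
print("\n".join(str(p) for p in pairs))
```

Unlike the line-by-line regex version, this does not depend on where the line breaks fall, although it is still only a sketch and does not track nesting depth.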

Hope this helps!

Yano
  • np. I have used re quite a lot for limited scraping. When the HTML is not well written (e.g. unbalanced tags), it is sometimes more robust than an actual parser, or even BS, and it can be considerably faster. – Yano Mar 14 '14 at 18:40