How to parse multiple html with Python?

Question

I have a document like this:

 TEXT
 TEXT
 <ul>
  <li>1</li>
  <ul>
   <li>2</li>
   <li>3</li>
  </ul>
  <li>4</li>
 </ul>
 ANOTHER TEXT

What can I use to transform it into:

TEXT
TEXT
* 1
** 2
** 3
* 4
ANOTHER TEXT

I need to parse only the ul/li parts; TEXT (which contains no ul/li) should be left intact, without any changes.

I wrote a parser:

import re

def uls(html):
    # Wrap each <li> body in a pseudo-tag so the tag scan below picks it up.
    html = re.sub(r'<li>(.*?)</li>', r'<li><!!\1></li>', html, flags=re.DOTALL)
    ret_text = []

    ul_level = 0
    text = ''

    pattern = re.compile(r'(<.*?>)')
    for tag in re.findall(pattern, html):
        if tag == '<ul>':
            ul_level += 1
        if tag == '</ul>':
            ul_level -= 1
            if ul_level == 0:
                # The outermost list is closed: store its converted text.
                ret_text.append(text)
                text = ''
        if re.search(r'<!!(.*?)>', tag, re.DOTALL):
            # flags must be passed by keyword; the 4th positional argument of re.sub is count.
            text = text + ('*' * ul_level) + re.sub(r'<!!(.*?)>', r' \1\n', tag, flags=re.DOTALL)

    return ret_text

It produces the correct array, but how can I replace

...

with this code?

score 2 · Answer 1 · answered Mar 30 '22 at 13:46

First and foremost, don't parse HTML with regex; use a proper parser. Second, even with a proper parser it's going to be difficult to get to your exact expected output. The following (admittedly somewhat hackish) should get you close enough...

from lxml import etree #you'll have to read up on lxml/xpath...

ht = """<html>TEXT1
 TEXT2
 <ul>
  <li>1</li>
  <ul>
   <li>2</li>
   <li>3</li>
  </ul>
  <li>4</li>
 </ul>
 ANOTHER TEXT3
 </html>"""

doc = etree.fromstring(ht)
tree = etree.ElementTree(doc)

txts = ['text', 'tail']
for elem in doc.xpath('//*'):
    for txt in txts:
        # .text/.tail may be None, so fall back to an empty string
        target = (getattr(elem, txt) or '').strip()
        if target:
            #the next line counts the number of tiers and prints the appropriate number of '*'s:
            print(tree.getelementpath(elem).count('/') * "*", target)

Output:

 TEXT1
 TEXT2
 ANOTHER TEXT3
* 1
** 2
** 3
* 4

As I said, pretty close.
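
If the surrounding text has to stay in document order, one possible alternative is an event-based pass with the standard library's html.parser instead of lxml. The following is only a sketch under a few assumptions: nested <ul> elements are siblings of <li> (as in the question), character references such as &amp; are decoded by the parser, and comments are dropped. The names ListToStars and lists_to_stars are made up for this example.

```python
from html.parser import HTMLParser


class ListToStars(HTMLParser):
    """Turn <ul>/<li> markup into '*' bullets and copy everything else through."""

    def __init__(self):
        super().__init__()
        self.depth = 0             # current <ul> nesting level
        self.in_li = False         # True between <li> and </li>
        self.skip_newline = False  # True right after the outermost </ul>
        self.out = []              # output fragments, joined at the end

    def handle_starttag(self, tag, attrs):
        if tag == "ul":
            if self.depth == 0 and self.out:
                # Drop the indentation that preceded the outermost <ul>.
                self.out[-1] = self.out[-1].rstrip(" \t")
            self.depth += 1
        elif tag == "li":
            self.in_li = True
            self.out.append("*" * self.depth + " ")
        else:
            self.out.append(self.get_starttag_text())  # keep other tags as-is

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())      # e.g. <br/>

    def handle_endtag(self, tag):
        if tag == "ul":
            self.depth = max(self.depth - 1, 0)
            self.skip_newline = self.depth == 0
        elif tag == "li":
            self.in_li = False
            self.out.append("\n")
        else:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.in_li:
            self.out.append(data)
        elif self.depth == 0:
            if self.skip_newline and data.startswith("\n"):
                data = data[1:]    # the last bullet already ended the line
            self.skip_newline = False
            self.out.append(data)
        # Whitespace between list tags (depth > 0, outside <li>) is dropped.


def lists_to_stars(html_text):
    parser = ListToStars()
    parser.feed(html_text)
    parser.close()                 # flush any text still buffered
    return "".join(parser.out)


sample = """TEXT
TEXT
<ul>
 <li>1</li>
 <ul>
  <li>2</li>
  <li>3</li>
 </ul>
 <li>4</li>
</ul>
ANOTHER TEXT"""

print(lists_to_stars(sample))
```

For the sample above this prints TEXT, TEXT, the four starred items and ANOTHER TEXT, in that order. Tracking the depth in a counter rather than deriving it from an element path is what keeps the text and the list items interleaved as they appear in the input.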
