Converting HTML list to tabs (i.e. indentation)

Question

Have worked in dozens of languages but new to Python.

My first (maybe second) question here, so be gentle...

Trying to efficiently convert HTML-like markup text to wiki format (specifically, Linux Tomboy/GNote notes to Zim) and have gotten stuck on converting lists.

For a 2-level unordered list like this...

First level
- Second level

Tomboy/GNote uses something like...

<list><list-item>First level<list><list-item>Second level</list-item></list></list-item></list>

However, the Zim personal wiki wants that to be...

* First level
  * Second level

... with leading tabs.

I've explored the regex module functions re.sub(), re.match(), re.search(), etc. and found the cool Python ability to code repeating text as...

 count * "text"

Thus, it looks like there should be a way to do something like...

 newnote = re.sub("<list>", LEVEL * "\t", oldnote)

Where LEVEL is the ordinal (occurrence) of <list> in the note. It would thus be 0 for the first <list> encountered, 1 for the second, etc.

LEVEL would then be decremented each time </list> was encountered.

<list-item> tags are converted to the asterisk for the bullet (preceded by newline as appropriate) and </list-item> tags dropped.

Finally... the question...

How do I get the value of LEVEL and use it as a tabs multiplier?

Off the top of my head: use an HTML/XML parser such as BeautifulSoup or xml.dom.minidom, then use a recursive function or a stack/queue to track opening and closing tags and count tab levels. Basically, you want to convert the markup text into usable data, then convert this code-friendly data to your other style of markup. — Joel Cornett, Apr 15 '12 at 10:24
Don't use `re`. It's not very effective at dealing with nested tags. — Joel Cornett, Apr 15 '12 at 10:25
probably relevant: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags ;) — mensi, Apr 15 '12 at 10:48
I'll study the html2text.py program for techniques, but what I'm converting isn't actually HTML. — DocSalvager, Apr 15 '12 at 12:08
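
The parser-plus-counter idea from the comments above can be sketched roughly as follows. This is only an illustration, assuming Python 3 and the standard-library html.parser module; the names ListToZim and convert are made up for this sketch and do not come from Tomboy, GNote, or Zim. It also assumes each list item's text comes before any nested list.

```python
from html.parser import HTMLParser


class ListToZim(HTMLParser):
    """Hypothetical helper: nested <list>/<list-item> markup -> tab-indented bullets."""

    def __init__(self):
        super().__init__()
        self.depth = -1   # -1 means "outside any list"; the first <list> makes it 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if tag == "list":
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == "list":
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Text outside of any list is ignored in this sketch.
        if text and self.depth >= 0:
            self.lines.append("\t" * self.depth + "* " + text)


def convert(note):
    """Return the note's lists as one bullet per line, indented with tabs."""
    parser = ListToZim()
    parser.feed(note)
    parser.close()
    return "\n".join(parser.lines)


note = ("<list><list-item>First level<list><list-item>Second level"
        "</list-item></list></list-item></list>")
print(convert(note))
```

The depth counter plays the role of LEVEL in the question: it goes up on every <list> and down on every </list>, so each item's text is prefixed with that many tabs.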

score 4 · Accepted Answer · answered Apr 15 '12 at 12:19

You should really use an xml parser to do this, but to answer your question:

import re

def next_tag(s, tag):
    i = -1
    while True:
        try:
            i = s.index(tag, i+1)
        except ValueError:
            return
        yield i

a = "<list><list-item>First level<list><list-item>Second level</list-item></list></list-item></list>"

a = a.replace("<list-item>", "* ")

# Replace one <list> per iteration; LEVEL is how many have been replaced so far
for LEVEL, ind in enumerate(next_tag(a, "<list>")):
    a = re.sub("<list>", "\n" + LEVEL * "\t", a, count=1)

a = a.replace("</list-item>", "")
a = a.replace("</list>", "")

print(a)

This will work for your example, and your example ONLY. Use an XML parser. You can use xml.dom.minidom (it's included in Python (2.7 at least), no need to download anything):

import xml.dom.minidom

def parseList(el, lvl=0):
    txt = ""
    indent = "\t" * (lvl)
    for item in el.childNodes:
        # These are the <list-item>s: They can have text and nested <list> tag
        for subitem in item.childNodes:
            if subitem.nodeType == xml.dom.minidom.Element.TEXT_NODE:
                # This is the text before the next <list> tag
                txt += "\n" + indent + "* " + subitem.nodeValue
            else:
                # This is the next list tag, its indent level is incremented
                txt += parseList(subitem, lvl=lvl+1)
    return txt

def parseXML(s):
    doc = xml.dom.minidom.parseString(s)
    return parseList(doc.firstChild)

a = "<list><list-item>First level<list><list-item>Second level</list-item><list-item>Second level 2<list><list-item>Third level</list-item></list></list-item></list></list-item></list>"
print(parseXML(a))

Output:

* First level
    * Second level
    * Second level 2
        * Third level

I think the re solution is going to work for this little conversion program, but I'm definitely hanging onto the xml parser solution for some other things. THANK YOU ALL for such great responses in such a short time. (Programmers DO own the night!) — DocSalvager, Apr 15 '12 at 13:16
@RobertC yes you might be able to use the re solution for this, but I believe it will not work if you have more than 2 levels of nested tags. You might need to change it a bit for it to work. The xml parser solution should work with everything. And you're welcome ;) — jadkik94, Apr 15 '12 at 14:27
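
For what it's worth, one way the regex approach might be adapted to deeper nesting is to give re.sub a callback that keeps its own level counter, which is close to the LEVEL idea in the question. This is a rough sketch only, assuming Python 3 (for nonlocal) and well-formed, properly nested input; tomboy_list_to_zim and replace_tag are hypothetical names.

```python
import re


def tomboy_list_to_zim(note):
    """Hypothetical helper: convert <list>/<list-item> tags to tab-indented bullets."""
    depth = -1  # -1 means "outside any list"; the first <list> makes it 0

    def replace_tag(match):
        nonlocal depth
        tag = match.group(0)
        if tag == "<list>":
            depth += 1
            return ""
        if tag == "</list>":
            depth -= 1
            return ""
        if tag == "<list-item>":
            # New bullet on its own line, indented by the current level
            return "\n" + "\t" * depth + "* "
        return ""  # </list-item> is simply dropped

    # re.sub calls replace_tag once per match, left to right
    converted = re.sub(r"</?list(?:-item)?>", replace_tag, note)
    return converted.lstrip("\n")


note = ("<list><list-item>First level<list><list-item>Second level"
        "</list-item></list></list-item></list>")
print(tomboy_list_to_zim(note))
```

Because the counter is decremented on </list>, sibling items after a nested list return to the right indentation. It still does not validate the markup, so the parser-based solution remains the safer choice for anything irregular.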

Rachid · Answer 2 · 2012-04-15T11:38:05.957

Use Beautiful Soup; it lets you iterate over the tags even if they are custom ones. Very practical for this type of operation.

from BeautifulSoup import BeautifulSoup
tags = "<list><list-item>First level<list><list-item>Second level</list-item></list></list-item></list>"
soup = BeautifulSoup(tags)
print([[item.text for item in list_tag('list-item')] for list_tag in soup('list')])

Output : [[u'First level'], [u'Second level']]

I used a nested list comprehension, but you can use a nested for loop instead:

for list_tag in soup('list'):
    for item in list_tag('list-item'):
        print(item.text)

I hope that helps you.

In my example I used BeautifulSoup 3, but the example should also work with BeautifulSoup 4; only the import changes:

from bs4 import BeautifulSoup

This looks great! I'd vote up but not high enough rep yet. I'm gonna try this. — DocSalvager, Apr 15 '12 at 12:10
