Nested string replace with regex in Python

Question

I've got a bunch of HTML pages, in which I'd like to convert CSS-formatted text snippets into standard HTML tags. e.g <span class="bold">some text</span> will become <b>some text</b>

I'm stuck at nested span fragments:

<span class="italic"><span class="bold">XXXXXXXX</span></span>
<span class="italic">some text<span class="bold">nested text<span class="underline">deep nested text</span></span></span>

I'd like to convert the fragment using Python's regex library. What would be the optimal strategy to regex search-&-replace the above input?

It's just a personal preference. I know it could be done with recusive plain string search... But somehow I find regex solutions to be more elegant... — masroore, Dec 10 '13 at 05:18
The optimal strategy would really be to use something other than regular expressions, which are terribly underpowered for this. [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/) is the most popular go-to solution for parsing HTML in Python. — Tim Pierce, Dec 10 '13 at 05:19
It probably won't be so elegant. To do tag balancing, you need something stronger than regex. If you still want to use regular expressions, you'll need to use a loop. — Michael, Dec 10 '13 at 05:20
@hwnd I'm using the following regex pattern: `(?P[^[]+)` . I'm replacing the tags based on `css_class` - all the CSS classes have their replacement tags in a `dict` object — masroore, Dec 10 '13 at 05:22
@Michael Thanks for the heads up. I found [this C#](http://stackoverflow.com/a/3921471/1656343) solution. Wondering how to translate it to my python code... — masroore, Dec 10 '13 at 05:26
The ultimate html-regex rant is [here](http://stackoverflow.com/a/1732454/3030305). — John1024, Dec 10 '13 at 05:30

James Mills · Answer 1 · 2013-12-10T05:46:55.620

1

My solution using lxml and cssselect and a bit of Python:

#!/usr/bin/env python

import cssselect  # noqa
from lxml.html import fromstring


html = """
<span class="italic"><span class="bold">XXXXXXXX</span></span>
<span class="italic">some text<span class="bold">nested text<span class="underline">deep nested text</span></span></span>
"""

class_to_style = {
    "underline": "u",
    "italic": "i",
    "bold": "b",
}

output = []
doc = fromstring(html)
spans = doc.cssselect("span")
for span in spans:
    if span.attrib.get("class"):
        output.append("<{0}>{1}</{0}>".format(class_to_style[span.attrib["class"]], span.text or ""))
print "".join(output)

Output:

<i></i><b>XXXXXXXX</b><i>some text</i><b>nested text</b><u>deep nested text</u>

NB: This is a naive solution and does not produce the correct output as you'd have to keep a queue of open tags and close them at the end.

edited Dec 10 '13 at 05:46

answered Dec 10 '13 at 05:35

James Mills

18,669
3
49
62

1

Awesome! I was unaware of cssselect until now! Thanks @James Mills ! – masroore Dec 10 '13 at 05:41
Oops! It doesn't work as expected.. the output should be: `XXXXXXXXsome textnested textdeep nested text` – masroore Dec 10 '13 at 05:44
Yes my solution is naive at best. You'll have to keep a queue of open tags and close them at the end. I'm sure you can do this? :) Updated my answer to reflect this. (*Have to leave you a little work!*) – James Mills Dec 10 '13 at 05:46
You're right I'm exploring csselect & spyda. Thanks for the heads up! – masroore Dec 10 '13 at 05:47

Nested string replace with regex in Python

1 Answers1