
I've got a bunch of HTML pages in which I'd like to convert CSS-formatted text snippets into standard HTML tags, e.g. `<span class="bold">some text</span>` should become `<b>some text</b>`.

I'm stuck on nested span fragments:

<span class="italic"><span class="bold">XXXXXXXX</span></span>
<span class="italic">some text<span class="bold">nested text<span class="underline">deep nested text</span></span></span>

I'd like to convert the fragment using Python's regex library. What would be the optimal strategy to regex search-&-replace the above input?

masroore
  • Why must it be done by regular expression? – hwnd Dec 10 '13 at 05:16
  • It's just a personal preference. I know it could be done with a recursive plain string search, but somehow I find regex solutions more elegant. – masroore Dec 10 '13 at 05:18
  • The optimal strategy would really be to use something other than regular expressions, which are terribly underpowered for this. [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/) is the most popular go-to solution for parsing HTML in Python. – Tim Pierce Dec 10 '13 at 05:19
  • It probably won't be so elegant. To do tag balancing, you need something stronger than regex. If you still want to use regular expressions, you'll need to use a loop. – Michael Dec 10 '13 at 05:20
  • @hwnd I'm using the following regex pattern: `(?P[^[]+)` . I'm replacing the tags based on `css_class` - all the CSS classes have their replacement tags in a `dict` object – masroore Dec 10 '13 at 05:22
  • @Michael Thanks for the heads up. I found [this C#](http://stackoverflow.com/a/3921471/1656343) solution. Wondering how to translate it to my python code... – masroore Dec 10 '13 at 05:26
  • The ultimate html-regex rant is [here](http://stackoverflow.com/a/1732454/3030305). – John1024 Dec 10 '13 at 05:30
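
To illustrate the "use a loop" suggestion from the comments, here is a minimal sketch that is not taken from the thread. It assumes every span is written exactly as `<span class="...">` with a single class, and it borrows the class-to-tag mapping from the answer below. The names `CLASS_TO_TAG`, `INNERMOST_SPAN` and `convert_spans` are made up for the example. Each pass rewrites only the innermost spans (those with no other `<span` inside), and the loop repeats until a pass changes nothing.

```python
import re

# Mapping of CSS class to replacement tag (same pairs as the answer's dict).
CLASS_TO_TAG = {"underline": "u", "italic": "i", "bold": "b"}

# Matches only an innermost span: the body may not contain another "<span".
INNERMOST_SPAN = re.compile(
    r'<span class="(?P<css_class>[^"]+)">'
    r'(?P<body>(?:(?!<span\b).)*?)'
    r'</span>',
    re.DOTALL,
)


def _replace(match):
    """Swap one innermost span for its tag; leave unknown classes untouched."""
    tag = CLASS_TO_TAG.get(match.group("css_class"))
    if tag is None:
        return match.group(0)
    return "<{0}>{1}</{0}>".format(tag, match.group("body"))


def convert_spans(html):
    """Rewrite spans from the inside out until a pass changes nothing."""
    while True:
        converted = INNERMOST_SPAN.sub(_replace, html)
        if converted == html:
            return converted
        html = converted


print(convert_spans('<span class="italic"><span class="bold">XXXXXXXX</span></span>'))
```

A span with an unknown class is left as it is, which also stops any span wrapped around it from being converted. Markup that deviates from the assumed form (extra attributes, single quotes, several classes) is not matched at all, which is one reason the comments recommend a real parser.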

1 Answer


My solution using lxml and cssselect and a bit of Python:

#!/usr/bin/env python

import cssselect  # noqa
from lxml.html import fromstring


html = """
<span class="italic"><span class="bold">XXXXXXXX</span></span>
<span class="italic">some text<span class="bold">nested text<span class="underline">deep nested text</span></span></span>
"""

class_to_style = {
    "underline": "u",
    "italic": "i",
    "bold": "b",
}

output = []
doc = fromstring(html)
spans = doc.cssselect("span")
for span in spans:
    if span.attrib.get("class"):
        output.append("<{0}>{1}</{0}>".format(class_to_style[span.attrib["class"]], span.text or ""))
print("".join(output))

Output:

<i></i><b>XXXXXXXX</b><i>some text</i><b>nested text</b><u>deep nested text</u>

NB: This is a naive solution and does not produce the correct output, because it flattens the nesting. To fix it you'd have to keep a stack of open tags and close each one when its matching </span> is reached.
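
One way to keep track of the open tags is sketched below. It is not the answer's code: it swaps lxml and cssselect for the standard library's `html.parser.HTMLParser` so that it runs without extra packages, and the names `SpanRewriter` and `rewrite_spans` are invented for the example. Every `<span>` pushes its replacement tag (or `None` for an unmapped class) onto a stack, and every `</span>` pops it, so the closing tags come out in the right order.

```python
from html.parser import HTMLParser

# Mapping of CSS class to replacement tag (same pairs as class_to_style).
CLASS_TO_TAG = {"underline": "u", "italic": "i", "bold": "b"}


class SpanRewriter(HTMLParser):
    """Re-emits the markup, replacing mapped spans with plain tags."""

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.open_spans = []  # stack: replacement tag or None per open span

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            new_tag = CLASS_TO_TAG.get(dict(attrs).get("class"))
            self.open_spans.append(new_tag)
            if new_tag:
                self.out.append("<%s>" % new_tag)
                return
        # Anything else (or an unmapped span) is copied through verbatim.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through verbatim.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "span" and self.open_spans:
            new_tag = self.open_spans.pop()
            if new_tag:
                self.out.append("</%s>" % new_tag)
                return
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)


def rewrite_spans(html):
    parser = SpanRewriter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(rewrite_spans('<span class="italic">some text<span class="bold">nested text</span></span>'))
```

The sketch drops comments, doctypes and processing instructions because it defines no handlers for them, so it would need extending before being run over whole pages.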

James Mills
  • Awesome! I was unaware of cssselect until now! Thanks @James Mills! – masroore Dec 10 '13 at 05:41
  • Oops! It doesn't work as expected. The output should be: `<i><b>XXXXXXXX</b></i><i>some text<b>nested text<u>deep nested text</u></b></i>` – masroore Dec 10 '13 at 05:44
  • Yes my solution is naive at best. You'll have to keep a queue of open tags and close them at the end. I'm sure you can do this? :) Updated my answer to reflect this. (*Have to leave you a little work!*) – James Mills Dec 10 '13 at 05:46
  • You're right. I'm exploring cssselect & spyda. Thanks for the heads up! – masroore Dec 10 '13 at 05:47