I have a bunch of improperly formatted Chinese html files. They contain unnecessary spaces and line breaks which will be displayed as extra spaces in the browser. I've written a script using lxml to modify the html files. It works fine on simple tags, but I'm stuck on nested ones. For example:
<p>祝你<span>19</span>岁
生日快乐。</p>
will be displayed in the browser as:
祝你19岁 生日快乐。
Notice the extra space between 岁 and 生日快乐. That is what needs to be deleted. The resulting html should look like this:
<p>祝你<span>19</span>岁生日快乐。</p>
How do I do this?
Note that the nesting (like the span tag) could be arbitrary, but I don't need to consider the content of the nested elements; it should be preserved as is. Only the text in the outer element needs to be formatted.
This is what I've got:
# -*- coding: utf-8 -*-
import lxml.html
import re
s1 = u"""<p>祝你19岁
生日快乐。</p>"""
p1 = lxml.html.fragment_fromstring(s1)
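print(p1.text)  # I get the whole line.
p1.text = re.sub(r"\s+", "", p1.text)
# Elements have no tostring() method; serialize with lxml.html.tostring().
print(lxml.html.tostring(p1, encoding="unicode"))  # spaces are removed.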
s2 = u"""<p>祝你<span>19</span>岁
生日快乐。</p>"""
p2 = lxml.html.fragment_fromstring(s2)
print(p2.text)  # I get "祝你"
print(p2.tail)  # I get None
i = p2.itertext()
print(next(i))  # I get "祝你"
print(next(i))  # I get "19" from <span>
print(next(i))  # I get the tail text, but how do I assemble them back?
print(p2.text_content())  # The whole text, but how do I put <span> back?