Removing extra spaces in Chinese HTML files using lxml

Question

I have a bunch of improperly formatted Chinese HTML files. They contain unnecessary spaces and line breaks, which are displayed as extra spaces in the browser. I've written a script using lxml to modify the HTML files. It works fine on simple tags, but I'm stuck on nested ones. For example:

<p>祝你<span>19</span>岁
    生日快乐。</p>

will be displayed in the browser as:

祝你19岁 生日快乐。

Notice the extra space. This is what needs to be deleted. The resulting HTML should look like this:

<p>祝你<span>19</span>岁生日快乐。</p>

How do I do this?

Note that the nesting (like the span tag) could be arbitrary, but I don't need to consider the content of the nested elements; it should be preserved as it is. Only the text in the outer element needs to be formatted.

This is what I've got:

# -*- coding: utf-8 -*-

import lxml.html
import re

s1 = u"""<p>祝你19岁
    生日快乐。</p>"""
p1 = lxml.html.fragment_fromstring(s1)
print(p1.text)         # I get the whole line.
p1.text = re.sub(r"\s+", "", p1.text)
# tostring() is a module-level function, not an element method.
print(lxml.html.tostring(p1, encoding="unicode"))   # spaces are removed.

s2 = u"""<p>祝你<span>19</span>岁
    生日快乐。</p>"""
p2 = lxml.html.fragment_fromstring(s2)
print(p2.text)     # I get "祝你"
print(p2.tail)     # I get None
i = p2.itertext()
print(next(i))   # I get "祝你"
print(next(i))   # I get "19" from <span>
print(next(i))   # I get the tail text, but how do I assemble them back?
print(p2.text_content())  # The whole text, but how do I put <span> back?

Good question - I don't have an answer off the top of my head, but my best guess would be that you have to walk the tree structure (recursively or iteratively as you prefer), removing the extra spaces. — Marcin, Mar 19 '12 at 10:40

score 2 · Answer 1 · answered Mar 19 '12 at 12:38

>>> from lxml import etree
>>> root = etree.fromstring('<p>祝你<span>19</span>岁\n生日快乐。</p>')
>>> etree.tostring(root)
b'<p>&#31069;&#20320;<span>19</span>&#23681;\n&#29983;&#26085;&#24555;&#20048;&#12290;</p>'

>>> for e in root.xpath('/p/*'):
...   if e.tail:
...     e.tail = e.tail.replace('\n', '')
...

>>> etree.tostring(root)
b'<p>&#31069;&#20320;<span>19</span>&#23681;&#29983;&#26085;&#24555;&#20048;&#12290;</p>'
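
The loop above only removes the newline itself. A minimal sketch of the same idea, extended to also drop the indentation after the line break and to clean the outer element's own leading text, might look like the following. The helper name `strip_line_wraps` is hypothetical, and the example uses the standard library's `xml.etree.ElementTree` on the assumption that `lxml.etree` exposes the same `.text`/`.tail` model and the same calls used here.

```python
import re
import xml.etree.ElementTree as etree  # assumption: lxml.etree offers the same calls used below

# A line break together with any whitespace around it (e.g. indentation).
LINE_WRAP = re.compile(r"\s*\n\s*")

def strip_line_wraps(element):
    """Remove wrapped-line whitespace from the element's own text only.

    Hypothetical helper: text inside child elements is left untouched.
    """
    if element.text:
        element.text = LINE_WRAP.sub("", element.text)
    for child in element:
        # child.tail is the text that follows the child inside the parent,
        # so it belongs to the outer element; child.text is not modified.
        if child.tail:
            child.tail = LINE_WRAP.sub("", child.tail)
    return element

root = etree.fromstring("<p>祝你<span>19</span>岁\n    生日快乐。</p>")
strip_line_wraps(root)
result = etree.tostring(root, encoding="unicode")
print(result)
```

Because the pattern is not restricted to Chinese characters, this sketch would also join Latin words that were split across lines, so it probably only suits content where a line break never stands for a real space.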

kev


Thank you. I should look more into xpath. I accepted Matt's answer because it's more comprehensive. – Wang Dingwei Mar 20 '12 at 03:33

score 1 · Accepted Answer · edited May 23 '17 at 12:20

Controversially, I wonder whether this could be done without an HTML/XML parser, considering that the problem appears to be caused by line wrapping.

I built a regular expression to look for whitespace between Chinese text with the help of this solution here: https://stackoverflow.com/a/2718268/267781

I don't know whether the catch-all (any whitespace between two Chinese characters, \s+) or the more specific [char]\n\s*[char] is better suited to your problem.

# -*- coding: utf-8 -*-
import re

# Whitespace in Chinese HTML
## Used this solution to create the character class: https://stackoverflow.com/a/2718268/267781
CJK = u'[\u2e80-\u2e99\u2e9b-\u2ef3\u2f00-\u2fd5\u3005\u3007\u3021-\u3029\u3038-\u303a\u303b\u3400-\u4db5\u4e00-\u9fc3\uf900-\ufa2d\ufa30-\ufa6a\ufa70-\ufad9\U00020000-\U0002a6d6\U0002f800-\U0002fa1d]'

# The lookbehind/lookahead keep the neighbouring characters out of the match,
# so sub('') removes only the whitespace and not the characters around it.
## \s+ : any run of whitespace between two Chinese characters
fixwhitespace2 = re.compile(u'(?<=%s)\\s+(?=%s)' % (CJK, CJK))
## \n\s* : only a line break and the indentation that follows it
fixwhitespace = re.compile(u'(?<=%s)\\n\\s*(?=%s)' % (CJK, CJK))

sample = u'<html><body><p>\u795d\u4f6019\u5c81\n    \u751f\u65e5\u5feb\u4e50\u3002</p></body></html>'

fixwhitespace.sub(u'', sample)

Yielding

<html><body><p>祝你19岁生日快乐。</p></body></html>
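
To illustrate how the catch-all and the newline-specific variants differ, here is a small sketch. It is only an approximation: the character class is shortened to the basic CJK Unified Ideographs block for readability (an assumption, not the full class used above), and the names `any_whitespace` and `line_wrap_only` are made up for this example.

```python
import re

# Assumption for brevity: only the basic CJK Unified Ideographs block.
CJK = "[\u4e00-\u9fff]"

# Lookarounds match only the whitespace, never the surrounding characters.
any_whitespace = re.compile("(?<=%s)\\s+(?=%s)" % (CJK, CJK))
line_wrap_only = re.compile("(?<=%s)\\n\\s*(?=%s)" % (CJK, CJK))

# One plain space between Chinese characters, one wrapped line between
# Chinese characters, and one wrapped line between Latin words.
sample = "你好 世界\n    再见 ok\n    bye"

collapsed_all = any_whitespace.sub("", sample)
collapsed_wraps = line_wrap_only.sub("", sample)

print(repr(collapsed_all))    # '你好世界再见 ok\n    bye'
print(repr(collapsed_wraps))  # '你好 世界再见 ok\n    bye'
```

The catch-all variant also removes the single space that was typed between two Chinese characters, while the newline variant leaves it alone. Neither touches whitespace next to Latin text, where a line break usually does stand for a real space.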

However, here's how you might do it using the parser and xpath to find linefeeds:

# -*- coding: utf-8 -*-
from lxml import etree
import re

CJK = u'[\u2e80-\u2e99\u2e9b-\u2ef3\u2f00-\u2fd5\u3005\u3007\u3021-\u3029\u3038-\u303a\u303b\u3400-\u4db5\u4e00-\u9fc3\uf900-\ufa2d\ufa30-\ufa6a\ufa70-\ufad9\U00020000-\U0002a6d6\U0002f800-\U0002fa1d]'
# Lookarounds keep the Chinese characters on either side out of the match,
# so only the line break and the indentation are removed.
fixwhitespace = re.compile(u'(?<=%s)\\n\\s*(?=%s)' % (CJK, CJK))
sample = u'<html><body><p>\u795d\u4f6019\u5c81\n    \u751f\u65e5\u5feb\u4e50\u3002</p></body></html>'

doc = etree.HTML(sample)
# Select every text node that contains a line break.
for t in doc.xpath("//text()[contains(.,'\n')]"):
  # lxml text results know whether they are an element's tail or its text.
  if t.is_tail:
    t.getparent().tail = fixwhitespace.sub(u'', t)
  elif t.is_text:
    t.getparent().text = fixwhitespace.sub(u'', t)

print(etree.tostring(doc).decode('ascii'))

Yields:

<html><body><p>&#31069;&#20320;19&#23681;&#29983;&#26085;&#24555;&#20048;&#12290;</p></body></html>

I'm curious which approach is the best match for your working data.

Thanks! The pure regex didn't work for some content but xpath + regex worked very well. — Wang Dingwei, Mar 20 '12 at 03:29
*pure regex didn't work for some content* - I guess this validates all the warnings and caution regarding regexps and HTML/XML! I'm glad one of the approaches was effective across your dataset. — MattH, Mar 20 '12 at 08:48
