How to remove insignificant whitespace in lxml.html?

Question

I'm rather surprised that lxml.html leaves insignificant whitespace when parsing HTML by default. I'm also surprised that I can't find any obvious way to make it not do that.

Python 2.7.3 (default, Apr 10 2013, 06:20:15) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import lxml.etree
>>> parser = lxml.etree.HTMLParser(remove_blank_text=True)
>>> html = lxml.etree.HTML("<p>      Hello     World     </p>", parser=parser)
>>> print lxml.etree.tostring(html)
<html><body><p>      Hello     World     </p></body></html>

I expect the result would be something like:

>>> print lxml.etree.tostring(html)
<html><body><p>Hello World</p></body></html>

BeautifulSoup4 does the same thing with the html5lib parser:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<p>      Hello     World     </p>", "html5lib")
>>> soup.p
<p>      Hello     World     </p>

After doing some research, I found that the HTML5 parsing specification does not call for collapsing consecutive whitespace; that is done at render time instead. So I understand that it's technically not the responsibility of any of these libraries to do this, but it seems useful enough that I'm surprised none of them offer it anyway.

Can somebody prove me wrong?

Edit:

I know how to remove whitespace using a regex — that was not my question. (I also know how to search SO for questions about regex.)

My question has to do with the insignificant whitespace, where significance is defined by the standards for rendering HTML. I doubt that a 1-liner regex can correctly implement this standard. And let's not even delve into the regex vs CFG debate again, please?

RegEx match open tags except XHTML self-contained tags

Edit 2:

In case it's not clear from the context, I am interested in HTML, not XHTML/XML. Whitespace does have some non-trivial rules of significance in HTML; however, those rules are implemented in the renderer, not the parser. I understand that, as evidenced in my initial post. My question is whether anybody has implemented the whitespace logic of an HTML renderer in a library that operates at the DOM level rather than at the rendering level.

You are not wrong. The specification doesn't call for removing whitespace because this is a rendering/implementation detail. Since this really isn't a _problem_, and the solution would slow down parsing, that is probably why it is not included as a feature. – Burhan Khalid, Aug 29 '13 at 05:01
@BurhanKhalid I definitely wouldn't expect it to be enabled by default. It would also make for non-compliant parsing. I wouldn't call it an "implementation detail", either. Handling of whitespace is a very important part of the rendering standard. If it were an implementation detail, then different browsers would render websites quite differently. – Mark E. Haase, Aug 29 '13 at 15:14

Ivan Chaer · Accepted Answer · 2016-03-17T11:10:27.830

4

I came across this library.

Can be installed with pip:

pip install htmlmin

It's used like:

from htmlmin import minify
html = u"<html><body><p>      Hello     World     </p></body></html>"
minified_html = minify(html)
print(minified_html)

Which returns:

<html><body><p> Hello World </p></body></html>

I thought it would do what you were looking for, but as you see, some irrelevant spaces were kept.
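The spaces left just inside `<p>` are the ones a renderer would drop at the edges of a block. As a rough illustration of removing them without a renderer, here is a minimal sketch that uses only the standard library's `html.parser` (Python 3), not htmlmin. The names `WhitespaceCollapser`, `collapse_whitespace`, `BLOCK` and `PRESERVE` are made up for this example, the list of block-level elements is deliberately incomplete, and CSS (`white-space`, `display`) is ignored entirely, so treat it as a starting point rather than an implementation of the rendering rules.

```python
import re
from html.parser import HTMLParser

# Assumptions for this sketch: an incomplete list of block-level elements,
# and the elements whose text content must be left untouched.
BLOCK = {"html", "body", "div", "p", "pre", "ul", "ol", "li",
         "table", "tr", "td", "th", "h1", "h2", "h3"}
PRESERVE = {"pre", "textarea", "script", "style"}
WS_RUN = re.compile(r"[ \t\n\r\f]+")  # ASCII whitespace only, so U+00A0 survives


class WhitespaceCollapser(HTMLParser):
    def __init__(self):
        # Keep character references such as &nbsp; unresolved so they can be
        # written back the way they appeared.
        super().__init__(convert_charrefs=False)
        self.out = []              # [kind, text] pairs, joined at the end
        self.preserve_depth = 0    # > 0 while inside <pre>, <script>, ...
        self.at_block_edge = True  # True right after a block-level tag

    def _emit_tag(self, raw, tag):
        if tag in BLOCK:
            # Drop a space left dangling just before a block boundary.
            if self.out and self.out[-1][0] == "text":
                self.out[-1][1] = self.out[-1][1].rstrip(" ")
                if not self.out[-1][1]:
                    self.out.pop()
            self.at_block_edge = True
        else:
            self.at_block_edge = False
        self.out.append(["tag", raw])

    def _emit_raw(self, raw):
        # Content that must never be trimmed or collapsed.
        self.out.append(["raw", raw])
        self.at_block_edge = False

    def handle_starttag(self, tag, attrs):
        self._emit_tag(self.get_starttag_text(), tag)
        if tag in PRESERVE:
            self.preserve_depth += 1

    def handle_startendtag(self, tag, attrs):
        self._emit_tag(self.get_starttag_text(), tag)

    def handle_endtag(self, tag):
        if tag in PRESERVE and self.preserve_depth:
            self.preserve_depth -= 1
        self._emit_tag("</%s>" % tag, tag)

    def handle_data(self, data):
        if self.preserve_depth:
            self._emit_raw(data)
            return
        data = WS_RUN.sub(" ", data)      # collapse each run to one space
        if self.at_block_edge:
            data = data.lstrip(" ")       # nothing visible precedes it
        if data:
            self.out.append(["text", data])
            self.at_block_edge = False

    def handle_entityref(self, name):
        self._emit_raw("&%s;" % name)

    def handle_charref(self, name):
        self._emit_raw("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append(["raw", "<!--%s-->" % data])

    def handle_decl(self, decl):
        self.out.append(["raw", "<!%s>" % decl])


def collapse_whitespace(html):
    parser = WhitespaceCollapser()
    parser.feed(html)
    parser.close()
    return "".join(text for _, text in parser.out)


print(collapse_whitespace(
    "<html><body><p>      Hello     World     </p></body></html>"))
# prints: <html><body><p>Hello World</p></body></html>
```

A single space between inline elements, as in `<b>a</b> <i>b</i>`, is kept, and text inside `<pre>` is passed through unchanged. Anything that depends on styling cannot be decided at this level.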

edited Mar 17 '16 at 11:10

answered Mar 17 '16 at 11:04

Ivan Chaer


Thanks! This is definitely close to what I was thinking of, but it is really buggy. `
testing
testing
` minifies to `
testing
testing`!
– Mark E. Haase Mar 17 '16 at 14:00
That's weird. Here it didn't, look: >>> from htmlmin import minify >>> html=u"
testing
testing
" >>> minified_html = minify(html) >>> print minified_html
testing
testing
What versions of python and pip are you using? – Ivan Chaer Mar 17 '16 at 18:39
Sorry, replies can't be formatted with line breaks on Stack Overflow. Could you understand my last comment? I was just trying to show you that I got `
testing
testing
` printing the minified version of your mentioned example. – Ivan Chaer Mar 17 '16 at 18:41
Ahh, my mistake. I tried the [online version](http://kangax.github.io/html-minifier/) not realizing that uses a Node.js library, not the Python library you were referring to. The Python library does work, thanks! – Mark E. Haase Mar 17 '16 at 20:20

score -2 · Answer 2 · answered Aug 29 '13 at 05:37

OK. You want to detect runs of whitespace and remove the excess.

You can do it with a regular expression.

from re import sub
sub(r"\s+", " ", yourstring)

This replaces every run of one or more whitespace characters with a single space.

'<p> Hello World </p>'

was my result with this.

I suppose it's close enough to what you expected, and a single space is always better for readability than none.
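One caveat worth illustrating, assuming Python 3 and `str` input: there `\s` also matches Unicode whitespace such as the non-breaking space (U+00A0), which a browser does not collapse. The sketch below compares `\s+` with an explicit ASCII whitespace class. The function name `collapse_ascii_whitespace` is made up for this example, and neither variant knows about `<pre>` or other elements where whitespace is significant.

```python
import re

# Runs of ASCII whitespace only: space, tab, LF, CR, form feed.
HTML_WS_RUN = re.compile(r"[ \t\n\r\f]+")


def collapse_ascii_whitespace(text):
    """Replace each run of ASCII whitespace with a single space."""
    return HTML_WS_RUN.sub(" ", text)


sample = "<p>  Hello \t World\u00a0\u00a0!  </p>"

# The two non-breaking spaces before "!" are preserved.
print(repr(collapse_ascii_whitespace(sample)))
# prints: '<p> Hello World\xa0\xa0! </p>'

# With \s+ they are merged into one ordinary space.
print(repr(re.sub(r"\s+", " ", sample)))
# prints: '<p> Hello World ! </p>'
```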

With a slightly longer regular expression, you should be able to remove the whitespace adjacent to HTML tags as well.
