
I am trying to clean up an HTML table using lxml.html.clean.Cleaner(). I need to strip JavaScript attributes, but would like to preserve inline CSS style. I thought style=False is the default setup:

import lxml.html.clean
cleaner = lxml.html.clean.Cleaner()

However, when I call cleaner.clean_html(doc),

<span style="color:#008800;">67.51</span>

will become

<span>67.51</span>

Basically, style is not preserved. I tried to add:

cleaner.style = False

It doesn't help.

Update: I am using Python 2.6.6 + lxml 3.2.4 on Dreamhost, and Python 2.7.5 + lxml 3.2.4 on my local MacBook. Both give the same result. Another thing: there is a JavaScript-related attribute in my HTML:

<td style="cursor:pointer;">Ticker</td>

Could it be that lxml stripped this JavaScript-related style and then treated all other styles the same way? I hope not.

karel
laviex

1 Answer


It works if you set cleaner.safe_attrs_only = False.

The set of "safe" attributes (Cleaner.safe_attrs) is defined in the lxml.html.defs module (source code) and style is not included in the set.

A better option than cleaner.safe_attrs_only = False is Cleaner(safe_attrs=lxml.html.defs.safe_attrs | set(['style'])). This preserves style while still stripping every other attribute that is not in the safe set.

Demo code:

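from lxml import html
from lxml.html import clean

s = '<marquee><span style="color: #008800;">67.51</span></marquee>'
doc = html.fromstring(s)

# Extend the default whitelist with 'style' instead of disabling it entirely.
cleaner = clean.Cleaner(safe_attrs=html.defs.safe_attrs | set(['style']))

# print() with decode() works on Python 2 and 3 (tostring returns bytes on Python 3).
print(html.tostring(cleaner.clean_html(doc)).decode('ascii'))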

Output:

<div><span style="color: #008800;">67.51</span></div>
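
To illustrate why the set union works, here is a minimal pure-Python sketch of whitelist-based attribute filtering. It is not lxml's actual implementation: SAFE_ATTRS is a tiny hypothetical stand-in for lxml.html.defs.safe_attrs, and filter_attrs is an illustrative helper that models only the whitelist step. The real Cleaner also applies its other options, which this sketch does not cover.

```python
# Hypothetical, heavily shortened stand-in for lxml.html.defs.safe_attrs.
SAFE_ATTRS = frozenset(["class", "href", "title", "width"])


def filter_attrs(attrs, safe_attrs=SAFE_ATTRS, safe_attrs_only=True):
    """Return a copy of attrs, keeping only whitelisted attribute names.

    Illustrative helper, not part of lxml.
    """
    if not safe_attrs_only:
        # Whitelist disabled: every attribute passes through this step.
        return dict(attrs)
    return {name: value for name, value in attrs.items() if name in safe_attrs}


span = {"style": "color:#008800;", "onclick": "steal()", "title": "price"}

# Default whitelist: 'style' is not in the set, so it is dropped.
default = filter_attrs(span)

# Whitelist extended with 'style': style survives, onclick is still dropped.
extended = filter_attrs(span, safe_attrs=SAFE_ATTRS | {"style"})

# Whitelist disabled: nothing is dropped by this step.
unfiltered = filter_attrs(span, safe_attrs_only=False)

print(default)
print(extended)
print(unfiltered)
```

Extending the set keeps the whitelist in force for everything except style, which is why it is the safer of the two options.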
mzjn
  • It indeed works! Thanks a lot. Now I am wondering why style=False doesn't work. I guess it might be because of this code, some trade-off between JavaScript and style (and safe_attrs). Thanks for the workaround and for pointing me to the source code to read more. – laviex Dec 07 '13 at 07:38
  • It's a nice solution! – JuHong Jung Oct 30 '16 at 13:21