Strip style attributes with nokogiri

Question

I'm scraping an HTML page with Nokogiri and I want to strip out all style attributes.
How can I achieve this? (I'm not using Rails, so I can't use its sanitize method, and I don't want to use the sanitize gem because I want to remove attributes by blacklist, not by whitelist.)

require 'nokogiri'
require 'open-uri'

# url holds the address of the page being scraped
html = URI.open(url)
doc = Nokogiri::HTML(html.read)
doc.css('.post').each do |post|
  puts post.to_s
end

=> <p><span style="font-size: x-large">bla bla <a href="http://torrentfreak.com/netflix-is-killing-bittorrent-in-the-us-110427/">statistica</a> blabla</span></p>

I want it to be

=> <p><span>bla bla <a href="http://torrentfreak.com/netflix-is-killing-bittorrent-in-the-us-110427/">statistica</a> blabla</span></p>

Phrogz · Accepted Answer · 2014-03-16T02:26:02.563

20

require 'nokogiri'

html = '<p class="post"><span style="font-size: x-large">bla bla</span></p>'
doc = Nokogiri::HTML(html)
doc.xpath('//@style').remove
puts doc.css('.post')
#=> <p class="post"><span>bla bla</span></p>

Edited to show that you can just call NodeSet#remove instead of having to use .each(&:remove).

Note that if you have a DocumentFragment instead of a Document, Nokogiri has a longstanding bug where searching from a fragment does not work as you would expect. The workaround is to use:

doc.xpath('@style|.//@style').remove

edited Mar 16 '14 at 02:26

answered May 23 '11 at 22:26

Phrogz


Use `doc.xpath('.//@style').remove` to remove all inline styles from all nodes; note the `.` at the beginning, as mentioned by @bricker below. Chain `.to_s` to get the resulting html string. – Gon Zifroni Mar 16 '14 at 01:08
Correction: Don't chain it but use `description.to_s` to get the resulting html string. If you don't want the `DOCTYPE` you should use the `Nokogiri::HTML.fragment` method instead, see http://stackoverflow.com/questions/4723344/how-to-prevent-nokogiri-from-adding-doctype-tags – Gon Zifroni Mar 16 '14 at 01:17

score 8 · Answer 2 · answered Oct 08 '14 at 01:50

This works with both a document and a document fragment:

doc = Nokogiri::HTML::DocumentFragment.parse(...)

or

doc = Nokogiri::HTML(...)

To delete all the 'style' attributes, you can use:

doc.css('*').remove_attr('style')

answered Oct 08 '14 at 01:50

Debajit


score 3 · Answer 3 · edited Mar 07 '19 at 14:16

I tried Phrogz's answer but could not get it to work. I was using a document fragment, though I would have expected that to behave the same.

The "//" at the start did not seem to check all nodes as I expected. In the end I did something a bit more long-winded that worked, so for the record, in case anyone else has the same trouble, here is my solution (dirty though it is):

doc = Nokogiri::HTML::Document.new
body_dom = doc.fragment( my_html )

# strip out any attributes we don't want
body_dom.xpath( './/*[@align]|*[@align]' ).each do |tag|
    tag.attributes["align"].remove
end

edited Mar 07 '19 at 14:16

Nakilon


answered Jul 11 '12 at 10:03

Pete Duncanson


This would also probably work: `body_dom.xpath('.//@class')` (notice the extra dot at the beginning of the xpath) – bricker Jan 29 '13 at 21:24
Nokogiri and/or LibXML2 have [a bug with XPath inside fragments](https://github.com/sparklemotion/nokogiri/issues/572). The current best workaround for fragments is as you note: instead of `//foo` you must use `foo|.//foo`. – Phrogz Mar 16 '14 at 02:22
