
I'm trying to compact an existing XML file using Nokogiri. I have the following demo code:

#!/usr/bin/env ruby
require 'nokogiri'

doc = Nokogiri.XML <<-XML.strip
<?xml version="1.0" encoding="UTF-8"?>
<root>
  <foo>
    <bar>test</bar>
  </foo>
</root>
XML

doc.write_xml_to($stdout, indent: 0)

I expected to see:

<?xml version="1.0" encoding="UTF-8"?>
<root><foo><bar>test</bar></foo></root>

but instead I saw:

<?xml version="1.0" encoding="UTF-8"?>
<root>
  <foo>
    <bar>test</bar>
  </foo>
</root>

I tried:

doc.write_to($stdout, indent: 0, save_with: Nokogiri::XML::Node::SaveOptions::AS_XML)

but that doesn't work either.

How can I remove the ignorable whitespace?

the Tin Man
Aetherus
  • https://stackoverflow.com/questions/8406251/nokogiri-to-xml-without-carriage-returns may help. I was going to suggest sub, but it's not feasible if you have many levels of data. The only other option I can think of is a regex, but if you had long strings in XML attributes or values then that probably would not work either. – whodini9 Jul 16 '17 at 13:13
  • @whodini9 I'm not using a builder because my ultimate goal is to compact an existing XML file. Furthermore, according to the official documentation and source code of Nokogiri, `Node#write_xml_to` just calls `Node#write_to` with the option `save_with: DEFAULT_XML`. By the way, `AS_XML` is an alias of `DEFAULT_XML`. – Aetherus Jul 16 '17 at 13:18

2 Answers

You can tell Nokogiri to ignore empty text nodes and then to output without indentation:

require 'nokogiri'

xml = <<EOT
<?xml version="1.0" encoding="UTF-8"?>
<root>
  <foo>
    <bar>test</bar>
  </foo>
</root>
EOT

doc = Nokogiri::XML(xml) { |opts|
  # strict parsing, and drop whitespace-only text nodes while parsing
  opts.strict.noblanks
}
doc.to_xml(:indent_text => '', :indent => 0)
# => "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
#    "<root>\n" +
#    "<foo>\n" +
#    "<bar>test</bar>\n" +
#    "</foo>\n" +
#    "</root>\n"
the Tin Man

Okay, I'll answer my own question.

Nokogiri does not remove the whitespace because it cannot know whether the whitespace is ignorable (there is no DTD and no schema), so it keeps every whitespace-only run as a text node. I have to remove those nodes manually before writing the XML doc to the IO device.

#!/usr/bin/env ruby
require 'nokogiri'

doc = Nokogiri.XML <<-XML.strip
<?xml version="1.0" encoding="UTF-8"?>
<root>
  <foo>
    <bar>test</bar>
  </foo>
</root>
XML

# remove ignorable white spaces
doc.xpath('//text()').each do |node|
  node.content = '' if node.text =~ /\A\s+\z/m
end

doc.write_xml_to($stdout, indent: 0)

This is big progress for me, but I still haven't reached my goal: the XML file I'm working on has inline self-closing tags, and the whitespace-only text nodes between those tags must not be removed. I'm now trying to figure out a way to handle this corner case.

Aetherus