
Given an XML string:

xml = "<org><people> <person>Joe Shmoe</person> <person>Bo Bob</person> 
    <person>New Guy</person> </people><other><![CDATA[ This string might 
have tags < >  < > and stuff, don't touch this ]]></other></org>"

How can I get rid of newlines and spaces between the tags without affecting tag text, CDATA, etc.?

Result should be:

xml = "<org><people><person>Joe Shmoe</person><person>Bo Bob</person><person>New Guy</person></people><other><![CDATA[ This string might 
have tags < >  < > and stuff, don't touch this ]]></other></org>"

UPDATE: This is what I've come up with so far. I just can't figure out how to make it ignore CDATA content:

xml.gsub(/>\s+</,"><")

Also, I would much rather use an XML parser for this, since from what I hear, using regexes on XML is a bad idea.

Yarin

1 Answer


Yes! What you want is canonicalization!

http://xml4r.github.io/libxml-ruby/rdoc/classes/LibXML/XML/Document.html#method-i-canonicalize

The LibXML-Ruby gem can do this. Since the docs are shitty and don't even say what the method does, here is the spec:

http://www.w3.org/TR/xml-c14n

This is used a lot in XML signing.

And yes! Using regular expressions on XML is bad.

BTW you can also print your xml object as a string, and set indentation:

http://xml4r.github.io/libxml-ruby/rdoc/classes/LibXML/XML/Document.html#method-i-to_s
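One caveat worth checking against the spec linked above: C14N retains whitespace in character content and replaces CDATA sections with their escaped text, so canonicalization alone may not give exactly the output asked for. As an alternative, here is a minimal sketch using REXML (not LibXML-Ruby) and its `ignore_whitespace_nodes` context option, which drops whitespace-only text nodes at parse time. The helper name is hypothetical, and I am assuming REXML is available (it ships with standard Ruby installs, as a bundled gem in newer versions).

```ruby
require "rexml/document"

# Hypothetical helper: parse the string, skipping text nodes that
# contain only whitespace, then serialise the document again.
# CDATA sections are kept as separate nodes and written back as-is.
def strip_blank_text_nodes(xml)
  doc = REXML::Document.new(xml, { ignore_whitespace_nodes: :all })
  doc.to_s
end

xml = <<~XML.chomp
  <org><people> <person>Joe Shmoe</person> <person>Bo Bob</person>
      <person>New Guy</person> </people><other><![CDATA[ This string might
  have tags < >  < > and stuff, don't touch this ]]></other></org>
XML

puts strip_blank_text_nodes(xml)
```

Only whitespace-only nodes are dropped, so text like `Joe Shmoe` and the CDATA body should come through unchanged.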

Chloe