1

I have an XML document with a default namespace and another prefix defined on the root:

<r xmlns="a://foo" xmlns:b="a://bar" x="y"><!-- content not using b:* --></r>

Using Nokogiri I have already gone through the document and removed elements and attributes using the b namespace. Now I want to modify the document so that when output it does not have the b namespace, i.e.

<r xmlns="a://foo" x="y"><!-- content not using b:* --></r>

What does not work

If I use remove_namespaces! I lose even the default namespace, which I do not want:

<r x="y"><!-- content not using b:* --></r>

I can select the namespace using XPath, but Nokogiri::XML::Namespace does not inherit from Node and has no remove method:

doc.at('//namespace::*[name()="b"]')
#=> #<Nokogiri::XML::Namespace:0x8104118c prefix="b" href="a://bar">

doc.at('//namespace::*[name()="b"]').remove
#=> NoMethodError: undefined method `remove' for #<Nokogiri::XML::Namespace:0x81025acc prefix="b" href="a://bar">

doc.xpath('//namespace::*[name()="b"]').remove; puts doc
#=> <r xmlns="a://foo" xmlns:b="a://bar" x="y"><!-- content not using b:* --></r>

The root element does not include the namespace declaration as an attribute that could be removed:

doc.root.attributes
#=> {"x"=>#<Nokogiri::XML::Attr:0x8103dba4 name="x" value="y">} 

What sort of works

As the document is small, I will accept any solution that creates a new copy of the document without the namespace instead of mutating the existing one.

The best solution I've got so far is

doc.remove_namespaces!
doc.root.add_namespace(nil,'a://foo')

…but this nuclear option will also remove any namespaces on descendants of the root, which is undesirable.

Phrogz
  • do you want to remove only the text `xmlns:b="bar"` from the root element? why not regex? – Raj Jun 28 '14 at 14:14
  • @emaillenin Yes, I do want to remove only that. Text modification on the result is an option, though (a) it requires serializing, munging, and then re-parsing the document, and (b) [because this](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). (However unlikely, for example, that string could be the value of another attribute on the root element.) – Phrogz Jun 28 '14 at 14:21

2 Answers

0

In Nokogiri you can canonicalize the document, skipping just the namespace declaration, like so:

result = doc.canonicalize(nil,nil,1) do |o,_|
  !o.is_a?(Nokogiri::XML::Namespace) || o.href!="a://bar"
end

This returns a string, not a new document. If you want a new document, parse the result: doc2 = Nokogiri.XML(result).

Note that while Nokogiri::XML::Node also has a canonicalize method, it does not accept the block to decide whether or not to keep items. You must call this on the document itself.

The third parameter is required to include comments in the canonicalization. I do not know what the first two options do, except that the runtime will segfault if you pass a 1 for the second parameter.

However, this answer also strips whitespace from the document. I will not be accepting this.

Phrogz
-1

You could select the root element and remove its attribute like this:

doc.css('r')[0].attributes['xmlns:b'].remove

Quoting one of your own answers :)

Using the document as HTML,

irb(main):001:0> require 'nokogiri'
=> true
irb(main):002:0> doc = '<r xmlns="foo" xmlns:b="bar" x="y"><!-- content not using b:* --></r>'
=> "<r xmlns=\"foo\" xmlns:b=\"bar\" x=\"y\"><!-- content not using b:* --></r>"
irb(main):004:0> xml = Nokogiri::HTML(doc)
=> #<Nokogiri::HTML::Document:0x3fe4544e6268 name="document" children=[#<Nokogiri::XML::DTD:0x3fe4544e3158 name="html">, #<Nokogiri::XML::Element:0x3fe4544e2834 name="html" children=[#<Nokogiri::XML::Element:0x3fe4544e2410 name="body" children=[#<Nokogiri::XML::Element:0x3fe4544e2050 name="r" attributes=[#<Nokogiri::XML::Attr:0x3fe4544df01c name="xmlns" value="foo">, #<Nokogiri::XML::Attr:0x3fe4544dfcec name="xmlns:b" value="bar">, #<Nokogiri::XML::Attr:0x3fe4544dfd00 name="x" value="y">] children=[#<Nokogiri::XML::Comment:0x3fe4544df418 " content not using b:* ">]>]>]>]>
irb(main):005:0> xml.to_s
=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><r xmlns=\"foo\" xmlns:b=\"bar\" x=\"y\"><!-- content not using b:* --></r></body></html>\n"
irb(main):006:0> xml.css('r')[0].attributes['xmlns:b'].remove
=> #<Nokogiri::XML::Attr:0x3fe4544dfcec name="xmlns:b" value="bar">
irb(main):007:0> xml.to_s
=> "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n<html><body><r xmlns=\"foo\" x=\"y\"><!-- content not using b:* --></r></body></html>\n"
irb(main):008:0>
Raj
  • Did you test this? It doesn't work for me. The `attributes` does not include namespace declarations in my install. (Also: `doc.at('r')` or `doc.root` is a little simpler.) – Phrogz Jun 28 '14 at 14:28
  • Yes, but I used `Nokogiri::HTML` to initialize ;) instead of XML – Raj Jun 28 '14 at 14:36
  • Interesting. I can't accept this given what `HTML()` does to the content, but I'll not downvote either. – Phrogz Jun 28 '14 at 14:40