I can't find any good documentation on the difference between how Nokogiri (or by implication libxml) handles attribute values in XML vs. HTML. One of our projects was still using the now-defunct Hpricot gem, mostly because of its lax acceptance of attributes.
The crux of the problem seems to be that our XML input has both unquoted and missing attribute values. I'm not a spec lawyer, but I gather that most of the HTML variants allow these attribute patterns and XML does not.
If Nokogiri (or libxml) is going to be strict, shouldn't there be an option to make it less strict on attributes? If I could get the HTML parser not to strip the namespaces, I could maybe use that.
We can't be the only team that has XMLish formats that aren't exactly fish or fowl but something in between. If we could fix it at the source we might do that, but in the meantime we have to handle the format as it is.
This is my hack to fix the attributes before sending it to Nokogiri:
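
    require 'nokogiri'

    # An attribute: a name, optionally followed by "=" and a quoted or unquoted value.
    # "/" is excluded from names so the slash in "<br />" is not read as an attribute.
    ATTR_RE = /[^\s=>\/]+\s*(?:=(?:[^\s'">]+|\s*"[^"]*"|\s*'[^']*'))?/
    # A start tag: "<name", its attributes, then ">" or "/>".
    ELEMENT_RE = /(<\s*[-.:\w]+)((?:\s+#{ATTR_RE})*)(\s*\/?>)/

    fixed = data.gsub(ELEMENT_RE) do
      # Copy the captures first; the matches below overwrite $1..$3.
      open, attrs, close = $1, $2, $3
      ([open] +
        attrs.scan(ATTR_RE).map do |atr|
          if atr =~ /=\s*['"]/
            atr                               # already quoted: keep as-is
          elsif atr =~ /=/
            "#{$`.strip}=\"#{$'.strip}\""     # unquoted value: add quotes
          else
            "#{atr.strip}=\"#{atr.strip}\""   # missing value: repeat the name
          end
        end
      ) * ' ' + close
    end

    doc = Nokogiri::XML(fixed)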
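For anyone who wants to try the idea without Nokogiri, here is a self-contained sketch of the same preprocessing step using only core Ruby. The names `ATTR`, `START_TAG` and `normalize_attributes`, and the sample input, are made up for illustration. It excludes `/` from attribute names so that self-closing tags pass through, and it is still only a regex pass: start tags inside comments or CDATA sections would be rewritten too, and an unquoted value directly followed by `/>` would swallow the slash.

```ruby
# Attribute name, optionally followed by "=" and a quoted or unquoted value.
# Assumption: "/" never appears in an attribute name.
ATTR = %r{[^\s=>/]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s'">]+))?}
# Start tag: "<name", its attributes, then ">" or "/>".
START_TAG = %r{(<[:\w][-.:\w]*)((?:\s+#{ATTR})*)(\s*/?>)}

# Returns a copy of `data` in which every start-tag attribute has a quoted value.
def normalize_attributes(data)
  data.gsub(START_TAG) do
    # Copy the captures before any further matching overwrites them.
    open, attrs, close = $1, $2, $3
    fixed = attrs.scan(ATTR).map do |attr|
      name, value = attr.split(/\s*=\s*/, 2)
      if value.nil?
        %(#{name}="#{name}")        # missing value: repeat the name
      elsif value.start_with?('"', "'")
        "#{name}=#{value}"          # already quoted: keep the value
      else
        %(#{name}="#{value}")       # unquoted value: add quotes
      end
    end
    ([open] + fixed).join(' ') + close
  end
end

sample = '<ns:list xmlns:ns="urn:example"><ns:item id=42 checked>x</ns:item><br /></ns:list>'
puts normalize_attributes(sample)
```

The returned string can then be handed to `Nokogiri::XML`, which keeps the namespace prefixes because the XML parser is still the one doing the parsing.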