Nokogiri XML Parser with Bad Attribute Values

Question

I can't find any good documentation on the difference between how Nokogiri (or, by implication, libxml) handles attribute values in XML vs. HTML. One of our projects was still using the now-defunct Hpricot gem, mostly because of its lax acceptance of attributes.

The crux of the problem seems to be that our XML input has both unquoted and missing attribute values. I'm not a spec lawyer, but I gather that most of the HTML variants allow these attribute patterns and XML does not.

If Nokogiri (or libxml) is going to be strict, shouldn't there be an option to make it less strict on attributes? If I could get the HTML parser not to strip the namespaces, I could maybe use that.

We can't be the only team that has XMLish formats that aren't exactly fish or fowl but something in between. If we could fix it at the source we might do that, but in the meantime we have to handle the format as it is.

This is my hack to fix the attributes before sending the data to Nokogiri:

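require 'nokogiri'

# One attribute: a name, optionally followed by "=" and an unquoted,
# double-quoted, or single-quoted value.
ATTR_RE = /[^\s=>]+\s*(?:=(?:[^\s'">]+|\s*"[^"]*"|\s*'[^']*'))?/
# An opening tag: "<name", any number of attributes, then ">".
ELEMENT_RE = /(<\s*[:\w]+)((?:\s+#{ATTR_RE})*)(\s*>)/

# `data` holds the raw XML-ish input as a String.
fixed = data.gsub(ELEMENT_RE) do
  open_tag, attrs, close = $1, $2, $3
  quoted = attrs.scan(ATTR_RE).map do |atr|
    name, eq, value = atr.partition('=')
    if eq.empty?
      %(#{name.strip}="#{name.strip}")    # valueless: reuse the name as the value
    elsif value.lstrip.start_with?('"', "'")
      atr                                  # already quoted: leave untouched
    else
      %(#{name.strip}="#{value.strip}")   # unquoted: wrap the value in quotes
    end
  end
  ([open_tag] + quoted).join(' ') + close
end

doc = Nokogiri::XML(fixed)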
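
To make the effect of the substitution easier to check without Nokogiri installed, here is a minimal sketch that wraps the same two regexes in a helper and runs it on a made-up input. The helper name `quote_attributes` and the sample string are assumptions for illustration only; they are not our real data.

```ruby
# Same patterns as in the hack above, repeated so the snippet runs on its own.
ATTR_RE = /[^\s=>]+\s*(?:=(?:[^\s'">]+|\s*"[^"]*"|\s*'[^']*'))?/
ELEMENT_RE = /(<\s*[:\w]+)((?:\s+#{ATTR_RE})*)(\s*>)/

# Hypothetical helper: returns a copy of `data` in which every attribute of
# every matched opening tag has a quoted value.
def quote_attributes(data)
  data.gsub(ELEMENT_RE) do
    open_tag, attrs, close = $1, $2, $3
    quoted = attrs.scan(ATTR_RE).map do |atr|
      name, eq, value = atr.partition('=')
      if eq.empty?
        %(#{name.strip}="#{name.strip}")   # valueless attribute
      elsif value.lstrip.start_with?('"', "'")
        atr                                 # already quoted
      else
        %(#{name.strip}="#{value.strip}")  # unquoted value
      end
    end
    ([open_tag] + quoted).join(' ') + close
  end
end

# Invented sample with an unquoted value, a valueless attribute, and a namespace prefix.
sample = %(<ns:item id=42 checked title="ok">text</ns:item>)
puts quote_attributes(sample)
# prints: <ns:item id="42" checked="checked" title="ok">text</ns:item>
```

One limitation worth noting: a self-closing tag with an unquoted last value, such as `<a b=c/>`, is not handled, because the slash is swallowed into the value and the result is `<a b="c/">`.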

Please read "[mcve]". We'd like to help, but you're not giving us enough information. Show us the minimal input data necessary to demonstrate what you're trying to do, along with the desired output. "XML-like" isn't close enough to work with libxml, since the XML spec is pretty rigid about the format. As you're already doing, you have to mangle the input before passing it to Nokogiri, not afterwards; otherwise the parser applies its own fixups, which can really mess with Nokogiri's understanding of the document. How you're going about it isn't idiomatic Ruby, though. — the Tin Man, Aug 04 '16 at 19:40
Regular expressions aren't sufficient to process XML/HTML so it's likely they can't do it for an XML-like document either. http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags is a famous posting about this. Possibly writing a SAX processor would make it easier than writing/maintaining a reliable regex. — the Tin Man, Aug 04 '16 at 19:46
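
As a rough illustration of moving away from one large regex, the sketch below walks the input with Ruby's stdlib `StringScanner` and rewrites attributes token by token. This is not the SAX approach mentioned above, just a hand-written pre-processing pass under stated assumptions: the names `normalize_attributes` and `scan_open_tag` are hypothetical, and comments, CDATA sections, and processing instructions get no special handling.

```ruby
require 'strscan'

# Hypothetical helper: tries to read one opening tag at the scanner position.
# Returns the tag with every attribute value quoted, or nil if the text at
# this position does not look like an opening tag.
def scan_open_tag(s)
  name = s.scan(/<\s*[:\w]+/)
  return nil unless name
  parts = [name]
  until s.eos?
    s.skip(/\s+/)
    close = s.scan(%r{/?>})                # ">" or "/>" ends the tag
    return parts.join(' ') + close if close
    attr = s.scan(%r{[^\s=/>]+})           # attribute name
    return nil unless attr
    value =
      if s.skip(/\s*=\s*/)
        s.scan(/"[^"]*"/) || s.scan(/'[^']*'/) ||
          %("#{s.scan(%r{[^\s>]+?(?=\s|/?>)})}")   # unquoted value
      else
        %("#{attr}")                       # valueless: reuse the name
      end
    parts << "#{attr}=#{value}"
  end
  nil                                      # input ended before the tag closed
end

# Hypothetical helper: copies text through unchanged and rewrites opening tags.
def normalize_attributes(data)
  s = StringScanner.new(data)
  out = String.new
  until s.eos?
    out << (s.scan(/[^<]+/) || '')         # plain text up to the next "<"
    break if s.eos?
    start = s.pos
    tag = scan_open_tag(s)
    if tag
      out << tag
    else
      s.pos = start                        # not an opening tag: copy "<" as-is
      out << s.getch
    end
  end
  out
end

puts normalize_attributes(%(<ns:e k='v' flag/>))
# prints: <ns:e k='v' flag="flag"/>
```

Because the scanner consumes quoted values as whole tokens, a `>` inside a quoted attribute value does not end the tag early, and `/>` is kept as the closing token rather than being folded into an unquoted value.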


0 Answers