I recently switched to the OmniXML parser included with Delphi XE7 so that I can target iOS. The XML data comes from a cloud service and includes nodes with base64-encoded binary data.

Now I get the exception "Invalid Unicode Character value for this platform" when calling XMLDocument.LoadFromStream, and it seems to be the base64 line-break sequence &#xD; that fails.

The nodes with base64 data look similar to this:

<data>TVRMUQAAAAIAAAAAFFo3FAAUAAEA8AADsAAAAEAAAABAAHAAwABgAAAAAAAAAAAQEBAAAAAAAA&#xD;
AAMQAAABNUgAAP/f/AAMABAoAAAAEAAAAAEVNVExNAAAAAQAAAAAUWjcUABQAAQD/wAA&#xD;
AAA=</data>

I traced it down to these lines in XML.Internal.OmniXML:

  psCharHexRef:
    if CharIs_WhiteSpace(ReadChar) then
      raise EXMLException.CreateParseError(INVALID_CHARACTER_ERR, MSG_E_UNEXPECTED_WHITESPACE, [])
    else
    begin
      case ReadChar of
        '0'..'9': CharRef := LongWord(CharRef shl 4) + LongWord(Ord(ReadChar) - 48);
        'A'..'F': CharRef := LongWord(CharRef shl 4) + LongWord(Ord(ReadChar) - 65 + 10);
        'a'..'f': CharRef := LongWord(CharRef shl 4) + LongWord(Ord(ReadChar) - 97 + 10);
        ';':
          if CharIs_Char(Char(CharRef)) then
          begin
            Result := Char(CharRef);
            Exit;
          end
          else
            raise EXMLException.CreateParseError(INVALID_CHARACTER_ERR, MSG_E_INVALID_UNICODE, []);

The exception in the last line is the one raised, because CharIs_Char(#13) returns False (#13 being the value of CharRef read from &#xD;).

How do I solve this?

J...
Hans

1 Answer

This is clearly a bug in OmniXML. It looks like the developers were trying to implement XML 1.0, which states:

...XML processors MUST accept any character in the range specified for Char.

Character Range

[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

/* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

The implementation of CharIs_Char, however, looks like this:

function CharIs_Char(const ch: Char): Boolean;
begin
  // [2] Char - any Unicode character, excluding the surrogate blocks, FFFE, and FFFF
  Result := not Ch.IsControl;
end;

This excludes all control characters, including #x9 (TAB), #xA (LF) and #xD (CR), which the specification explicitly allows. In fact, since an XML processor normalizes literal carriage returns to LF during parsing (section 2.11 of the specification), the only way to include an actual carriage return is with a character reference such as &#xD;, which is exactly the case that fails here.
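For illustration, here is a minimal sketch of a check that follows production [2] Char literally. IsXmlChar is a hypothetical helper name, not something that exists in OmniXML, and it takes a code point rather than a Char, because a single UTF-16 Char cannot hold the #x10000-#x10FFFF range.

```delphi
program XmlCharCheck;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Character;

// Hypothetical helper (not part of OmniXML): tests a Unicode code point
// against production [2] Char of the XML 1.0 specification.
function IsXmlChar(CodePoint: Cardinal): Boolean;
begin
  case CodePoint of
    $9, $A, $D,          // TAB, LF and CR are explicitly allowed
    $20..$D7FF,          // everything below the surrogate blocks
    $E000..$FFFD,        // skips the surrogates, FFFE and FFFF
    $10000..$10FFFF:     // supplementary planes
      Result := True;
  else
    Result := False;
  end;
end;

var
  CR: Char;
begin
  CR := #13;
  // CR is a control character, so a "not IsControl" test rejects it...
  Writeln('IsControl(#13): ', CR.IsControl);
  // ...while the range-based test accepts it, as the specification requires.
  Writeln('IsXmlChar($D):  ', IsXmlChar($D));
  // NUL and #xFFFE remain invalid under the range-based test.
  Writeln('IsXmlChar($0):  ', IsXmlChar($0));
  Writeln('IsXmlChar($FFFE): ', IsXmlChar($FFFE));
end.
```

The range-based form mirrors the production one term at a time, which makes it easy to compare against the specification.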

This seems like a showstopper and should be submitted as a QC report.

J...
  • Setting `DefaultDOMVendor := sAdom4XmlVendor;` solved the problem. Who would have known that line breaks are such an uncommon feature of an XML document that nobody has discovered this bug in OmniXML... – Hans May 05 '15 at 06:53
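
A rough sketch of the workaround from the comment above, as a console program. It assumes that the ADOM vendor lives in the Xml.adomxmldom unit (which should declare sAdom4XmlVendor) and that DefaultDOMVendor is declared in Xml.xmldom. The XML string is a shortened stand-in for the data in the question, not the real payload.

```delphi
program LoadWithAdom;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Classes,
  Xml.xmldom, Xml.XMLIntf, Xml.XMLDoc,
  Xml.adomxmldom; // assumed to register the ADOM vendor and declare sAdom4XmlVendor

var
  Doc: IXMLDocument;
  Stream: TStringStream;
begin
  // Select ADOM instead of OmniXML before any document is created.
  DefaultDOMVendor := sAdom4XmlVendor;

  // Shortened sample: base64 text with a CR character reference and a line feed.
  Stream := TStringStream.Create('<data>TVRMUQAA&#xD;'#10'AAA=</data>', TEncoding.UTF8);
  try
    Doc := TXMLDocument.Create(nil);
    Doc.LoadFromStream(Stream);
    // Reaching this line means the parser accepted the &#xD; reference.
    Writeln('Root element: ', Doc.DocumentElement.NodeName);
  finally
    Stream.Free;
  end;
end.
```

Setting the vendor globally keeps the rest of the code unchanged. Whether ADOM is available on every target platform is worth checking separately.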