2

I am deserializing the following XML file with XmlSerializer, using VSTS 2008, C#, and .NET 3.5.

Here is the XML file.

<?xml version="1.0" encoding="utf-8"?>
<Person><Name>=b?olu</Name></Person>

Here are screenshots of the XML file as displayed and of its binary form:

[screenshot: XML file as displayed]

[screenshot: binary form of the XML file]

If there is a way to accept such characters, that would be great. My XML file is big, so if such characters really are invalid and must be filtered out, I want to keep the remaining content of the XML file after deserialization.

Currently, XML deserialization fails with an InvalidOperationException and all the information in the XML file is lost.

Also, when I open this XML file in VSTS, I get this error: "Error 1 Character '?', hexadecimal value 0xffff is illegal in XML documents." I am confused, since there is no 0xffff value in the binary form.

Any solutions or ideas?

EDIT1: here is the code I use to deserialize the XML file:

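    // Requires: using System.IO; using System.Xml.Serialization;
    static void Foo()
    {
        XmlSerializer s = new XmlSerializer(typeof(Person));
        // Dispose the reader so the file handle is released even when Deserialize throws.
        using (StreamReader file = new StreamReader("bug.xml"))
        {
            s.Deserialize(file);
        }
    }

    public class Person
    {
        public string Name;
    }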
Glorfindel
George2

3 Answers

1

Does this style help?

<name>
   <![CDATA[
     =b?olu
   ]]>
</name>

Either that or encoding should do the trick.

EDIT: Found this page: http://www.eggheadcafe.com/articles/system.xml.xmlserialization.asp. Specifically, this code for deserialization:

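public Object DeserializeObject(String pXmlizedString)
{
    XmlSerializer xs = new XmlSerializer(typeof(Automobile));
    // StringToUTF8ByteArray is a helper that is not shown here (presumably defined in the linked article).
    MemoryStream memoryStream = new MemoryStream(StringToUTF8ByteArray(pXmlizedString));
    // Note: this writer is created but never used by Deserialize below.
    XmlTextWriter xmlTextWriter = new XmlTextWriter(memoryStream, Encoding.UTF8);
    return xs.Deserialize(memoryStream);
}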

The "StringToUTF8ByteArray" and "Encoding.UTF8" parts are conspicuously absent from your code. I'm guessing .NET is misreading the encoding of your actual XML file?

Glenn
  • Thanks Glenn. The issue is that the XML file is my application's input, so I cannot change it in any way. I need to find a way to filter out the invalid characters and continue to parse (deserialize) the remaining ones. If there is a way to accept such characters, that would be even better! – George2 Aug 31 '09 at 06:20
  • 1
    Sounds like you need either a SAX parser (http://stackoverflow.com/questions/127869/sax-vs-xmltextreader-sax-in-c), or you need to pre-process the XML yourself and strip or encode the problem characters with a regex or similar. You might have to dig around for a regex example; I'm not familiar enough with regexes to give one here. – Glenn Aug 31 '09 at 06:26
  • 1
    Oh right, even with a SAX parser, you still need to sanitize characters. So you might have to overload it. – Glenn Aug 31 '09 at 06:26
  • Is catching InvalidOperationException during XML deserialization a good way to check whether the XML file is valid? – George2 Aug 31 '09 at 06:31
  • Also, when I open this XML file in VSTS, I get this error: "Error 1 Character '?', hexadecimal value 0xffff is illegal in XML documents." I am confused, since there is no 0xffff value in the binary form. – George2 Aug 31 '09 at 06:43
  • 1
    Catching exceptions isn't a good solution because it won't allow you to continue parsing. Your XML *is* invalid, so you need to pre-process it somehow. Which is more difficult: loading the file as text, pre-processing it, and then loading the XML, or changing the original source so that it generates valid XML? – Glenn Aug 31 '09 at 07:00
  • Hi Glenn, I did some research and found that 0xEF 0xBF 0xBF is the valid UTF-8 encoding of the character 0xFFFF. Why does the XML deserializer think it is invalid? – George2 Aug 31 '09 at 11:47
  • I'm just speculating here that it is invalid because XML is a text format, yet you need a binary JPG to show us the correct view of the data *and* your parsing is failing. While looking into it, I found a page that might help and added it to my answer. – Glenn Aug 31 '09 at 20:00
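
The pre-processing route discussed in these comments (load the file as text, strip the problem characters, then deserialize) could look roughly like the sketch below. It is a minimal illustration, not a tested fix for the asker's file: XmlSanitizer, StripIllegalXmlChars and DeserializePerson are hypothetical names, and the legal ranges are taken from the Char production of the XML 1.0 specification. It silently drops illegal characters, which loses data, so it only makes sense when the input cannot be fixed at its source.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml.Serialization;

public class Person
{
    public string Name;
}

// Hypothetical helper; none of these names come from the original post.
public static class XmlSanitizer
{
    // True if c is allowed by the XML 1.0 Char production (single UTF-16 unit only).
    static bool IsLegalXmlChar(char c)
    {
        return c == 0x9 || c == 0xA || c == 0xD
            || (c >= 0x20 && c <= 0xD7FF)
            || (c >= 0xE000 && c <= 0xFFFD);
    }

    // Returns a copy of text without characters that XML 1.0 forbids.
    public static string StripIllegalXmlChars(string text)
    {
        StringBuilder sb = new StringBuilder(text.Length);
        for (int i = 0; i < text.Length; i++)
        {
            char c = text[i];
            if (char.IsHighSurrogate(c) && i + 1 < text.Length
                && char.IsLowSurrogate(text[i + 1]))
            {
                // A well-formed surrogate pair encodes U+10000..U+10FFFF, which is legal.
                sb.Append(c).Append(text[i + 1]);
                i++;
            }
            else if (IsLegalXmlChar(c))
            {
                sb.Append(c);
            }
            // Anything else (U+FFFE, U+FFFF, lone surrogates, most control chars) is dropped.
        }
        return sb.ToString();
    }

    // Sanitizes the XML text first, then deserializes it as usual.
    public static Person DeserializePerson(string xml)
    {
        XmlSerializer s = new XmlSerializer(typeof(Person));
        using (StringReader reader = new StringReader(StripIllegalXmlChars(xml)))
        {
            return (Person)s.Deserialize(reader);
        }
    }
}
```

With the file from the question, the call would be something like `XmlSanitizer.DeserializePerson(File.ReadAllText("bug.xml", Encoding.UTF8))`. Reading the whole file into a string is the simplest option; for a very large file, a filtering TextReader wrapper would avoid holding everything in memory.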
1

Have you tried DataContractSerializer instead? I ran into an interesting situation when someone copied and pasted some Word or Excel content into my web application: the string contained invalid control characters (such as a vertical tab). To my surprise, it was serialized when sent to a WCF service and read back 100% intact when requested. The pure .NET environment had no problem with this, so I assume that DataContractSerializer can handle such characters (which is, IMHO, a violation of the XML spec).

We had another Java client accessing the same service - it failed when receiving this record...

[Edit after ugly format in my comment below]

Try this:

DataContractSerializer serializer = new DataContractSerializer(typeof(MyType));
using (XmlWriter xmlWriter = new XmlTextWriter(filePath, Encoding.UTF8))
{
  serializer.WriteObject(xmlWriter, instanceOfMyType);
}
MyType result;
using (XmlReader xmlReader = new XmlTextReader(filePath))
{
  // ReadObject returns object, so assign to a variable (not the type name) and cast.
  result = serializer.ReadObject(xmlReader) as MyType;
}

The comment from the second Marc is about DataContractSerializer's habit of emitting XML elements instead of XML attributes:

<AnElement>value</AnElement> 

instead of

<AnElement AnAttribute="value" />
Marc Wittke
  • But I am not using WCF, can I use DataContractSerializer? – George2 Aug 31 '09 at 06:52
  • Marc, what do you mean "data doesn't involve attributes"? Could you show a sample here? – George2 Aug 31 '09 at 11:45
  • Hi Dabblernl, you said "just read the documentation", but I could not find any URL or document title in what you wrote. I would appreciate it if you could recommend a document to read. – George2 Aug 31 '09 at 11:46
  • Try this: DataContractSerializer serializer = new DataContractSerializer(typeof(MyType)); using (XmlWriter xmlWriter = new XmlTextWriter(filePath, Encoding.UTF8)) { serializer.WriteObject(xmlWriter, instanceOfMyType); } using (XmlReader xmlReader = new XmlTextReader(filePath)) { MyType = serializer.ReadObject(xmlReader) as MyType; } The comment of the second Marc is about DataContractSerializers habit to make XmlElements instead of XmlAttributes (value instead of ) – Marc Wittke Sep 14 '09 at 12:37
0

The "invalid characters" look like they might be intended to be encoded Unicode characters. Perhaps the wrong encoding is being used?

Can you ask the originators of this document what character they meant to include at that location? Perhaps ask them how they generated the document?
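
Before asking, it may help to see which code points the suspicious bytes actually decode to. The sketch below is only an illustration: Utf8Probe and Describe are hypothetical names, and it assumes the file really is UTF-8, as its XML declaration states. It decodes a byte sequence and prints each resulting UTF-16 code unit.

```csharp
using System;
using System.Text;

// Hypothetical helper; not part of the original post.
public static class Utf8Probe
{
    // Decodes UTF-8 bytes and formats each UTF-16 code unit as U+XXXX.
    public static string Describe(byte[] bytes)
    {
        string text = Encoding.UTF8.GetString(bytes);
        StringBuilder sb = new StringBuilder();
        foreach (char c in text)
        {
            if (sb.Length > 0) sb.Append(' ');
            sb.AppendFormat("U+{0:X4}", (int)c);
        }
        return sb.ToString();
    }
}
```

For the bytes EF BF BF mentioned in the comments on another answer, `Utf8Probe.Describe(new byte[] { 0xEF, 0xBF, 0xBF })` returns "U+FFFF". That would explain the error message: the bytes are well-formed UTF-8, but the character they encode is outside the set XML 1.0 allows.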

John Saunders