I have an XML file containing invalid hexadecimal characters. I've read this, this and this, along with the other links given there, but couldn't make it work.

I'm using XmlReader. XmlDocument, XDocument and XmlTextReader are not options for me, because there are XML files of more than 500GB in size and 500 million in volume. XmlReader is my best choice because of its forward-only approach, which doesn't load the whole XML into memory. For the same reason, I can't recreate the XML file, or load it, just to replace the invalid characters.

Here's the code that I'm working on:

case XmlNodeType.Element:
if (xmlReader.Name.Equals("ROW"))
{
    DataRow dataRow = xmlDataTable.NewRow();
    XmlReader row = XmlReader.Create(xmlReader.ReadSubtree(),
        new XmlReaderSettings { CheckCharacters = false, ValidationType = ValidationType.None });

    // iterate on elements inside ROW
    // these are the column items
    if (row != null)
    {
        while (row.Read())
        {

            if (row.IsStartElement())
            {

                if (!row.Name.Equals("ROW"))
                {

                    string columnName = row.Name;
                    //row = XmlReader.Create(CleanInvalidXmlChars(row.ReadInnerXml()));

                    row.Read();
                    string value = CleanInvalidXmlChars(row.Value.ToString());

                    // all other logics ...

The exception is thrown at the row.Read(); statement. Here's a sample XML file I'm reading:

<?xml version="1.0" encoding="UTF-8"?>
<MFAINSBRP>
<ROW>
    <INSTITUTION_CODE>828  </INSTITUTION_CODE>
    <BRANCH_CODE>GJ102</BRANCH_CODE>
    <BRANCH_NAME>                                   </BRANCH_NAME>
    <BRANCH_NAME_FRENCH>                                   </BRANCH_NAME_FRENCH>
    <LANGUAGE_CODE>E</LANGUAGE_CODE>
    <ADDR_NO>815412</ADDR_NO>
    <FAX_AREA>0</FAX_AREA>
    <FAX_PHONE>0</FAX_PHONE>
    <AREA_CODE>0</AREA_CODE>
    <PHONE_NO>0</PHONE_NO>
    <STATUS>A</STATUS>
    <PHONE_EXT>0</PHONE_EXT>
</ROW>
<!--ALL OTHER RECORDS-->
</MFAINSBRP>

Right now, I'm stuck on making this work.

EDIT:

The sample XML file is the record that makes my code break. I copied and pasted it here from Notepad++, but the invalid characters don't show up in the pasted text. Here's an image of how it looks in Notepad++:

[Image: screenshot of the record as displayed in Notepad++]

I create the xmlReader object with just this simple statement:

using (xmlReader = XmlReader.Create(filePath, new XmlReaderSettings { CheckCharacters = false }))
Dustine Tolete
  • Where in the file is it failing? I'd expect the exception to show you the line/column. – Jon Skeet Oct 09 '15 at 09:42
  • Does this sample XML contain an example of problem input that will break your code? If so, could you highlight it in some way? If not, could you create a sample that *does* exhibit the problem? – Damien_The_Unbeliever Oct 09 '15 at 09:44
  • http://stackoverflow.com/questions/5742543/an-invalid-xml-character-unicode-0xc-was-found might be related. The file is erroneous, so you need to pre-process it and remove the offending characters first. You might be able to do that with an intermediate stream. – jishi Oct 09 '15 at 09:45
  • This should be somewhat related to reading into a database. If so, then wouldn't it be more feasible to use the XML reading capabilities of your backend database. "Invalid hexadecimal characters" - how can an hexadecimal character be invalid. – Cetin Basoz Oct 09 '15 at 09:47
  • @Damien_The_Unbeliever - the sample xml i provided is the one that raises the exception. please see the edit of my question. – Dustine Tolete Oct 09 '15 at 09:50
  • @CetinBasoz - can't do anything about the xml files - it's provided by the business and can't be changed, so we're the one to adjust. – Dustine Tolete Oct 09 '15 at 09:51
  • @JonSkeet - the sample XML in my question is the record that fails. see the edit part of my question. – Dustine Tolete Oct 09 '15 at 09:52
  • The XML file you provided originally is fine - because you've pasted it with spaces instead of anything else. Now that you've shown it in Notepad++ we can see why it's broken... and yes, it's just not valid XML. Where are you getting the XML from? Can you fix that? Where possible, you're much better off making sure you never have invalid data instead of trying to clean it up later. – Jon Skeet Oct 09 '15 at 09:54
  • @JonSkeet - no, we can't fix it. it's coming directly from the business. we have the option of recreating the xml file after cleaning the invalid characters, but it would take really long time and might consume significant machine resources when loading a 500GB xml file (that's just one - we have 200GB to 300GB other files - these are Day 0 files) – Dustine Tolete Oct 09 '15 at 09:56
  • What is "the business"? Whichever business it is, is creating invalid XML. They should fix that. You could try writing a `TextReader` which just skips these characters, but again, this really isn't the right place to fix this. *Any* decent parser that tries to consume this should fail. Have you reported this as a bug to the file producer yet? To put it another way: what would you do if the file was broken in some other way, e.g. it couldn't propagate any non-ASCII characters? – Jon Skeet Oct 09 '15 at 09:58
  • @JonSkeet - by the business, i mean the client. we already brought this to them, and they mentioned that they'll come back to us regarding this issue. but probably, they won't fix this. most of the time we're the ones to adjust, so this is what we're preparing. – Dustine Tolete Oct 09 '15 at 10:02
  • Oh you mean non-printable characters. You could directly replace all those non-printable characters with a space. – Cetin Basoz Oct 09 '15 at 10:02
  • Again, what would you do if they provided you with broken data in another way? How far are you willing to add extra cost (whether that's performance, maintainability or both) to *your* codebase to work around *their* bug? If they provided you with an "XML" file using braces instead of angle brackets, what would you do? Treat this the same way. – Jon Skeet Oct 09 '15 at 10:03
  • @CetinBasoz - "could", but can't. there will be very large files to consume. another process like replacing them will take significant amount of resources for the machine. – Dustine Tolete Oct 09 '15 at 10:04
  • Editing a 500Gb file in place wouldn't take much resources IMHO but should try that. – Cetin Basoz Oct 09 '15 at 10:05
  • @JonSkeet - well, right now, what i'm trying to find is having a work around by not recreating the file. if all else fails, we'll do this. – Dustine Tolete Oct 09 '15 at 10:05
  • In order to work out a feasible workaround, we'll need to know how you're creating the original `xmlReader`, btw. – Jon Skeet Oct 09 '15 at 10:06
  • @CetinBasoz - it's not just one 500GB file, we have other very large files. and that's for Day 0 events. – Dustine Tolete Oct 09 '15 at 10:06
  • @JonSkeet - i edited the question, i provided how the `xmlReader` instance is created. – Dustine Tolete Oct 09 '15 at 10:08
  • People think they're so smart, using XML as a data interchange standard so everyone gets more value from the information. The first thing to do is to stop calling it XML. Make it very clear whenever discussing this data that it is non-XML data in a proprietary format that is very expensive to process. Slowly, they will start to realise they are being stupid. – Michael Kay Oct 09 '15 at 14:04

1 Answer

It's unclear to me why CheckCharacters = false isn't fixing the problem for you, and as I've mentioned the far, far better fix is to get the data in a clean fashion to start with.
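One possible explanation, offered as an assumption based on the documentation remarks for XmlReaderSettings.CheckCharacters rather than anything established in this thread: when the reader is parsing text, that setting appears to relax checking only for character entity references such as &#x1;, not for raw characters in the content. The sketch below lets you compare the two cases on your own runtime. The class and method names are made up for illustration.

```csharp
using System;
using System.IO;
using System.Xml;

static class CheckCharactersDemo
{
    // Returns true if the whole document can be read without an XmlException,
    // using a reader created with CheckCharacters = false.
    public static bool CanParse(string xml)
    {
        var settings = new XmlReaderSettings { CheckCharacters = false };
        try
        {
            using (var reader = XmlReader.Create(new StringReader(xml), settings))
            {
                while (reader.Read()) { }
            }
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    static void Main()
    {
        // Raw control character U+0001 directly in the text content.
        Console.WriteLine("raw character parses:     " + CanParse("<foo>\u0001</foo>"));
        // The same character written as a character entity reference.
        Console.WriteLine("entity reference parses:  " + CanParse("<foo>&#x1;</foo>"));
    }
}
```

If the raw case fails while the entity case succeeds, the setting is behaving as documented, and the characters have to be dealt with before the XmlReader sees them.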

However, you can work around this by replacing each invalid character with a replacement in the TextReader that the XmlReader uses. Here's a short but complete example:

using System;
using System.IO;
using System.Xml;

class Test
{
    static void Main()
    {
        var text = "<foo>\0</foo>";
        var reader = XmlReader.Create(
             new XmlReplacingReader(new StringReader(text), ' '));
        while (reader.Read())
        {
            Console.WriteLine(reader.NodeType);
        }
    }
}

public sealed class XmlReplacingReader : TextReader
{
    private readonly TextReader original;
    private readonly char replacementChar;

    public XmlReplacingReader(TextReader original, char replacementChar)
    {
        this.original = original;
        this.replacementChar = replacementChar;
    }

    public override int Peek()
    {
        int ret = original.Peek();
        return MaybeReplace(ret);
    }

    public override int Read()
    {
        int ret = original.Read();
        return MaybeReplace(ret);
    }

    public override int Read(char[] buffer, int index, int count)
    {
        int ret = original.Read(buffer, index, count);
        for (int i = 0; i < ret; i++)
        {
            buffer[i + index] = MaybeReplace(buffer[i + index]);
        }
        return ret;
    }

    protected override void Dispose(bool disposing)
    {
        if (disposing)
        {
            original.Dispose();
        }
    }

    public override void Close()
    {
        original.Close();
    }

    private int MaybeReplace(int x)
    {
        return x < 0 ? x : MaybeReplace((char) x);
    }

    private char MaybeReplace(char c)
    {
        return (c >= ' ' || c == '\r' || c == '\n' || c == '\t') ? c : replacementChar;
    }
}

This relies on you being able to create a TextReader for the file, of course - which you can do with File.OpenText if the file is UTF-8 (as the XML declaration in your sample indicates). If you need to handle other encodings, you may need a more cunning solution, but this should get you started.
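For a file in a known encoding other than UTF-8, one option is to build the TextReader with a StreamReader and an explicit Encoding. The following is only a sketch under that assumption: EncodingDemo, OpenText and CountRows are hypothetical names, and the temporary UTF-16 file stands in for the real input. In the real pipeline the returned TextReader would be wrapped in XmlReplacingReader before being handed to XmlReader.Create.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

static class EncodingDemo
{
    // Opens the file as text using an explicitly chosen encoding.
    public static TextReader OpenText(string path, Encoding encoding)
    {
        return new StreamReader(path, encoding);
    }

    // Streams through the document and counts <ROW> elements without loading it all.
    public static int CountRows(TextReader text)
    {
        int rows = 0;
        using (var reader = XmlReader.Create(text))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "ROW")
                {
                    rows++;
                }
            }
        }
        return rows;
    }

    static void Main()
    {
        string path = Path.GetTempFileName();
        try
        {
            // Stand-in input: a tiny UTF-16 file with two ROW elements.
            File.WriteAllText(path, "<MFAINSBRP><ROW/><ROW/></MFAINSBRP>", Encoding.Unicode);
            using (TextReader text = OpenText(path, Encoding.Unicode))
            {
                Console.WriteLine(CountRows(text));
            }
        }
        finally
        {
            File.Delete(path);
        }
    }
}
```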

Note that this approach replaces the invalid characters. If you want to remove them instead, it becomes harder and probably less efficient, as the bulk Read method would need to find out whether or not it needs to remove characters, do the removal, and then return a different value. The code would be a lot trickier - I'm hoping you don't need it.
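For illustration only, here is a rough sketch of what a removing variant might look like. It is an untuned assumption about one way to do it, not a tested production implementation. XmlStrippingReader is a hypothetical name, and it uses the same "allowed character" test as the replacing reader above. The bulk Read compacts each chunk in place and reads again if a whole chunk was stripped, because returning 0 would signal end of input.

```csharp
using System;
using System.IO;
using System.Xml;

public sealed class XmlStrippingReader : TextReader
{
    private readonly TextReader original;

    public XmlStrippingReader(TextReader original)
    {
        this.original = original;
    }

    // Same rule as the replacing reader: keep everything from space upwards, plus CR, LF and tab.
    private static bool IsAllowed(char c)
    {
        return c >= ' ' || c == '\r' || c == '\n' || c == '\t';
    }

    public override int Peek()
    {
        // Discard disallowed characters until the next one is allowed (or input ends).
        int next;
        while ((next = original.Peek()) >= 0 && !IsAllowed((char) next))
        {
            original.Read();
        }
        return next;
    }

    public override int Read()
    {
        int next;
        do
        {
            next = original.Read();
        } while (next >= 0 && !IsAllowed((char) next));
        return next;
    }

    public override int Read(char[] buffer, int index, int count)
    {
        while (true)
        {
            int read = original.Read(buffer, index, count);
            if (read <= 0)
            {
                return 0; // end of input (or count was 0)
            }
            // Compact the chunk in place, keeping only allowed characters.
            int kept = 0;
            for (int i = 0; i < read; i++)
            {
                char c = buffer[index + i];
                if (IsAllowed(c))
                {
                    buffer[index + kept++] = c;
                }
            }
            if (kept > 0)
            {
                return kept;
            }
            // The whole chunk was stripped: read again instead of returning 0.
        }
    }

    protected override void Dispose(bool disposing)
    {
        if (disposing)
        {
            original.Dispose();
        }
        base.Dispose(disposing);
    }
}

class StrippingDemo
{
    static void Main()
    {
        var text = "<foo>a\0b</foo>";
        using (var reader = XmlReader.Create(new XmlStrippingReader(new StringReader(text))))
        {
            while (reader.Read())
            {
                Console.WriteLine(reader.NodeType + " " + reader.Value);
            }
        }
    }
}
```

The extra pass over each chunk is the cost mentioned above. Whether it matters for files of this size would need measuring.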

Jon Skeet
  • yes, been also wondering why `CheckCharacters` is not working for this case. i've done many other scenarios where it functions right, but right now, i don't even know what i did wrong. been looking at your code snippet for a while now, will try to implement this and update anything. – Dustine Tolete Oct 09 '15 at 11:34