I'm trying to read a file (not XML, but the structure is similar), but I'm getting this exception:

'┴', hexadecimal value 0x15, is an invalid character. Line 8, position 7.

The file has a lot of these symbols, and I can't replace them because, for my purposes, I can't modify the content of the file.

This is the code:

try
{
    XDocument doc = new XDocument(new XDeclaration("1.0", "utf-16", "yes"));
    doc = XDocument.Load(arquivo);
}
catch (Exception e)
{
    MessageBox.Show(e.Message.ToString());
}

And this is part of the file:

<Codepage>UTF16</Codepage>
<Segment>0000016125
    <Control>0003┴300000┴English(U.S.)PORTUGUESE┴┴bla.000┴webgui\messages\xsl\en\blabla\blabla.xlf
    </Control>
    <Source>To blablablah the   firewall to blablablah local IP address.    </Source>
    <Target>Para blablablah a uma blablablah local específico.  </Target>
</Segment>

Note: The file doesn't have an XML declaration specifying the encoding.

John Saunders
Aly
  • Why are you trying to read non-XML file using XML parser? – MarcinJuraszek Jan 31 '14 at 19:39
  • Hi @MarcinJuraszek , because of this: http://stackoverflow.com/questions/21465568/c-sharp-alternative-to-readline , and the structure is so similar, Thanks – Aly Jan 31 '14 at 19:41
  • The structure may be similar, but the one that works is valid XML, and the one that doesn't work is not valid XML. – John Saunders Jan 31 '14 at 20:04

1 Answer

This XML is pretty bad:

  1. You have <Segment>0000016125 in there which, while not technically illegal (it is a Text node), is just kind of odd.
  2. Your <Control> element contains invalid characters that are not wrapped in an XML CDATA section.

You can normalize the XML manually, or do it in C# via string manipulation, a regex, or something similar.

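In your simple example, only the <Control> element has invalid characters, so it is relatively simple to wrap its content in a CDATA section using the string.Replace() method. Note that CDATA only protects markup characters such as < and &; a raw control character such as 0x15 is not legal in XML 1.0 even inside CDATA, so if the file really contains that character it has to be replaced or removed as well. The wrapped element looks like this: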

<Control><![CDATA[0003┴300000┴English(U.S.)PORTUGUESE┴┴bla.000┴webgui\messages\xsl\en\blabla\blabla.xlf]]></Control>

Then you can load the good XML into your XDocument using XDocument.Parse(string xml):

string badXml = @"
    <temproot>
        <Codepage>UTF16</Codepage>
        <Segment>0000016125
            <Control>0003┴300000┴English(U.S.)PORTUGUESE┴┴bla.000┴webgui\messages\xsl\en\blabla\blabla.xlf</Control>
            <Source>To blablablah the   firewall to blablablah local IP address.    </Source>
            <Target>Para blablablah a uma blablablah local específico.  </Target>
        </Segment>
    </temproot>";

// assuming only the <Control> element has the invalid characters
string goodXml = badXml
    .Replace("<Control>", "<Control><![CDATA[")
    .Replace("</Control>", "]]></Control>");

XDocument xDoc = XDocument.Parse(goodXml);
xDoc.Declaration = new XDeclaration("1.0", "utf-16", "yes");

// do stuff with xDoc
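
As a complement to the CDATA approach above, here is a minimal sketch for the case where the file really contains raw control characters such as 0x15 (which the parser rejects even inside CDATA). It replaces them in memory, so the file on disk stays untouched, and then parses the result. The class name LenientXml, both method names, and the '|' placeholder are hypothetical choices for illustration only; the temproot wrapper follows the example above.

```csharp
using System;
using System.Text;
using System.Xml.Linq;

// Hypothetical helper, not part of any library.
public static class LenientXml
{
    // Replaces the control characters that XML 1.0 forbids (everything below
    // U+0020 except tab, line feed and carriage return) with a placeholder.
    // The placeholder must not be a markup character such as '<' or '&'.
    public static string ReplaceInvalidControlChars(string text, char placeholder = '|')
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            bool forbidden = c < 0x20 && c != '\t' && c != '\n' && c != '\r';
            sb.Append(forbidden ? placeholder : c);
        }
        return sb.ToString();
    }

    // Cleans the raw text and wraps it in a temporary root element, because
    // the sample file has several top-level elements and no single root.
    public static XDocument Parse(string rawText)
    {
        string cleaned = ReplaceInvalidControlChars(rawText);
        return XDocument.Parse("<temproot>" + cleaned + "</temproot>");
    }
}
```

For the file from the question, something like `XDocument doc = LenientXml.Parse(File.ReadAllText(arquivo, Encoding.Unicode));` might work, assuming the file is UTF-16 little-endian as its <Codepage> element suggests. Any unescaped < or & inside the text would still need separate handling.
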
Dmitriy Khaykin