
I want to parse an XML file and print the characters it contains to the console or a WinForms control. The file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<kanjidic2>
<header>
  <file_version>4</file_version>
  <database_version>2015-093</database_version>
  <date_of_creation>2015-04-03</date_of_creation>
</header>
<character>
  <literal>亜</literal>
  <codepoint>
    <cp_value cp_type="ucs">4e9c</cp_value>
    <cp_value cp_type="jis208">16-01</cp_value>
  </codepoint>
</character>
<character>
  <literal>唖</literal>
  <codepoint>
    <cp_value cp_type="ucs">5516</cp_value>
    <cp_value cp_type="jis208">16-2</cp_value>
  </codepoint>
</character>

...
</kanjidic2>

The character inside each literal tag is what I want to print. According to the provider, the file is encoded in UTF-8. I used this code to parse it and print it to the console:

using System;
using System.Collections.Generic;
using System.Text;
using System.Xml;

class Program
{
    static void Main(string[] args)
    {
        Console.OutputEncoding = Encoding.UTF8;

        foreach (Kanji kanji in Parse())
        {
            Console.WriteLine(kanji.Character);
        }

        Console.ReadKey();
    }

    private static IEnumerable<Kanji> Parse()
    {
        var doc = new XmlDocument();
        doc.Load("kanjidic2.xml");

        XmlNodeList nodes = doc.DocumentElement.SelectNodes("/kanjidic2/character");

        foreach (XmlNode node in nodes)
        {
            yield return new Kanji { Character = node.SelectSingleNode("literal").InnerText };
        }
    }
}

public class Kanji
{
    public string Character { get; set; }
}

When I ran the program, it printed characters, but not the ones I see in the literal elements (and I doubt anyone could read them). When I changed the console output encoding to Unicode, the characters printed correctly.

My question is: why doesn't the console print the characters correctly when I set the output encoding to UTF-8?

Is it because the characters are read as UTF-8 and then stored in memory as Unicode (which means UTF-16 in .NET)? If so, why can't they be converted back to UTF-8, given that I set the output encoding to UTF-8 at the start?

witoong623
  • Does the XML file declare an encoding at the top? Example: `` If not, does anything happen when you add it? You have not specified an encoding for loading the XML file, so the encodings probably do not match even though you've set `Console.OutputEncoding`. – Ryan Apr 04 '15 at 17:56
  • @Ryan Yes, it does. I thought it was so common for an XML file to declare its encoding at the top that I left it out; I will edit the question to add it :) Can you tell me how to specify the encoding when loading an XML file through XmlDocument.Load? I've looked for it but couldn't find how to do that. – witoong623 Apr 04 '15 at 18:02

2 Answers


Try encoding the XML string as UTF-8 bytes and loading the document from a stream over those bytes:

byte[] encodedString = Encoding.UTF8.GetBytes(xmlString);
using (MemoryStream ms = new MemoryStream(encodedString))
{
    ms.Flush();
    ms.Position = 0;
    XmlDocument xmlDoc = new XmlDocument();
    xmlDoc.Load(ms);
}

If you have a file instead of an XML string, first read it as a regular file like this:

var xmlString = File.ReadAllText(FilePath, Encoding.Default);
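
A minimal sketch of one way to force the encoding when loading from a file, assuming the file really is UTF-8: open it with a `StreamReader` that is given `Encoding.UTF8` and pass that reader to `XmlDocument.Load(TextReader)`. The names `KanjiLoader`, `ReadLiterals` and `ReadLiteralsFromFile` are made up for illustration and are not part of the original code. Note that `XmlDocument.Load(string)` normally honours the `encoding` attribute in the XML declaration on its own, so this is only needed when you want to override or bypass that detection.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Xml;

// Hypothetical helper, not from the original post.
public static class KanjiLoader
{
    // Parses XML from an already-decoded text source and returns
    // the text of every /kanjidic2/character/literal element.
    public static List<string> ReadLiterals(TextReader reader)
    {
        var doc = new XmlDocument();
        doc.Load(reader); // bytes were already decoded by the reader

        var literals = new List<string>();
        foreach (XmlNode node in doc.SelectNodes("/kanjidic2/character/literal"))
        {
            literals.Add(node.InnerText);
        }
        return literals;
    }

    // Opens the file with an explicit UTF-8 decoder (assumption: the
    // file is UTF-8, as the provider states) and parses it.
    public static List<string> ReadLiteralsFromFile(string path)
    {
        using (var reader = new StreamReader(path, Encoding.UTF8))
        {
            return ReadLiterals(reader);
        }
    }
}
```

Calling `KanjiLoader.ReadLiteralsFromFile("kanjidic2.xml")` should then return the literals as ordinary .NET strings. Whether they display correctly afterwards depends on the console or control that renders them, not on this loading step.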
Coder1409
  • Is `xmlString` the string read by StreamReader.ReadToEnd? I'll try it tomorrow :) – witoong623 Apr 04 '15 at 18:10
  • See the updated answer: load the string from the file as it is, with the default encoding. – Coder1409 Apr 04 '15 at 18:12
  • I've tested it using XmlDocument.LoadXml; unfortunately, I got an `XmlException`(https://msdn.microsoft.com/en-us/library/set3a0zx(v=vs.90).aspx) even though the file is totally valid :( Anyway, thanks. – witoong623 Apr 05 '15 at 02:07
  • @witoong623 If it raised an `XmlException`, then the file probably isn't valid, but you need to check the exception details and the code. Do not assume the file is valid. Many characters need an escape sequence, and there may be other issues with the file. – Ryan Apr 06 '15 at 01:09
  • @Ryan Hi, can you try running the code that throws the `XmlException`? This is actually the first time I've worked with an XML file. Going by the exception message, I tried removing the nodes I thought were causing the problem, but ended up with nothing left in the file. Anyway, I can parse it using `XmlDocument.Load` instead of `XmlDocument.LoadXml`, which throws the exception. Here are the code and the XML file, if you please: [code](https://gist.github.com/witoong623/6ed4d355b0b4fc18b6f3), [xml](http://www.csse.monash.edu.au/~jwb/kanjidic2/kanjidic2.xml.gz). – witoong623 Apr 06 '15 at 16:44
  • @witoong623 Try putting try/catch around the doc.LoadXml() method and checking the exception. I was able to load the xml file without an issue, but the file is large. – Ryan Apr 07 '15 at 13:18

There are several potential problems you may encounter here.

  1. The Console has issues displaying other character sets, such as Kanji, without additional effort or code. You can try changing the Console font to a TrueType font such as Consolas or Courier New. Or for UTF-32, look at the code samples here.
  2. Your XML file is UTF-8 without a BOM, and if this is static (will not change), then you are probably better off specifying the encoding in your code. Your gist uses Encoding.Default, but when I changed it to Encoding.UTF8 the Kanji string was correct. I looked at methods for detecting the encoding, but you need to decide whether your XML file will ever change encoding.
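  3. I looked at the first <literal>亜</literal> in a hex editor and it was E4 BA 9C, but when I pasted the character into Visual Studio, it was just E4 9C. E4 BA 9C is the three-byte UTF-8 encoding of U+4E9C, which matches the ucs value 4e9c in the file, so BA is a UTF-8 continuation byte rather than a separate combining character. If you decode those bytes with the wrong encoding, you may see mojibake such as äºœ (the same bytes read as Windows-1252) instead of 亜. If you are not using a TTF font, you will see crazy characters. Even using Consolas on my system, the string displayed a boxed question mark, but when I copied and pasted it, it was the correct character.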
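
To check the byte values from point 3 yourself, here is a small sketch that encodes a string with a chosen encoding and shows the bytes as hex, and decodes raw bytes with a chosen encoding. The class and method names (`EncodingDemo`, `Hex`, `DecodeAs`) are invented for this illustration. It uses the `\u4e9c` escape for 亜 so the result does not depend on how the source file itself is saved.

```csharp
using System;
using System.Text;

// Hypothetical helper for inspecting how a string maps to bytes.
public static class EncodingDemo
{
    // Encodes text with the given encoding and formats the bytes
    // as hyphen-separated hex, e.g. "E4-BA-9C".
    public static string Hex(string text, Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    // Decodes raw bytes with the given encoding. Passing an encoding
    // other than the one the bytes were written in produces mojibake.
    public static string DecodeAs(byte[] bytes, Encoding encoding)
    {
        return encoding.GetString(bytes);
    }
}
```

`EncodingDemo.Hex("\u4e9c", Encoding.UTF8)` gives `E4-BA-9C`, the bytes seen in the hex editor, while `EncodingDemo.Hex("\u4e9c", Encoding.BigEndianUnicode)` gives `4E-9C`, the single UTF-16 code unit that .NET holds in memory. Decoding the three UTF-8 bytes with `Encoding.UTF8` round-trips to the original character; decoding them with a single-byte encoding yields three unrelated characters.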
Ryan