I want to parse an XML file and print its characters to the console or a WinForms application. The file looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<kanjidic2>
  <header>
    <file_version>4</file_version>
    <database_version>2015-093</database_version>
    <date_of_creation>2015-04-03</date_of_creation>
  </header>
  <character>
    <literal>亜</literal>
    <codepoint>
      <cp_value cp_type="ucs">4e9c</cp_value>
      <cp_value cp_type="jis208">16-01</cp_value>
    </codepoint>
  </character>
  <character>
    <literal>唖</literal>
    <codepoint>
      <cp_value cp_type="ucs">5516</cp_value>
      <cp_value cp_type="jis208">16-2</cp_value>
    </codepoint>
  </character>
  ...
</kanjidic2>
The character inside the literal tag is what I want to print. According to the provider, the file is encoded in UTF-8.
I used this code to parse the file and print the characters to the console:
using System;
using System.Collections.Generic;
using System.Text;
using System.Xml;

class Program
{
    static void Main(string[] args)
    {
        Console.OutputEncoding = Encoding.UTF8;
        foreach (Kanji kanji in Parse())
        {
            Console.WriteLine(kanji.Character);
        }
        Console.ReadKey();
    }

    // Yields one Kanji for each <character> element in the file.
    private static IEnumerable<Kanji> Parse()
    {
        var doc = new XmlDocument();
        doc.Load("kanjidic2.xml");
        XmlNodeList nodes = doc.DocumentElement.SelectNodes("/kanjidic2/character");
        foreach (XmlNode node in nodes)
        {
            yield return new Kanji { Character = node.SelectSingleNode("literal").InnerText };
        }
    }
}

public class Kanji
{
    public string Character { get; set; }
}
When I ran the program, it started printing characters, but they were not the characters I see in the literal tags (and I don't think anyone could read them).
I then tried changing the console output encoding to Unicode, and this time the characters printed properly.
My question is: why doesn't the console print the characters properly when I set the output encoding to UTF-8?
Is it because the characters are read from a UTF-8 file and stored in memory as Unicode (which means UTF-16 in .NET)? If so, why can't they be converted back to UTF-8, as I set at the start?