-1

I have a folder with inside an XML file like this:

<?xml version="1.0" encoding="UTF-8"?>
<cities>
   <result>
      <city_id>-3870534</city_id>
      <country>mx</country>
      <name>Santa Bárbara</name>
      <nr_hotels>0</nr_hotels>
      <translations>
         <language>en-gb</language>
         <name>Santa Bárbara</name>
      </translations>
      <translations>
         <language>ru</language>
         <name>Санта-Барбара</name>
      </translations>
   </result>
</cities>
<!-- RUID: [UmFuZG9tSVYkc2RlIyh9YcxtmfhRwqry58sgWYNIgEV1AjdsVswrKUorBoUlR6ylFgiaj5XJ0w0DP0lL/htWqOKtE33w1EhBbLABKokIfEo=] -->

The file looks well formatted, in utf8, it contains Russian terms and symbols like "á" in Santa Bárbara.
I should read this file and create a record in a MySql DB (through C#), but I'm facing encoding problems.

PS: the DB table has a few columns (to store city id, country and city translations), all text fields, utf8_general_ci.

I'm trying the following code to read the files (just one in this case) in a folder

foreach (string file in Directory.EnumerateFiles("C:\xml_folder\"" + sub_folder, "*.xml")) {
    Console.WriteLine(file);

    string response = File.ReadAllText(file, Encoding.GetEncoding("Windows-1252"));

    Console.WriteLine(response);

    var document = XDocument.Parse(response);

    foreach (var child in document.Root.Elements("result")) {
         //... code here
 
        String name_it = "";
        String name_en = "";
        String name_es = "";
        String name_fr = "";
        String name_de = "";
        String name_ru = "";

        foreach (var translationsChild in child.Elements("translations"))
        {
            switch (translationsChild.Element("language").Value)
            {
                case "it":
                    bytes = Encoding.Default.GetBytes(translationsChild.Element("name").Value);
                    name_it = Encoding.UTF8.GetString(bytes);
                    break;
                case "en-gb":
                    bytes = Encoding.Default.GetBytes(translationsChild.Element("name").Value);
                    name_en = Encoding.UTF8.GetString(bytes);
                    break;
                case "es":
                    bytes = Encoding.Default.GetBytes(translationsChild.Element("name").Value);
                    name_es = Encoding.UTF8.GetString(bytes);
                    break;
                case "fr":
                    bytes = Encoding.Default.GetBytes(translationsChild.Element("name").Value);
                    name_fr = Encoding.UTF8.GetString(bytes);
                    break;
                case "de":
                    bytes = Encoding.Default.GetBytes(translationsChild.Element("name").Value);
                    name_de = Encoding.UTF8.GetString(bytes);
                    break;
                case "ru":
                    bytes = Encoding.Default.GetBytes(translationsChild.Element("name").Value);
                    name_ru = Encoding.UTF8.GetString(bytes);
                    Console.WriteLine(name_ru);
                    break;
            }
        }

In a few words, I get the file, than I convert it in XML to read all children and save it into the DB.

The problem seems related to the way (encoding) I'm getting the string from the file, I tried conversion in Windows-1252.

string response = File.ReadAllText(file, Encoding.GetEncoding("Windows-1252"));

I even tried conversion in utf8

string response = File.ReadAllText(file, System.Text.Encoding.UTF8);

but every time I get (in the debug console and in the DB), this:

Santa Bárbara -\> Santa B?rbara
Санта-Барбара -\> ?????-??????

It looks like a problem related to the way File.ReadAllText(...) works, encoding is not working at all.

PS: to store data into the DB I use a DML like this:

cmd.CommandText = "INSERT INTO cities (city_id,country,name,nr_hotels,name_it,name_en,name_es,name_fr,name_de,name_ru,last_modified_date) VALUES(@city_id,@country,@name,@nr_hotels,@name_it,@name_en,@name_es,@name_fr,@name_de,@name_ru,@last_modified_date) on duplicate key update city_id=@city_id,country=@country,name=@name,nr_hotels=@nr_hotels,name_it=@name_it,name_en=@name_en,name_es=@name_es,name_fr=@name_fr,name_de=@name_de,name_ru=@name_ru,last_modified_date=@last_modified_date";

Please, can you help me?
thanks in advance

Shadow
  • 33,525
  • 10
  • 51
  • 64
  • 2
    Take a moment to read through the [editing help](/editing-help) in the help center. Formatting on Stack Overflow is different than on other sites. The better your post looks, the easier it is for others to read and understand it. – gunr2171 Apr 10 '23 at 18:16
  • Also re-read the [mre] guidance on posting code - there is far too much code posted. Note that code reads utf-8 file as "Windows-1252" which makes no sense. While minimizing amount of code make sure encodings used make sense. – Alexei Levenkov Apr 10 '23 at 18:19
  • 1
    `File.ReadAllText(file, Encoding.GetEncoding("Windows-1252"))` does not "convert" the text to Windows-1252, but reads the text from the file assuming it is encoded in Windows-1252. – Klaus Gütter Apr 10 '23 at 18:19
  • Hi, I tried "File.ReadAllText" with Windows-1252 and utf8, but in both case I got: Санта-Барбара -> ?????-?????? (even though the initial file seems fine, in utf8, could it be a mysql problem? I used utf8_general_ci) – user21611183 Apr 10 '23 at 18:29
  • `I get (in the debug console` - have you [set the console to output Unicode](https://stackoverflow.com/q/5750203/11683)? `and in the DB)` - have you set up your charset in mysql? You should remove all your `GetBytes`/`GetString` (they do nothing other than destroy the string even further) and specify in `File.ReadAllText` the encoding in which your XML file is actually saved (which is hopefully, but not necessarily, UTF-8). – GSerg Apr 10 '23 at 18:29
  • the DB charset I'm using is **general_utf8_ci**, hoping this was right. I removed all getBytes/GetString, leaving the initial: string response = File.ReadAllText(file, System.Text.Encoding.UTF8); Nothing changed unfortunately – user21611183 Apr 10 '23 at 18:43
  • yes!! now the problem is fixed, thank to you all!! I'm super happy :) – user21611183 Apr 10 '23 at 19:50

1 Answers1

0

I don't see any sense in converting to a byte array and back. This works properly for me

    string response = File.ReadAllText(file, Encoding.UTF8);
    var document = XDocument.Parse(response);

    foreach (var child in document.Root.Elements("result"))
    {
        //... code here

        String name_en = "";
        String name_ru = "";


        foreach (var translationsChild in child.Elements("translations"))
        {
            var name = translationsChild.Element("name").Value;
            Console.WriteLine(name);
            switch (translationsChild.Element("language").Value)
            {
                case "en-gb":
                    name_en = name;
                    break;

                case "ru":
                    name_ru = name;
                    break;
            }
        }
    }

output

Santa Bárbara
Санта-Барбара
Serge
  • 40,935
  • 4
  • 18
  • 45
  • I can see "Santa Bárbara" well written now, but "Санта-Барбара" is still "?????-???????" in debug console and in DB as well – user21611183 Apr 10 '23 at 19:22
  • @user21611183 It doesn't make any sense for me at all. If you can see them normaly in XML you WILL certainly see them normaly after parsing. Your visualizer has a wrong font that doesnt support russian letters, try to change the font. – Serge Apr 10 '23 at 19:35
  • Now it works! I added the charset in the DB connection string -->"Server=myServerAddress;Database=myDataBase;Uid=myUsername;Pwd=myPassword; CharSet=utf8;" thanks for your help :) – user21611183 Apr 10 '23 at 19:49