I've tried many methods to extract some strings out of a JSON file using LitJson in Unity.

I've tried encoding conversions all over the place, tried getting byte arrays and passing them around, and nothing seems to work.

I went to the very start of where I create the JsonData object and tried to run the following test:

public JsonData CreateJSONDataObject()
{
    Debug.Assert(pathName != null, "No JSON Data path name set. Please set before commencing read.");
    string jsonString = File.ReadAllText(Application.dataPath + pathName, System.Text.Encoding.UTF8);        
    JsonData jsonDataObject = JsonMapper.ToObject(jsonString);
    Debug.Log("Test compatibility: ë | " + jsonDataObject["Roots"][2]["name"]);
    return jsonDataObject;
}

I made sure my jsonString is read as UTF-8; however, the output shows this:

Test compatibility: ë | W�den

I've tried many other methods, but since this test specifies the encoding right where the JsonData object is created, I can't think of what I am doing wrong. I just don't know enough about JSON.

Thank you in advance.

Voidjumper
  • Did you verify that `jsonString` contains the `ë` character as expected? – dbc Jul 28 '17 at 23:08
  • What happens when you use Unity's built-in [JsonUtility](https://stackoverflow.com/questions/36239705/serialize-and-deserialize-json-and-json-array-in-unity/36244111#36244111) to serialize and deserialize the data? Is this problem still there? – Programmer Jul 28 '17 at 23:42
  • @dbc In this case I used index 2 which I know is the string "Wöden." However it outputted as W�den. Happens when I choose a string containing "ë" as well. – Voidjumper Jul 29 '17 at 02:25
  • @Programmer I have not used it yet. I will have to set up a test case tomorrow morning and see how it works and see if I can migrate across to that. Thanks for the suggestion. – Voidjumper Jul 29 '17 at 02:26
  • Np. You let us know how it goes. – Programmer Jul 29 '17 at 02:28

1 Answer

This type of problem occurs when a text file is written with one encoding and read using a different one. I was able to reproduce your problem with the following program, which removes the JSON serialization from the equation entirely:

string file = @"c:\temp\test.txt";
string text = "Wöden";
File.WriteAllText(file, text, Encoding.Default);
string text2 = File.ReadAllText(file, Encoding.UTF8);
Debug.WriteLine(text2);
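
To make the mechanism concrete, here is a minimal sketch that does the same thing in memory, without touching the disk. It assumes ISO-8859-1 as a stand-in for the legacy code page (it encodes `ö` as the same single byte, 0xF6, as Windows-1252 does), and the class and method names are hypothetical, chosen only for illustration:

```csharp
using System;
using System.Text;

static class EncodingMismatchDemo
{
    // Assumption: ISO-8859-1 stands in for the legacy "ANSI" code page.
    static readonly Encoding Legacy = Encoding.GetEncoding("iso-8859-1");

    // Encode text the way a legacy-encoded file would store it.
    public static byte[] WriteLegacy(string text)
    {
        return Legacy.GetBytes(text);
    }

    // Decode the bytes as UTF-8 (the mismatched read).
    public static string ReadAsUtf8(byte[] bytes)
    {
        return Encoding.UTF8.GetString(bytes);
    }

    // Decode the bytes with the encoding they were written in.
    public static string ReadAsLegacy(byte[] bytes)
    {
        return Legacy.GetString(bytes);
    }

    public static void Main()
    {
        // "W\u00F6den" is "Wöden"; the escape keeps this source file ASCII-only.
        byte[] bytes = WriteLegacy("W\u00F6den");
        Console.WriteLine(BitConverter.ToString(bytes)); // 57-F6-64-65-6E

        // 0xF6 on its own is not valid UTF-8, so the decoder substitutes U+FFFD.
        Console.WriteLine(ReadAsUtf8(bytes).Contains("\uFFFD")); // True

        // Reading with the original encoding round-trips the text.
        Console.WriteLine(ReadAsLegacy(bytes) == "W\u00F6den"); // True
    }
}
```

The lone 0xF6 byte is what turns into the `�` replacement character in the question's output.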

Since you are reading with UTF-8 and it is not working, the real question is, what encoding was used to write the file originally? You should be using the same encoding to read it back. I suspect that the file was originally created using either Windows-1252 or iso-8859-1 instead of UTF-8. Try using one of those when you read the file, e.g.:

string jsonString = File.ReadAllText(Application.dataPath + pathName,
                                     Encoding.GetEncoding("Windows-1252"));

You said in the comments that your JSON file was not created programmatically, but was "written by hand", meaning you used Notepad or some other text editor to make the file. If that is so, then that explains how you got into this situation. When you save the file, you should have the option to choose an encoding. For Notepad at least, the default encoding is "ANSI", which most likely maps to Windows-1252 (Western European), but depends on your locale. If you are in the Baltic region, for example, it would be Windows-1257 (Baltic). In any case, "ANSI" is not UTF-8. If you want to save the file in UTF-8 encoding, you have to specifically choose that option. Whatever option you use to save the file, that is the encoding you need to use to read it the next time, whether it is with a text editor or with code. Using the wrong encoding to read the file is what causes the corruption.

To change the encoding of a file, you first have to read it in using the same encoding that it was saved in originally, and then you can write it back out using a different encoding. You can do that with your text editor, simply by re-saving the file with a different encoding, or you can do that programmatically:

string text = File.ReadAllText(file, originalEncoding);
File.WriteAllText(file, text, newEncoding); 

The key is knowing which encoding was used originally, and therein lies the rub. For legacy encodings (such as Windows-12xx) there is no reliable way to tell, because there is no marker in the file which identifies it. Unicode encodings (e.g. UTF-8, UTF-16), on the other hand, can write out a marker at the beginning of the file, called a BOM, or byte-order mark, which can be detected programmatically when it is present (it is optional, and many UTF-8 files omit it). That, coupled with the fact that Unicode encodings can represent all characters, is why they are much preferred over legacy encodings.
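
A rough sketch of such a BOM check, assuming you have the file's first few bytes in hand (for example from `File.ReadAllBytes`). `BomSniffer` and `DetectBom` are hypothetical names, not part of any library:

```csharp
static class BomSniffer
{
    // Returns the name of the Unicode encoding indicated by a leading
    // byte-order mark, or null if the data does not start with a known BOM.
    public static string DetectBom(byte[] b)
    {
        if (b.Length >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
            return "UTF-8";
        // The UTF-32 LE BOM begins with the UTF-16 LE BOM, so test it first.
        if (b.Length >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00)
            return "UTF-32 LE";
        if (b.Length >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF)
            return "UTF-32 BE";
        if (b.Length >= 2 && b[0] == 0xFF && b[1] == 0xFE)
            return "UTF-16 LE";
        if (b.Length >= 2 && b[0] == 0xFE && b[1] == 0xFF)
            return "UTF-16 BE";
        return null;
    }
}
```

A `null` result only means that no BOM was found. The file could still be UTF-8 saved without one, or any legacy encoding, so this narrows the possibilities rather than settling them.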

For more information, I highly recommend reading What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.

Brian Rogers
  • I tried using `string jsonString = File.ReadAllText(Application.dataPath + pathName, Encoding.GetEncoding("Windows-1252"));` and then `Debug.Log()`'ed it. It seems to show umlaut characters fine in the console. I then tried it with `iso-8859-1` and it also showed fine in the console. How do I go about changing the encoding? The file is a massive list of names and properties which I am reading in to set to different objects in game. It was written by hand, not given as a string and written through code (for example, using `WriteAllText()`). Thank you for your help so far though. – Voidjumper Jul 29 '17 at 09:15
  • However, even reading that string with "Windows-1252" and then continuing with all the same JSON serialisation works perfectly. While it is good to know more, I think this fulfills my needs for now. Much appreciated. – Voidjumper Jul 29 '17 at 09:19
  • Thank you. I very much appreciate it. I thought it was working fine, using either Windows or iso encoding, I could see all sorts of characters in my textboxes, when viewed in the Editor. I could see them as well in the game view. I even built the project and they're still visible. However, when I give input to pull the next name from the string[] to fill the text box, absolutely nothing is displayed. I have the font imported with Unicode encoding, the file saved as Unicode from Notepad, and am using `File.ReadAllText(Application.dataPath + pathName, System.Text.Encoding.Unicode);` – Voidjumper Jul 30 '17 at 18:33
  • If it was a bit ambiguous, the text boxes are only blank after a button press pulls the next string in the Build of the game. All names are visible in the Editor and the Game view, with the same button presses cycling through the string[]. – Voidjumper Jul 30 '17 at 18:35