
I'm trying to develop an application within Android Studio on Windows 10.

PROBLEM: The following string array of Thai words:

String[] myTHarr = {"มาก","เชี่ยว","แน่","ม่อน","บ้าน","พูด","เลื่อย","เมื่อ","ช่ำ","แร่"};

...when processed by the following for-each loop:

for (String s:myTHarr){
  //s = มา� before executing any of the below code:
  byte[] utf8EncodedThaiArr = s.getBytes("UTF-8"); 
  String utf8EncodedThai = new String(utf8EncodedThaiArr); //setting breakpoint here
  // s is still มาà¸�     (I want it to be มาก)
  //do stuff
}

results in s = มา� when attempting to process the first word (none of the other words work either, but that's expected given the first fails).

The Thai script appears in the string array correctly (the declaration was copied straight from Android Studio), the file encoding is set to UTF-8 for the java file (per here), and the File Encoding Settings look like this (per here):

[Screenshot: File Encodings settings]

sacredfaith
  • You might have a misconception here. In Java, Strings are not encoded in any way (nitpicking: okay, you might call the internal representation UTF-16 or similar), they are just sequences of characters. Encoding a String as a UTF-8 byte[] array and decoding that (using UTF-8) gives exactly the original String, so it's useless. Only byte[] arrays or external files are encoded representations of Strings, in e.g. UTF-8 or ISO-8859-1. If you don't see the Strings from `myTHarr` the way you want, there must be a reason outside this code snippet. – Ralf Kleberhoff Aug 25 '20 at 14:26
  • I took out the portion of the loop where I actually do something with the text since it doesn't matter for the question. It's summarized as '//do stuff'. Fact is, it's broken before I even have a chance to. – sacredfaith Aug 25 '20 at 15:02
  • `//s = มาà¸� before executing any of the below code` That suggests your compiler’s file encoding is not, in fact, UTF-8. Those characters indicate that the compiler treated the UTF-8 bytes of your source file as if they were windows-125x or ISO-8859-x bytes. – VGR Aug 25 '20 at 15:03
  • VGR, you and Ralf seem to be alluding to the same idea. I buy that, I'm just not sure what/where else I need to change things to UTF-8. In the bottom right of the window, I see 'UTF-8' and when I go to Settings > File Encodings both the global and project encodings are set to UTF-8. Any other ideas? – sacredfaith Aug 25 '20 at 15:05
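
To illustrate the two points made in the comments above, here is a minimal sketch (not the OP's code). The class and method names are made up for illustration, and it assumes the runtime provides the windows-1252 charset, which standard desktop JDKs normally do. The word is written with Unicode escapes so the snippet itself does not depend on the source file encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    // First word of the question's array (U+0E21 U+0E32 U+0E01), written
    // with Unicode escapes so this file contains ASCII only.
    static final String ORIGINAL = "\u0E21\u0E32\u0E01";

    /** Encodes to UTF-8 and decodes with the same charset: lossless. */
    static String roundTrip(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    /** Decodes UTF-8 bytes with the wrong charset, imitating a cp1252 default. */
    static String decodeAsCp1252(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        System.out.println(roundTrip(ORIGINAL).equals(ORIGINAL));      // true
        System.out.println(decodeAsCp1252(ORIGINAL).equals(ORIGINAL)); // false
    }
}
```

The second call is roughly what happens when UTF-8 source bytes are read as cp1252: each of the nine bytes becomes its own character, which is where sequences like à¸ come from.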

3 Answers


According to the documentation, String(byte[]) constructor "Constructs a new String by decoding the specified array of bytes using the platform's default charset."

I'm guessing that the default character set is not UTF-8. So the solution is to specify the encoding for the array of bytes.

String utf8EncodedThai = new String(utf8EncodedThaiArr, "UTF-8"); //setting breakpoint here
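
As a hedged sketch of the same idea, the overload that takes a `Charset` avoids both the platform default and the checked `UnsupportedEncodingException` that the string-named overloads declare. The class and method names below are illustrative only, and the byte array is hard-coded to stand in for bytes read from a file or the network.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ExplicitCharsetDecode {
    /** Decodes bytes known to be UTF-8, independent of the platform default. */
    static String decodeUtf8(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // UTF-8 bytes of the first word in the question's array
        // (U+0E21 U+0E32 U+0E01), three bytes per character.
        byte[] utf8Bytes = {
            (byte) 0xE0, (byte) 0xB8, (byte) 0xA1,
            (byte) 0xE0, (byte) 0xB8, (byte) 0xB2,
            (byte) 0xE0, (byte) 0xB8, (byte) 0x81
        };

        // Shows what new String(byte[]) would have used implicitly.
        System.out.println("Default charset: " + Charset.defaultCharset());

        String decoded = decodeUtf8(utf8Bytes);
        System.out.println(decoded.length());                     // 3
        System.out.println(decoded.equals("\u0E21\u0E32\u0E01")); // true
    }
}
```
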
markspace
  • The OP already has the original string in `s`. Why not go one step further and simplify the code to `String utf8EncodedThai = s;` ? – Joni Aug 25 '20 at 14:17
  • I assume that it's just an example and the actual problem is a byte array that the OP got off the wire or from a database. I.e., it's not actually a string literal. – markspace Aug 25 '20 at 14:19
  • But that would still not make sense: then the OP would just be converting a string to UTF-8 and immediately back to a string again, which effectively does nothing. – Jesper Aug 25 '20 at 14:23
  • Actually, reading a bit more, the OP mentions "file encoding", so I assume that they're having trouble reading the bytes from a data file. The OP's code is just a [mcve]; we don't have the file so they're just using strings in place of reading the file. I don't see why that's difficult to understand. – markspace Aug 25 '20 at 14:25
  • Perhaps I can elucidate what's going on a bit more. Yes, I am striving for a minimal reproducible example, and I'm also converting a string to UTF-8, and back again (sort of). The part that doesn't make sense, and the whole reason for the question, is why on earth it 'loses' that UTF-8 encoding. How on earth can it start as มาก (Thai for 'much' or 'many'), and then get turned to mush the moment I start trying to actually do something with it in the array. – sacredfaith Aug 25 '20 at 15:00
  • Use the overload that accepts a `Charset`: `new String(arr, StandardCharsets.UTF_8)` – erickson Aug 25 '20 at 15:53
  • @sacredfaith I expected this answer to solve your problem. Does it? – erickson Aug 25 '20 at 15:59
  • @sacredfaith "*why on earth it 'loses' that UTF-8 encoding. How on earth can it start as มาก (Thai for 'much' or 'many'), and then get turned to mush the moment I start trying to actually do something with it in the array.*" - because you are not specifying the UTF-8 charset when decoding the UTF-8 byte array back to a string, and clearly your environment's default charset is not set to UTF-8. – Remy Lebeau Aug 25 '20 at 18:43
  • Thank you all so much for pointing me in the right direction. Your explanations genuinely helped me understand *why* my code was breaking. – sacredfaith Aug 26 '20 at 09:47

As several people pointed out in the comments, the problem had to be within my environment. After a bit more searching, I found I should have rebuilt the project after changing the encodings (merely switching to UTF-8 and clicking 'Apply'/'OK' wasn't enough). For reference, my File Encoding settings look like this: [Screenshot: File Encodings settings]

Once I rebuilt, I started getting the compiler error "unmappable character for encoding cp1252" on the String array containing the Thai (side note: Some of the Thai characters were fine, others rendered as � and friends. I would have thought either all of the Thai would work or none of it, but was surprised to see even common Thai letters such as ก cause the compiler to choke).
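
One plausible explanation for why only some letters failed: ก (U+0E01) is encoded in UTF-8 as the bytes E0 B8 81, and windows-1252 leaves byte 0x81 undefined, so the compiler has nothing to map it to, whereas bytes that do exist in cp1252 simply decode to the wrong characters. As an alternative that sidesteps the source encoding entirely (not what I ended up doing), string literals can be written with Unicode escapes so the file stays pure ASCII. The helper below is a hypothetical sketch that generates such escapes; its names are made up for illustration.

```java
final class UnicodeEscapes {
    /** Rewrites every non-ASCII UTF-16 code unit of s as a Java Unicode escape. */
    static String escapeNonAscii(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                sb.append(c); // plain ASCII passes through unchanged
            } else {
                sb.append(String.format("\\u%04X", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // First word of the array, already in escaped form.
        String word = "\u0E21\u0E32\u0E01";
        // Prints the escaped literal text, ready to paste into source code.
        System.out.println(escapeNonAscii(word));
    }
}
```

A literal written this way compiles to the same String regardless of what encoding javac assumes for the file, at the cost of being unreadable in the editor.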

That error led to this post, in which I tried a few things to set the compiler options to UTF-8. Since my application happens to be a sort of 'pre-process' for an Android app, and is therefore separate from the app itself (if that makes any sense), I didn't have the luxury of using the compilerOptions attribute as the answers in the aforementioned SO post recommended (though I have since added it to the Gradle build on the Android app side). This led me to set the environment variable JAVA_TOOL_OPTIONS (the name the JVM actually reads) via PowerShell:

setx JAVA_TOOL_OPTIONS "-Dfile.encoding=UTF8"

Which fixed the issue!
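
To check whether a setting like this has actually taken effect, a small diagnostic along these lines may help. It is only a sketch with made-up names. Two things worth knowing: `setx` only affects processes started afterwards, so the IDE and any open terminals need to be restarted, and the variable name the JVM reads is `JAVA_TOOL_OPTIONS`; when it is picked up, the JVM prints a "Picked up JAVA_TOOL_OPTIONS: ..." line on startup.

```java
import java.nio.charset.Charset;

final class EncodingCheck {
    /** Summarizes the charset-related settings the running JVM sees. */
    static String describe() {
        return "default charset = " + Charset.defaultCharset()
            + ", file.encoding = " + System.getProperty("file.encoding")
            + ", JAVA_TOOL_OPTIONS = " + System.getenv("JAVA_TOOL_OPTIONS");
    }

    public static void main(String[] args) {
        // On the broken setup described above, the first value would be
        // expected to name windows-1252 rather than UTF-8.
        System.out.println(describe());
    }
}
```

Note that this reports the defaults of the JVM running the program; the encoding javac assumes for source files can also be set directly with its `-encoding` option.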

sacredfaith

I tried your code with the attached settings, and the code worked fine. [Screenshot: settings used]

O_K
  • This problem is due to differences in the environment where the code runs, so unless you point out the difference that matters, this answer is not helpful. – erickson Aug 25 '20 at 15:55