UTF-8 letters not displayed correctly

Question

I am querying JSON-formatted data using Apache Drill on Windows 10 from a command prompt. I am following their guide.

I have the very basic JSON object {"år":"2018", "æøå":"ÆØÅ"}, and when I query it from Apache Drill the output is not displayed correctly.

select * from dfs.`C:\Users\foo\Downloads\utf8.json`;
+-------+------+
|  Õr   | µ°Õ  |
+-------+------+
| 2018  | ãÏ┼  |
+-------+------+
1 row selected (0,114 seconds)

The file is saved as UTF-8 (using Sublime Text). I have also tried saving it as UTF-8 with BOM, but it made no difference.

Setting the environment variable as mentioned in this SO thread using

set JAVA_TOOL_OPTIONS=-Dfile.encoding=UTF8

does not help.

EDIT:

Shortly after posting I found an SO thread that suggested changing the Windows code page to 65001 (UTF-8). This shows the correct letters but also prevents the command history (arrow-up) from working properly.

chcp 65001
sqlline.bat -u "jdbc:drill:zk=local"
select * from dfs.`C:\Users\cgu\Downloads\utf8.json`;
+-------+------+
|  år   | æøå  |
+-------+------+
| 2018  | ÆØÅ  |
+-------+------+
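
The garbled characters in the first output are consistent with text written as Windows-1252 bytes and rendered by a console using OEM code page 850. That pairing is an assumption inferred from the characters shown, not something confirmed in the question. A minimal Python sketch reproduces the same substitutions (the helper name `as_seen_in_cp850_console` is hypothetical):

```python
def as_seen_in_cp850_console(text: str, written_as: str = "cp1252") -> str:
    """Simulate a program writing `text` in one encoding while the
    console decodes the bytes with code page 850 (assumed OEM code page)."""
    raw = text.encode(written_as)   # bytes the program is assumed to emit
    return raw.decode("cp850")      # what a CP850 console would display


# Map each original string to what the mismatched console would show.
results = {s: as_seen_in_cp850_console(s) for s in ("år", "æøå", "ÆØÅ")}
# results == {"år": "Õr", "æøå": "µ°Õ", "ÆØÅ": "ãÏ┼"}
```

If this assumption holds, `chcp 65001` helps because the console then decodes the output as UTF-8 instead of code page 850.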

You're using a Windows console. Apache Drill apparently has a console-based executable, which has nothing to do with the CMD shell other than it's using some convenience batch script (sqlline.bat). The console is natively UTF-16 (i.e. `wchar_t` based), but cross-platform projects typically use its legacy `char` based API that uses code pages. You already discovered codepage 65001 for UTF-8. If you're using Windows 8+ this works well for UTF-8 output. It's buggy in older versions, and even in Windows 10 it's broken for non-ASCII input (i.e. limited to the first 128 Unicode characters). — Eryk Sun, Jul 24 '18 at 12:21
If you're losing access to command history, you may have more processes attached to the console than it has history buffers. Each gets its own history. Open the console properties dialog, and increase the number of history buffers to 32. — Eryk Sun, Jul 24 '18 at 12:23
@eryksun Thank you for your suggestions. Increasing hist.buffers helped a bit. I ended up installing linux subsystem in win10, installed ubuntu using https://superuser.com/questions/1271682/is-there-a-way-of-installing-windows-subsystem-for-linux-on-win10-v1709-withou (can't use MS Store at work) which gives me a more familiar working environment. — kometen, Jul 25 '18 at 10:29
I find it interesting that the Linux subsystem (WSL) uses the same console backend (conhost.exe via the condrv.sys driver), but apparently via different internal APIs that have no problem reading non-ASCII input as UTF-8. — Eryk Sun, Jul 25 '18 at 10:43
In contrast, when talking to a Windows application (e.g. via `ReadFile` or `ReadConsoleA`), the same console backend substitutes NUL ("\x00") for all non-ASCII characters when the console input codepage is 65001. By attaching a debugger to conhost.exe, I know this happens because of a bad assumption that the codepage is a fixed-size encoding (typically single-byte), whereas UTF-8 uses 1-4 bytes. But I don't understand why they only fixed it for WSL and not the Windows API. I guess they have their priorities. — Eryk Sun, Jul 25 '18 at 10:45
@ery: WSL has next to no legacy. Fixing WSL is easy. Fixing the Windows API is near impossible, given the amount of compatibility issues that come with changing even the most mundane of aspects. The priorities are clear: Don't break stuff, even if it is broken. — IInspectable, Nov 20 '18 at 21:06
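
The fixed-size assumption mentioned in the comments can be illustrated with a short Python sketch. It is not the console host's code, only a demonstration of why sizing a buffer at one byte per character fails for UTF-8; the variable names are made up for this example.

```python
# UTF-8 is variable-width: one to four bytes per code point.
samples = ["a", "å", "€", "😀"]
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
# utf8_lengths == {"a": 1, "å": 2, "€": 3, "😀": 4}

text = "æøå"
assumed_size = len(text) * 1              # single-byte code page assumption
actual_size = len(text.encode("utf-8"))   # bytes UTF-8 really needs
# assumed_size == 3, actual_size == 6: the assumed buffer is too small
```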
