I have a Scala Akka app in which I execute Python scripts inside Futures using ProcessBuilder.

Unfortunately, special characters are not displayed correctly: instead of mädchen I get m�dchen (äöü -> �).

If I run the Python script from the command line, I get the correct output "mädchen", so I assume the problem is not in the Python script but in how my Scala code reads the process output.

Python Spider:

print("mädchen")

Scala:

import scala.sys.process._

val proc = Process("scrapy runspider spider.py")

var output : String = ""
val exitValue = proc ! ProcessLogger (
   (out) => if( out.trim.length > 0 )
     output += out.trim,
   (err) =>
     System.err.printf("e:%s\n",err)
)

println(exitValue) // 0 -> succ.
println(output)    // m�dchen -> should be mädchen

I have already tried many things and also read that Strings are UTF-8 by default, so I am not sure why I get those question marks.

I also tried the following, without success:
var byteBuffer : ByteBuffer = StandardCharsets.UTF_8.encode(output.toString())
val str = new String(output.toString().getBytes(), "UTF-8")


Update:

It seems to be a Windows-related issue. Following these instructions solved the problem: Using UTF-8 Encoding (CHCP 65001) in Command Prompt / Windows Powershell (Windows 10)
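
If changing the system-wide setting is not an option, one possible alternative is to read the child's stdout as raw bytes and decode it with an explicit charset instead of the platform default. The following is only a minimal sketch and has not been run against scrapy. `Utf8ProcessOutput`, `readAll` and `runDecoded` are hypothetical names, and it assumes the Python side really emits UTF-8 (for example, by setting the `PYTHONIOENCODING` environment variable for the child process).

```scala
import java.io.{ByteArrayOutputStream, InputStream}
import java.nio.charset.{Charset, StandardCharsets}
import scala.sys.process._

object Utf8ProcessOutput {
  // Read a stream to the end as raw bytes, then decode once with the given charset.
  def readAll(in: InputStream, charset: Charset): String = {
    val buf = new ByteArrayOutputStream()
    val chunk = new Array[Byte](8192)
    var n = in.read(chunk)
    while (n != -1) {
      buf.write(chunk, 0, n)
      n = in.read(chunk)
    }
    new String(buf.toByteArray, charset)
  }

  // Run a command and return (exit code, decoded stdout).
  // extraEnv is added to the child's environment.
  def runDecoded(cmd: Seq[String], charset: Charset, extraEnv: (String, String)*): (Int, String) = {
    var out = ""
    val io = new ProcessIO(
      stdin => stdin.close(), // nothing is sent to the child
      stdout => { try { out = readAll(stdout, charset) } finally stdout.close() },
      stderr => { try { System.err.print(readAll(stderr, charset)) } finally stderr.close() }
    )
    // exitValue() blocks until the process has finished.
    val exit = Process(cmd, None, extraEnv: _*).run(io).exitValue()
    (exit, out)
  }
}
```

For the spider in the question, a call might look like `Utf8ProcessOutput.runDecoded(Seq("scrapy", "runspider", "spider.py"), StandardCharsets.UTF_8, "PYTHONIOENCODING" -> "utf-8")`. Whether scrapy honours that variable in this setup is an assumption that would need checking.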

MJey
  • What is the encoding of the Python file? It's clearly not UTF-8, since `ä` is encoded as two bytes in UTF-8, but you only get one question mark. So, you can force Scala to read UTF-8 as much as you want, but you also need to actually feed it UTF-8 instead of something else. – Jörg W Mittag May 04 '20 at 16:45
  • Can you try casting the individual chars in output to `Int` and seeing their values? Also, I'm fairly sure JVM chars are UTF-16 codepoints – user May 04 '20 at 17:32
  • `new String(output.toString().getBytes(), "UTF-8")` works for me as I just tested. You might wanna print out the byte array and make sure it is `byte[8] { 109, -61, -92, 100, 99, 104, 101, 110}` – SwiftMango May 04 '20 at 17:42
  • @texasbruce & @user thanks for the reply, I got slightly different numbers: byte[9] { 109, -17, -65, -67, 100, 99, 104, 101, 110}. I am using Windows 10, maybe it's an OS issue? – MJey May 04 '20 at 19:20
  • @JörgWMittag thank you for the reply. I have `#!/usr/bin/env python` and `# -*- coding: utf-8 -*-` at the beginning of the Python spider file. I also tried print("mädchen".encode(encoding='UTF-8')), and then I got a byte[17] 98|39|109|92|120|99|51|92|120|97|52|100|99|104|101|110|39 -> b'm\xc3\xa4dchen' – MJey May 04 '20 at 19:31
  • @texasbruce it works now, finally :) Thanks for all the help! It seems to be a Windows OS problem, these instructions fixed it for me: https://stackoverflow.com/questions/57131654/using-utf-8-encoding-chcp-65001-in-command-prompt-windows-powershell-window – MJey May 04 '20 at 20:02
  • What was the encoding that was set initially in windows? – SwiftMango May 04 '20 at 20:04
  • @texasbruce I am not too sure tbh, I think it is https://en.wikipedia.org/wiki/Windows-1252 in my case (Europe). In the settings there was just a checkbox to click with "Use UTF-8" – MJey May 04 '20 at 20:18
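
The byte values quoted in the comments can be reproduced with a small sketch. It assumes, as guessed above, that the child process wrote `ä` as the single byte 0xE4 (its value in Windows-1252 and ISO-8859-1) and that the bytes were then decoded as UTF-8. `ReplacementCharDemo` is a hypothetical name used only for illustration.

```scala
import java.nio.charset.StandardCharsets.UTF_8

object ReplacementCharDemo {
  // "mädchen" as a single-byte code page would emit it: ä is the one byte 0xE4.
  val singleByte: Array[Byte] =
    Array(0x6d, 0xe4, 0x64, 0x63, 0x68, 0x65, 0x6e).map(_.toByte)

  // 0xE4 followed by 'd' is not a valid UTF-8 sequence, so the decoder
  // substitutes the replacement character U+FFFD for that byte.
  val decoded: String = new String(singleByte, UTF_8)

  // Re-encoding cannot restore the lost byte: U+FFFD becomes EF BF BD.
  val reencoded: Array[Byte] = decoded.getBytes(UTF_8)

  def main(args: Array[String]): Unit = {
    println(decoded.map(_.toInt).mkString(" ")) // 109 65533 100 99 104 101 110
    println(reencoded.mkString(", "))           // 109, -17, -65, -67, 100, 99, 104, 101, 110
  }
}
```

The second line matches the byte[9] reported above, which suggests the original byte was already lost at decoding time. That would also explain why `new String(output.getBytes(), "UTF-8")` applied afterwards could not bring the `ä` back.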
