
I want to read some Unicode characters (Farsi characters) from the console. I used System.in, but it didn't work. It looks like standard input does not understand the characters I type, so it just returns gibberish in my String variable. I am absolutely sure that the String variable's encoding is set to "UTF-8". Believe me, I double-checked.

Some pieces of code that I tried.

String t = new String (new Scanner(System.in).nextLine().getBytes() , "UTF-8");

This didn't work.

byte b[] = new byte[4];
System.in.read(b);
String st = new String (b , "UTF-16");
System.out.println(st);

I wrote the above code to read just one Farsi character. It didn't work either.

Ark1375
  • This depends on the console. What OS, what console (native, IDE, etc.)? – rustyx Jun 14 '18 at 10:11
  • As for the first example, it seems your default encoding (which `getBytes()` would use) is not UTF-8. What would `System.out.println(new Scanner(System.in).nextLine())` do? – david a. Jun 14 '18 at 10:24
  • @rustyx The OS is Windows 10 and I'm using the IDE's console (the IDE is NetBeans). – Ark1375 Jun 14 '18 at 10:38
  • @davida. Yes, exactly. The problem is reading from the console. I can even read UTF-8 encoded text files and it's fine. But when it comes to reading from the console, everything gets messed up. The second piece of code that you wrote is no different. Same exact problem. I know for sure that the problem is in reading, because, for example, this line executes just fine: `System.out.println("فارسیFarsi");` – Ark1375 Jun 14 '18 at 10:47
  • To be clear, **String/char/Reader/Writer** is for text, and internally keeps text in Unicode, so all scripts can be combined. **byte[]/InputStream/OutputStream** is for binary data. Specifying an encoding applies to that binary data, to indicate that the _data_ is text in some encoding. So it is better to always use `s.getBytes(charset)` and `new String(bytes, charset)`; otherwise the bytes default to the operating system encoding. – Joop Eggen Jun 14 '18 at 12:02
  • @JoopEggen I used the suggested option in the first code I wrote. Same problem came up. – Ark1375 Jun 14 '18 at 18:09
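
To illustrate the point about explicit charsets made in the comments above, here is a minimal sketch. The class and method names (`CharsetRoundTrip`, `roundTripUtf8`, `mismatchedDecode`) are made up for this example and do not come from the thread. It shows that encoding and decoding with the same charset preserves the text, while decoding with a different charset garbles it.

```java
import java.nio.charset.StandardCharsets;

public class CharsetRoundTrip {
    // Encodes text as UTF-8 and decodes the bytes with the same charset.
    public static String roundTripUtf8(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Encodes text as UTF-8 but decodes the bytes as ISO-8859-1 (a deliberate mismatch).
    public static String mismatchedDecode(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String farsi = "\u0641\u0627\u0631\u0633\u06CC"; // the word "Farsi" in Persian script
        System.out.println(roundTripUtf8(farsi).equals(farsi));    // true
        System.out.println(mismatchedDecode(farsi).equals(farsi)); // false
    }
}
```

This is probably also why re-encoding an already decoded String, as in the first snippet of the question, cannot help: if the console bytes were decoded with the wrong charset before `nextLine()` returned, calling `getBytes()` afterwards does not bring the original bytes back.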

1 Answer

First of all, the console must be in UTF-8 mode.

If using NetBeans, edit the file <NetBeansRoot>/etc/netbeans.conf. Under netbeans_default_options, add -J-Dfile.encoding=UTF-8.

Once you're sure the console and your project encoding are set to UTF-8, try this:

Scanner console = new Scanner(new InputStreamReader(System.in, "UTF-8"));
while (console.hasNextLine())
    System.out.println(console.nextLine());

Note: System.in is an InputStream, i.e. a stream of bytes; it delivers the bytes from the console 1-to-1.
To read characters you need a Reader. A Reader takes an InputStream and an encoding, and produces characters.
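
As a rough illustration of that note, the sketch below feeds a fixed byte array (standing in for console input) through an `InputStreamReader`. The class name `ReaderDecodingSketch` and the helper `firstLine` are invented for this example. The bytes `da af 0a` are the UTF-8 encoding of `گ` followed by a newline.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReaderDecodingSketch {
    // Decodes raw bytes through a Reader and returns the first line of text.
    public static String firstLine(byte[] rawBytes, Charset charset) {
        InputStream in = new ByteArrayInputStream(rawBytes); // stand-in for System.in
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] utf8Input = {(byte) 0xDA, (byte) 0xAF, 0x0A}; // "گ" + newline in UTF-8
        String line = firstLine(utf8Input, StandardCharsets.UTF_8);
        System.out.println(line.equals("\u06AF")); // true: one character, U+06AF
    }
}
```

The charset passed to the Reader has to match what the console actually sends; if the console emits a different encoding, the same code yields the wrong characters.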

If it doesn't help, try another console (e.g. Windows cmd, but first run chcp 65001).

rustyx
  • Didn't work. I tried everything you said. I even declared a String variable with an explicit "UTF-8" charset and wrote the byte array directly into the variable. Same problem. I did something else too: I changed my keyboard settings to "Arabic", since I couldn't find any "Persian" or "Farsi" character table in the Unicode documentation. No difference either. I tried setting the charset to "UTF-16", but another problem came up: the program kept asking for more input. I typed a dozen different inputs in English and Farsi, but it kept asking for more. Why is that? – Ark1375 Jun 14 '18 at 18:04
  • Try `while (true) System.out.print(String.format("%02x ", System.in.read()));` and post the results – rustyx Jun 14 '18 at 18:17
  • Ok, so I ran the code. Examples: inp: `گ` out: `90 0a` inp: `پ` out: `81 0a` inp: `A` out: `41 0a` A question. What are these exactly? – Ark1375 Jun 14 '18 at 18:42
  • That looks like [Windows-1256](https://msdn.microsoft.com/en-us/library/cc195058.aspx). In UTF-8, `گ` would be `da af`. Your console isn't set to UTF-8 mode. – rustyx Jun 14 '18 at 18:48
  • I changed NetBeans' default encoding to UTF-8 using [this](https://stackoverflow.com/questions/24778725/how-to-change-default-encoding-in-netbeans-8-0) and it worked. Thanks a lot. – Ark1375 Jun 14 '18 at 19:09
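
The byte values discussed in the comments above can be reproduced without a console. The following is a small sketch with an invented class name (`EncodingProbe`) and helper (`hexOf`); it assumes the JDK in use ships the `windows-1256` charset, which full JDK installations normally do but minimal runtimes may not.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingProbe {
    // Encodes text with the given charset and formats the bytes as lowercase hex.
    public static String hexOf(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String gaf = "\u06AF"; // the letter گ
        System.out.println(hexOf(gaf, StandardCharsets.UTF_8)); // da af

        // Assumption: windows-1256 is available in this runtime.
        if (Charset.isSupported("windows-1256")) {
            System.out.println(hexOf(gaf, Charset.forName("windows-1256"))); // 90
        }
    }
}
```

A single byte `90` per keystroke therefore points to a Windows-1256 console, whereas a UTF-8 console would send the two bytes `da af` for the same letter.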