In my application, I am reading Japanese text from a database (UTF-8) and trying to write it to a file in SHIFT_JIS encoding. However, the full-width hyphen - (0x817C in Shift_JIS) is written as ? in the output file.

Here is a sample program that reproduces it:

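import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;

public class ShiftJisTest {

    public static void main(String[] args) {
        // 東, then FULLWIDTH DIGIT ONE (U+FF11), FULLWIDTH HYPHEN-MINUS (U+FF0D), FULLWIDTH DIGIT ONE
        String text = "東1-1";
        // Write the text to "output_data" using Java's SHIFT_JIS charset
        try (BufferedWriter writer = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream("output_data"), "SHIFT_JIS"))) {
            writer.write(text);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}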

Output:

東1?1

Hex Code value of output:

93 8C 82 50 3F 82 50

Garbled byte in hex: 3F; expected 81 7C

Prince
  • I suggest reducing that example string to the fewest possible characters needed to show your issue. – Basil Bourque Aug 23 '23 at 06:26
  • Here is the input and output for the character having the issue. Input: String text = "1-"; Output: 1? – Prince Aug 23 '23 at 06:34
  • [Unicode replacement character](https://en.wikipedia.org/wiki/Specials_(Unicode_block)) is used when the encoding doesn't support the character. I'd say you either have the wrong kind of `-`, or this is a [bug](https://bugs.openjdk.org/browse/JDK-4852917) in SHIFT_JIS support. – Kayaman Aug 23 '23 at 06:39
  • According to [this](https://www.fileformat.info/info/charset/Shift_JIS/list.htm) `0x817C` is [U+2212 minus sign](https://www.fileformat.info/info/unicode/char/2212/index.htm). This reproduces on my machine, as in the hyphen is translated to `3F/?`. – Kayaman Aug 23 '23 at 06:49
  • Those two characters in the second Comment are 65297 & 65293 decimal. See [code run at Ideone.com](https://ideone.com/xy1TOy). Character names are Fullwidth Digit One and Fullwidth Hyphen-Minus. – Basil Bourque Aug 23 '23 at 07:19
  • If you are using Windows then in your `OutputStreamWriter` constructor replace "SHIFT_JIS" with "windows-932". I just tested that change, and it fixes your problem. See the [accepted answer](https://stackoverflow.com/a/29784905/2985643) for [How to read this text ①② in a SJIS file?](https://stackoverflow.com/q/29778255/2985643) Your question looks like a duplicate of that one. (If you are not using Windows then specify your O/S.) – skomisa Aug 23 '23 at 07:33
  • From testing it looks like you can also specify "windows-31j" or "MS932" or "cp932" instead of "windows-932". I don't know which of those names (if any) are supported in non-Windows environments. – skomisa Aug 23 '23 at 08:13
  • Looks like CP932/MS932 gives the proper output. – Prince Aug 23 '23 at 09:09
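
The difference the comments describe can be checked with a short sketch. It encodes the question's string with both charsets and prints the bytes as hex. The class name `EncodingComparison` and the helper `toHex` are hypothetical names chosen for illustration, and the sketch assumes a JDK that ships both the `Shift_JIS` and `windows-31j` charsets.

```java
import java.nio.charset.Charset;

public class EncodingComparison {

    // Hypothetical helper: encodes the text with the named charset and
    // returns the resulting bytes as space-separated uppercase hex.
    // String.getBytes(Charset) silently substitutes unmappable characters.
    static String toHex(String text, String charsetName) {
        byte[] bytes = text.getBytes(Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+6771, U+FF11, U+FF0D, U+FF11: the string from the question
        String text = "\u6771\uFF11\uFF0D\uFF11";
        System.out.println("Shift_JIS:   " + toHex(text, "Shift_JIS"));
        System.out.println("windows-31j: " + toHex(text, "windows-31j"));
    }
}
```

With the question's input, the `Shift_JIS` line should show `3F` where the hyphen was, and the `windows-31j` line should show `81 7C` in the same position.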

1 Answer

Looks like that character is not in Shift_JIS:

goose@t410:/tmp$ uniname '\uFF0D'
The name for codepoint \uFF0D is FULLWIDTH HYPHEN-MINUS
The char is -
goose@t410:/tmp$ echo -en '\uFF0D' | iconv -t SHIFT-JIS
iconv: illegal input sequence at position 0
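
A rough Java counterpart to the `iconv` check above, as a sketch only: it asks each charset's encoder whether it can encode a given character. `CanEncodeCheck` and its `canEncode` helper are hypothetical names; the sketch assumes the JDK in use provides both charsets.

```java
import java.nio.charset.Charset;

public class CanEncodeCheck {

    // Hypothetical helper: true if the named charset's encoder can
    // represent the given character.
    static boolean canEncode(String charsetName, char c) {
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        char fullwidthHyphenMinus = '\uFF0D';
        char minusSign = '\u2212';
        System.out.println("Shift_JIS   U+FF0D: " + canEncode("Shift_JIS", fullwidthHyphenMinus));
        System.out.println("Shift_JIS   U+2212: " + canEncode("Shift_JIS", minusSign));
        System.out.println("windows-31j U+FF0D: " + canEncode("windows-31j", fullwidthHyphenMinus));
    }
}
```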
g00se
  • Although depending on the [converter](https://youtrack.jetbrains.com/issue/IDEA-122942/incorrect-encoding-Shiftjis-for-some-special-characters) it could/should be converted to the minus sign. – Kayaman Aug 23 '23 at 07:54
  • This may explain the OP's problem, but it doesn't solve it. – skomisa Aug 23 '23 at 08:05
  • The solution is to not use non-existent characters. Substitution can be used but that's arbitrary and subjective – g00se Aug 23 '23 at 08:52
  • No, that is not the solution at all. The solution is to select an appropriate encoding for the text being processed. The OP's problem arose only because an inappropriate encoding was used. – skomisa Aug 23 '23 at 17:06
  • Presumably @Prince has a reason for using Shift_JIS. If not, the solution is simple: use UTF-8. – g00se Aug 23 '23 at 17:30
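
If strict Shift_JIS really is required, the two ideas raised in this thread (detecting unmappable characters, and substituting them) could be sketched as follows. This is an illustration, not a recommendation: `StrictShiftJis`, `encodeStrict`, `isEncodable` and `substituteForShiftJis` are hypothetical names, and mapping U+FF0D to U+2212 is an assumed substitution that, as noted above, is a judgment call.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

public class StrictShiftJis {

    // Hypothetical helper: encodes strictly, throwing on any character the
    // charset cannot represent instead of silently writing '?'.
    static byte[] encodeStrict(String text, String charsetName) throws CharacterCodingException {
        ByteBuffer buf = Charset.forName(charsetName).newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(CharBuffer.wrap(text));
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    // Hypothetical helper: true if the whole text encodes without error.
    static boolean isEncodable(String text, String charsetName) {
        try {
            encodeStrict(text, charsetName);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Assumed substitution: FULLWIDTH HYPHEN-MINUS (U+FF0D) to MINUS SIGN (U+2212).
    static String substituteForShiftJis(String text) {
        return text.replace('\uFF0D', '\u2212');
    }

    public static void main(String[] args) {
        String text = "\u6771\uFF11\uFF0D\uFF11";
        System.out.println("Encodable as-is: " + isEncodable(text, "Shift_JIS"));
        System.out.println("Encodable after substitution: "
                + isEncodable(substituteForShiftJis(text), "Shift_JIS"));
    }
}
```

Note that after substitution the file contains a different Unicode character than the database did, so a reader decoding it as Shift_JIS gets U+2212 back rather than U+FF0D.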