I have a problem when I create a file using the Shift-JIS charset.

This is an example of text that I want to write to a txt file:

繰戻_日経選挙システム保守2019年1月10日～;[2019年度更新]横浜第１DCコロケ―ション（２ラック）

With the Shift-JIS charset, the file contains two '?' in place of ～ and ―:

繰戻_日経選挙システム保守2019年1月10日?;[2019年度更新]横浜第１DCコロケ?ション（２ラック）

With the UTF-8 charset, the file contains the correct text:

繰戻_日経選挙システム保守2019年1月10日～;[2019年度更新]横浜第１DCコロケ―ション（２ラック）

This is my code:

package it.grupposervizi.easy.ef.etl.elaboration;

import com.nimbusds.jose.util.StandardCharset;
import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.List;
import org.apache.commons.io.FileUtils;

public class TestShiftJIS {

  private static final String TEXT = "繰戻_日経選挙システム保守2019年1月10日～;[2019年度更新]横浜第１DCコロケ―ション（２ラック）";
  private static final String DIRECTORY = "C:\\temp\\japan\\";
  private static final String SHIFT_JIS = "Shift-JIS";
  private static final String UTF_8 = StandardCharset.UTF_8.name();
  private static final String EXTENSION = ".txt";

  public static void main(String[] args) {

    final List<String> charsets = Arrays.asList(SHIFT_JIS, UTF_8);
    charsets.forEach(c -> {
      final String fName = DIRECTORY + c + EXTENSION;
      File file = new File(fName);
      try {
        FileUtils.writeStringToFile(file, TEXT, Charset.forName(c));
      } catch (IOException e) {
        throw new RuntimeException(e);
      }
    });

    System.out.println("End Test");
  }
}

Do you have any idea why these two characters are not included in the Shift-JIS charset?

Dharman
  • Could it be that the file editor you use to look at the file can't display those characters? – assylias Sep 02 '20 at 13:21
  • Questionable characters are (_DashPunctuation_) `―` U+2015 *Horizontal Bar* and (_MathSymbol_) `～` U+FF5E *Fullwidth Tilde*. I doubt that those characters are in `Shift-JIS`… – JosefZ Sep 02 '20 at 13:45

3 Answers

///EDIT:

You are trying to save a file in an encoding that differs from the default one. Try changing the character encoding. More about encodings: https://en.wikipedia.org/wiki/Character_encoding

///

Try using: Charset.forName("CP943C")

Koziołek
  • Can you please edit your answer and explain why this line of code can solve this issue? Explaining how and why this solves the problem would really help to improve the quality of your post, and probably result in more up-votes. – HardcoreGamer Aug 12 '21 at 04:19
  • Please explain your answer. OP should not only get the solution, but reason why it is ok too. – Koziołek Aug 12 '21 at 12:28
  • I tried to use Charset.forName("CP943C") but without success. At the moment I work around it by always using UTF-8. – Giulio Andolfi Aug 13 '21 at 12:28

@JosefZ has basically already given the answer: Java's Shift-JIS charset does not support ～ (U+FF5E) and ― (U+2015).

This can be verified using Charset.newEncoder().canEncode(char):

import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

public class ShiftJisTest {
    public static void main(String[] args) {
        // 繰戻_日経選挙システム保守2019年1月10日～;[2019年度更新]横浜第１DCコロケ―ション（２ラック）
        String s = "\u7e70\u623b\u005f\u65e5\u7d4c\u9078\u6319\u30b7\u30b9\u30c6\u30e0\u4fdd\u5b88\u0032\u0030\u0031\u0039\u5e74\u0031\u6708\u0031\u0030\u65e5\uff5e\u003b\u005b\u0032\u0030\u0031\u0039\u5e74\u5ea6\u66f4\u65b0\u005d\u6a2a\u6d5c\u7b2c\uff11\u0044\u0043\u30b3\u30ed\u30b1\u2015\u30b7\u30e7\u30f3\uff08\uff12\u30e9\u30c3\u30af\uff09";
        Charset charset = Charset.forName("Shift-JIS");

        // Print every character that the Shift-JIS encoder cannot map
        CharsetEncoder encoder = charset.newEncoder();
        for (char c : s.toCharArray()) {
            if (!encoder.canEncode(c)) {
                System.out.printf("%s (U+%04X)%n", c, (int) c);
            }
        }

        // A fresh encoder reports unmappable characters by throwing
        try {
            charset.newEncoder().encode(CharBuffer.wrap(s));
        } catch (CharacterCodingException e) {
            // java.nio.charset.UnmappableCharacterException: Input length = 1
            e.printStackTrace();
        }
    }
}

The reason you are seeing ? is that Apache Commons IO's FileUtils.writeStringToFile(File, String, Charset) internally uses String.getBytes(Charset), whose documentation says:

[...] This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement byte array.

And the CharsetEncoder documentation says:

[...] The replacement is initially set to the encoder's default replacement, which often (but not always) has the initial value { (byte)'?' }
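To make the difference between the two behaviours concrete, here is a minimal sketch that encodes the two problem characters both ways. The class name ReplacementDemo and the helper names encodeLenient and encodeStrict are made up for this illustration, and it assumes a JDK that ships the Shift_JIS charset (a full JDK normally does).

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

public class ReplacementDemo {

    // Lenient path: String.getBytes(Charset) silently substitutes the
    // charset's replacement bytes for characters it cannot map.
    static byte[] encodeLenient(String text, Charset charset) {
        return text.getBytes(charset);
    }

    // Strict path: an encoder configured with REPORT throws instead of substituting.
    static byte[] encodeStrict(String text, Charset charset) throws CharacterCodingException {
        ByteBuffer buffer = charset.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .encode(CharBuffer.wrap(text));
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);
        return bytes;
    }

    public static void main(String[] args) {
        Charset shiftJis = Charset.forName("Shift_JIS");
        String text = "\uff5e\u2015"; // FULLWIDTH TILDE, HORIZONTAL BAR

        // Each unmappable character becomes the replacement byte
        byte[] lenient = encodeLenient(text, shiftJis);
        System.out.println(new String(lenient, shiftJis));

        try {
            encodeStrict(text, shiftJis);
            System.out.println("Encoded without loss");
        } catch (CharacterCodingException e) {
            System.out.println("Cannot encode: " + e);
        }
    }
}
```

If silent data loss is a concern, encoding through the strict path before writing the bytes is one way to fail fast instead of producing a file with ? in it.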

Marcono1234
  • Thanks for your answer. I tried your code and the two chars are encoded with Shift-JIS. I tried to use another method to write a file but without success. I will try again. – Giulio Andolfi Sep 04 '20 at 16:01
  • @GiulioAndolfi what do you mean by "two chars are encoded with Shift-JIS"? My code was supposed to demonstrate that they cannot be encoded. – Marcono1234 Sep 04 '20 at 16:22

As @Marcono1234 answered, the Shift-JIS mapping in Java does not support ～ (U+FF5E) and ― (U+2015). To encode these code points in a Shift-JIS-compatible way, use Charset.forName("windows-31j") or Charset.forName("MS932") rather than Charset.forName("Shift-JIS").
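The following is a small sketch for checking this on your own JDK; it is an illustration and not taken from the question's project. The class name Windows31jDemo and the helpers canEncodeAll and roundTrips are hypothetical, and it assumes the JDK includes both the Shift_JIS and windows-31j charsets.

```java
import java.nio.charset.Charset;

public class Windows31jDemo {

    // True if every character of the text has a mapping in the charset.
    static boolean canEncodeAll(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    // True if encoding and then decoding with the same charset returns the original text.
    static boolean roundTrips(String text, Charset charset) {
        return text.equals(new String(text.getBytes(charset), charset));
    }

    public static void main(String[] args) {
        String text = "\uff5e\u2015"; // FULLWIDTH TILDE, HORIZONTAL BAR

        for (String name : new String[] {"Shift_JIS", "windows-31j"}) {
            Charset charset = Charset.forName(name);
            System.out.printf("%s: canEncode=%b, roundTrips=%b%n",
                    name, canEncodeAll(text, charset), roundTrips(text, charset));
        }
    }
}
```

With the code from the question, the equivalent change would be to pass Charset.forName("windows-31j") to FileUtils.writeStringToFile. Whatever reads the file afterwards should then also decode it as windows-31j (CP932), since a strict Shift_JIS decoder may map the same bytes to different characters.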

SATO Yusuke