If a stream of data contains characters in different encodings, is there a way to change the charset after an input stream has been created? Any suggestions on how this could be achieved?

Example to help explain:

// Goal: read the first 4 characters as UTF-8 and the next 4 characters as ISO-8859-2.
String data = "testўёѧẅ";
// getBytes() uses the platform default charset; a charset could be passed in instead
try (InputStream in = new ByteArrayInputStream(data.getBytes())) {
    // probably an input stream reader to use char instead of byte would be clearer but hopefully the idea comes across
    byte[] bytes = new byte[4]; 
    while (in.read(bytes) != -1) {
        // TODO: change the charset here to UTF-8 then read values

        // TODO: change the charset here to ISO-8859-2 then read values
    }
}

Been looking at decoders, might be the way to go:

Attempt using same input stream:

String data = "testўёѧẅ";
InputStream inputStream = new ByteArrayInputStream(data.getBytes());
Reader r = new InputStreamReader(inputStream, "UTF-8");
int intch;
int count = 0;
while ((intch = r.read()) != -1) {
    System.out.println((char) intch);
    if ((++count) == 4) {
        // after 4 characters, switch to a second reader on the same stream
        r = new InputStreamReader(inputStream, Charset.forName("ISO-8859-2"));
    }
}

// Outputs "test" but not the second part.

Mercury

2 Answers

0

Assuming that you know there will be n UTF-8 characters and m ISO 8859-2 characters in your stream (n=4, m=4 in your example), you can do this by using two different InputStreamReaders working on the same InputStream:

try (InputStream in = new ByteArrayInputStream(data.getBytes())) {
    InputStreamReader inUtf8 = new InputStreamReader(in, StandardCharsets.UTF_8);
    InputStreamReader inIso88592 = new InputStreamReader(in, Charset.forName("ISO-8859-2"));


    // read `n` characters using inUtf8, then read `m` characters using inIso88592
}

Note that you need to count characters, not bytes (i.e. track how many characters have been read so far), because in UTF-8 a single character may be encoded in 1 to 4 bytes.
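One caveat with two readers on one stream: the InputStreamReader documentation states that more bytes may be read ahead from the underlying stream than the current read needs, so the first reader can drain bytes meant for the second. A minimal sketch of one possible workaround follows, assuming the UTF-8 part is counted in code points and everything after it is ISO-8859-2. The class and method names (MixedCharsetReader, readUtf8, readRest) are made up for illustration, and the sample text uses Latin-2 letters rather than the Cyrillic ones from the question, since ISO-8859-2 cannot represent those.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MixedCharsetReader {

    /** Reads exactly `codePoints` UTF-8 code points without consuming any later bytes. */
    static String readUtf8(InputStream in, int codePoints) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        for (int i = 0; i < codePoints; i++) {
            int lead = in.read();
            if (lead == -1) throw new EOFException("stream ended after " + i + " code points");
            buf.write(lead);
            int extra; // number of continuation bytes announced by the lead byte
            if ((lead & 0x80) == 0x00) extra = 0;       // 0xxxxxxx
            else if ((lead & 0xE0) == 0xC0) extra = 1;  // 110xxxxx
            else if ((lead & 0xF0) == 0xE0) extra = 2;  // 1110xxxx
            else if ((lead & 0xF8) == 0xF0) extra = 3;  // 11110xxx
            else throw new IOException("invalid UTF-8 lead byte: " + lead);
            for (int j = 0; j < extra; j++) {
                int b = in.read(); // continuation bytes are not validated in this sketch
                if (b == -1) throw new EOFException("truncated UTF-8 sequence");
                buf.write(b);
            }
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Decodes everything left in the stream with a single reader, so read-ahead is harmless. */
    static String readRest(InputStream in, Charset charset) throws IOException {
        Reader reader = new InputStreamReader(in, charset);
        StringBuilder sb = new StringBuilder();
        for (int c; (c = reader.read()) != -1; ) {
            sb.append((char) c);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        Charset latin2 = Charset.forName("ISO-8859-2");
        // Build a mixed stream: 4 UTF-8 characters followed by 4 ISO-8859-2 characters.
        ByteArrayOutputStream mixed = new ByteArrayOutputStream();
        mixed.write("t\u00E9st".getBytes(StandardCharsets.UTF_8));
        mixed.write("\u0142\u0105\u010D\u0151".getBytes(latin2));
        try (InputStream in = new ByteArrayInputStream(mixed.toByteArray())) {
            String first = readUtf8(in, 4);
            String second = readRest(in, latin2);
            System.out.println(first + " | " + second);
        }
    }
}
```

Reading the UTF-8 part byte by byte keeps the stream position exact, which is what lets a second decoder start at the right byte. A code point outside the Basic Multilingual Plane still counts as one code point here but yields two chars in the resulting String.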

syntagma
  • For some odd reason when 2 readers are created with the same InputStream then the first one will read fine, but the second one does nothing. Updated the post with an example. – Mercury Feb 10 '20 at 19:59
0

A String holds Unicode text, so it can combine all scripts.

String data = "testўёѧẅ";

For that, String uses a char array, where each char is a UTF-16 code unit. Some Unicode code points need to be encoded as two chars (a surrogate pair), so a char maps exactly to a code point only for part of Unicode. For this example, splitting by char index is enough:

String d1 = data.substring(0, 4);
byte[] b1 = d1.getBytes(StandardCharsets.UTF_8); // Binary data, UTF-8 text

String d2 = data.substring(4);
Charset charset = Charset.forName("ISO-8859-2");
// Characters that Latin-2 cannot represent (such as ў) are replaced, typically by '?'
byte[] b2 = d2.getBytes(charset); // Binary data, Latin-2 text

The number of bytes does not need to correspond to the number of code points. Also, é might be a single code point (é, U+00E9) or two code points: e followed by a zero-width combining acute accent (U+0301).
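To make the point about é concrete, here is a small sketch comparing the composed and decomposed forms. The class and helper names are only illustrative; java.text.Normalizer is the standard API used to convert between the two forms.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

class ComposedVsDecomposed {

    /** Number of bytes the text occupies when encoded as UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Number of Unicode code points in the text. */
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    /** Converts to the composed form (NFC), e.g. e + U+0301 becomes U+00E9. */
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String composed = "\u00E9";    // é as one code point
        String decomposed = "e\u0301"; // e followed by a combining acute accent

        // composed: 1 code point, 2 UTF-8 bytes
        System.out.println(codePoints(composed) + " code point(s), " + utf8Length(composed) + " byte(s)");
        // decomposed: 2 code points, 3 UTF-8 bytes
        System.out.println(codePoints(decomposed) + " code point(s), " + utf8Length(decomposed) + " byte(s)");
        // The two strings are not equal until they are normalized to the same form
        System.out.println(composed.equals(decomposed) + " / " + composed.equals(compose(decomposed)));
    }
}
```

Normalizing to NFC before encoding can matter for a single-byte charset such as ISO-8859-2, which has a byte for é but none for the combining accent.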

To split text by script or Unicode block:

data.codePoints().forEach(cp -> System.out.printf("%-35s - %-25s - %s%n",
            Character.getName(cp),
            Character.UnicodeBlock.of(cp),
            Character.UnicodeScript.of(cp)));

Name:                                 Unicode block:              Script:
LATIN SMALL LETTER T                - BASIC_LATIN               - LATIN
LATIN SMALL LETTER E                - BASIC_LATIN               - LATIN
LATIN SMALL LETTER S                - BASIC_LATIN               - LATIN
LATIN SMALL LETTER T                - BASIC_LATIN               - LATIN
CYRILLIC SMALL LETTER SHORT U       - CYRILLIC                  - CYRILLIC
CYRILLIC SMALL LETTER IO            - CYRILLIC                  - CYRILLIC
CYRILLIC SMALL LETTER LITTLE YUS    - CYRILLIC                  - CYRILLIC
LATIN SMALL LETTER W WITH DIAERESIS - LATIN_EXTENDED_ADDITIONAL - LATIN
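
The listing above only prints the script of each code point. A minimal sketch of actually cutting the text into runs wherever the script changes could look like the following; ScriptSplitter and splitByScript are hypothetical names, and the approach is deliberately naive.

```java
import java.util.ArrayList;
import java.util.List;

class ScriptSplitter {

    /** Splits text into consecutive runs of code points that share the same Unicode script. */
    static List<String> splitByScript(String text) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Character.UnicodeScript currentScript = null;

        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            // A change of script closes the current run and starts a new one
            if (currentScript != null && script != currentScript) {
                runs.add(current.toString());
                current.setLength(0);
            }
            current.appendCodePoint(cp);
            currentScript = script;
            i += Character.charCount(cp); // advance by 1 or 2 chars (surrogate pairs)
        }
        if (current.length() > 0) {
            runs.add(current.toString());
        }
        return runs;
    }

    public static void main(String[] args) {
        // For the example string this yields three runs: Latin, Cyrillic, Latin
        System.out.println(splitByScript("test\u045E\u0451\u0467\u1E85"));
    }
}
```

Spaces, digits and punctuation belong to the COMMON script and combining marks to INHERITED, so real text would need a rule for attaching them to a neighbouring run; this sketch simply starts a new run for them.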
Joop Eggen
  • The first conversion part is spot on, but if the data were not a simple string but an InputStream, each read would use a fixed character encoding that cannot be changed once set, unless via some sort of reflection. – Mercury Feb 10 '20 at 20:37
  • With an InputStream one is reading bytes. Byte arrays can be converted to String if one knows the encoding of the bytes and can somehow deduce their length. `String read(InputStream in, int byteCount, Charset charset)` is easy to implement. – Joop Eggen Feb 11 '20 at 09:29
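
A minimal sketch of the `read(InputStream, int, Charset)` helper mentioned in the last comment, assuming the byte length of each segment is known in advance. The class name FixedLengthDecoder and the sample data are illustrative only; DataInputStream.readFully is used because it does no buffering of its own and fails with EOFException when the stream is too short.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class FixedLengthDecoder {

    /** Reads exactly byteCount bytes from the stream and decodes them with the given charset. */
    static String read(InputStream in, int byteCount, Charset charset) throws IOException {
        byte[] bytes = new byte[byteCount];
        // readFully blocks until all bytes are read, or throws EOFException
        new DataInputStream(in).readFully(bytes);
        return new String(bytes, charset);
    }

    public static void main(String[] args) throws IOException {
        Charset latin2 = Charset.forName("ISO-8859-2");
        byte[] utf8Part = "test".getBytes(StandardCharsets.UTF_8);          // 4 bytes
        byte[] latin2Part = "\u0142\u0105\u010D\u0151".getBytes(latin2);    // 4 bytes

        ByteArrayOutputStream mixed = new ByteArrayOutputStream();
        mixed.write(utf8Part);
        mixed.write(latin2Part);

        try (InputStream in = new ByteArrayInputStream(mixed.toByteArray())) {
            // The charset can change between calls because each call decodes its own byte array
            String first = read(in, utf8Part.length, StandardCharsets.UTF_8);
            String second = read(in, latin2Part.length, latin2);
            System.out.println(first + " | " + second);
        }
    }
}
```

This works on bytes rather than characters, so for a multi-byte encoding such as UTF-8 the caller has to know the segment's byte length, not its character count.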