How can I generate non-UTF-8 string / char in Java for testing purposes?

Question

I think I have looked everywhere. I have found some examples in Ruby, but nothing coherent in Java.

Specifically, I have a JSON file that holds key-value pairs for a translation mechanism we use (so a lot of languages are involved). This JSON is parsed with a mapper that we have.

I want to check whether the mapper returns the proper values when non-UTF-8 data is encountered in the JSON.

So, I would like to use it in an approach like:

String expectedValue = "FooBarNonUtf8";
String actualValue = jsonReader.readFrom("file", "key"); //should parse non-UTF correctly
assertEquals(expectedValue, actualValue);

Since UTF-8 is meant to comprise all characters (even emojis) it's a little hard to generate one that isn't in UTF-8. Or do you mean another encoding, like Latin-1, i.e. any encoding except UTF-8? — Thomas, Aug 15 '18 at 09:29
UTF-8 is a character encoding. It transforms characters (which don't have any encoding), into bytes. Not sure what you mean by "non-UTF-8 string", since a String contains characters, not bytes. — JB Nizet, Aug 15 '18 at 09:29
surely you need to construct it as bytes. Check this question: https://stackoverflow.com/questions/16031620/how-can-i-generate-a-non-utf-8-character-set — Alan Deep, Aug 15 '18 at 09:30
Strings and characters in Java itself are always UTF-16... Encoding is only relevant when transforming to or from bytes. What exactly do you want to achieve? — Mark Rotteveel, Aug 15 '18 at 09:31
@JBNizet *since a String contains characters, not bytes*, well *internally though* is a different story since java-9 — Eugene, Aug 15 '18 at 09:35
@Eugene that is an implementation detail. The public API of a String allows getting characters, and all Unicode code points are supported. How these code points are stored internally doesn't really matter here. — JB Nizet, Aug 15 '18 at 09:37
@Eugene Well, you are right that the in-memory encoding changed, but characters are still UTF-16 in their numbering and usage (eg surrogate pairs etc). — Mark Rotteveel, Aug 15 '18 at 09:40
@MarkRotteveel well if there is a surrogate pair it is UTF-16 already - 2 bytes, if there isn't it's LATIN_1, thus a single byte — Eugene, Aug 15 '18 at 09:42
@Eugene You're talking about the in-memory encoding again. I'm talking about what characters represent. — Mark Rotteveel, Aug 15 '18 at 09:45
Do you want an invalid UTF-8 `byte[]` to test if you can convert it to `String`? — Rob Audenaerde, Aug 15 '18 at 09:52
https://stackoverflow.com/questions/1301402/example-invalid-utf8-string — Rob Audenaerde, Aug 15 '18 at 09:53
Thanks everyone. I've added some details about the flow I have. — Johnny, Aug 15 '18 at 11:35
"jsonReader … should parse non-UTF correctly": Whatever do you mean? JSON documents exchanged between system are required to be UTF-8-encoded. Isn't invalid simply invalid? Send it back. — Tom Blodget, Aug 18 '18 at 22:46
@TomBlodget that's the reason I'm trying to test it - we received a non-UTF value from the JSON (it's a lot of files which holds multilingual translations to our product). — Johnny, Aug 19 '18 at 07:46
So wouldn't the correct behavior for a file that fails to meet the expectation that it can be decoded to text using UTF-8 an exception? Then wouldn't the correct behavior for text that fails to meet the expectation that it's valid JSON be an exception? — Tom Blodget, Aug 20 '18 at 16:53
@TomBlodget exactly, but for that I need to create a file with invalid UTF-8 content and verify it in a test. That's exactly what my question is about. — Johnny, Aug 20 '18 at 22:25

score 2 · Accepted Answer · answered Aug 15 '18 at 09:39

Java made the following design decision, after the catastrophes with encoding in C/C++ (at that point in history):

String, char, Reader and Writer are for handling Unicode text; a char is a two-byte UTF-16 code unit.
byte[], InputStream and OutputStream are for binary data, which could be text once you are given an encoding/Charset.

So you can only abuse String/char to hold arbitrary binary data, and it is almost guaranteed that the data will be corrupted (some values have a special structural meaning in UTF-*).
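To illustrate the corruption, here is a minimal sketch of what happens when bytes that are not valid UTF-8 are pushed through a String. The class name, method names and the sample byte sequence are my own choices for illustration, not something from the question's code base. It assumes the usual JDK behaviour that `new String(bytes, UTF_8)` substitutes U+FFFD for malformed input, while a `CharsetDecoder` set to `REPORT` throws instead.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class InvalidUtf8Demo {

    // "Foo" followed by 0xC3 0x28. The byte 0xC3 announces a two-byte
    // sequence, but 0x28 is not a continuation byte, so this is malformed UTF-8.
    static final byte[] INVALID_UTF8 = {'F', 'o', 'o', (byte) 0xC3, (byte) 0x28};

    // Lenient decoding: malformed input is silently replaced with U+FFFD.
    static String decodeLeniently(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Strict decoding: malformed input raises a CharacterCodingException.
    static String decodeStrictly(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    // True only if the bytes decode cleanly as UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            decodeStrictly(bytes);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // True if bytes -> String -> bytes gives back the original bytes.
    static boolean survivesRoundTrip(byte[] bytes) {
        byte[] reEncoded = decodeLeniently(bytes).getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(bytes, reEncoded);
    }
}
```

For a test fixture, a byte array like `INVALID_UTF8` could be written to a file with `Files.write(path, bytes)`; the test can then assert either that the reader throws or that it returns a value containing U+FFFD, depending on which behaviour you consider correct.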

The solution is to encode the binary data as text, for instance in Base64.

byte[] b = ...
String s = Base64.getEncoder().encodeToString(b); // encode(b) would return byte[], not String

There are several encoders (basic, URL-safe and MIME), and you can control properties such as line wrapping and padding.
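As a rough sketch of those encoder variants, the class below wraps the `java.util.Base64` factory methods. The class and method names are only illustrative. Unlike decoding raw bytes as UTF-8, Base64 text decodes back to exactly the original bytes.

```java
import java.util.Base64;

public class Base64Variants {

    // Standard alphabet ('+' and '/'), padded with '=', no line breaks.
    static String basic(byte[] b) {
        return Base64.getEncoder().encodeToString(b);
    }

    // Same alphabet, but without the trailing '=' padding.
    static String unpadded(byte[] b) {
        return Base64.getEncoder().withoutPadding().encodeToString(b);
    }

    // URL- and filename-safe alphabet ('-' and '_' instead of '+' and '/').
    static String urlSafe(byte[] b) {
        return Base64.getUrlEncoder().encodeToString(b);
    }

    // MIME flavour: output is wrapped into lines of at most 76 characters,
    // separated by CRLF.
    static String mime(byte[] b) {
        return Base64.getMimeEncoder().encodeToString(b);
    }

    // Decodes text produced by basic() back into the original bytes.
    static byte[] decodeBasic(String s) {
        return Base64.getDecoder().decode(s);
    }
}
```

For the two bytes `0xFB 0xFF`, `basic` gives `+/8=`, `unpadded` gives `+/8` and `urlSafe` gives `-_8=`.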

Or you might be more comfortable with a hexadecimal representation.
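A hexadecimal representation can be sketched with plain JDK calls as below. `HexCodec`, `toHex` and `fromHex` are hypothetical helper names; on newer JDKs `java.util.HexFormat` may be a more convenient option.

```java
public class HexCodec {

    // Renders each byte as two lowercase hex digits.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            // Mask with 0xFF so negative bytes are not sign-extended.
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    // Parses a string of hex digit pairs back into bytes.
    static byte[] fromHex(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("Hex string must have an even length");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }
}
```

With this, the invalid UTF-8 sequence `46 6f 6f c3 28` can be kept in source code or in a JSON value as the ordinary string `466f6fc328` and converted back to bytes when needed.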
