I think I have looked everywhere. I have found some examples in Ruby but nothing coherent in Java.

How can I generate non-UTF-8 string / char in Java for testing purposes?

Specifically, I have a JSON file that holds key-value pairs for a translation mechanism we use (so a lot of languages are involved). This JSON is parsed by a mapper that we have.

I want to check whether the mapper returns the proper values when non-UTF-8 data is encountered in the JSON.

So I would like to use it in a test like this:

String expectedValue = "FooBarNonUtf8";
String actualValue = jsonReader.readFrom("file", "key"); //should parse non-UTF correctly
assertEquals(expectedValue, actualValue);
Johnny
  • what is your definition of *non UTF-8*? – Eugene Aug 15 '18 at 09:27
  • Since UTF-8 is meant to comprise all characters (even emojis) it's a little hard to generate one that isn't in UTF-8. Or do you mean another encoding, like Latin-1, i.e. any encoding except UTF-8? – Thomas Aug 15 '18 at 09:29
  • UTF-8 is a character encoding. It transforms characters (which don't have any encoding), into bytes. Not sure what you mean by "non-UTF-8 string", since a String contains characters, not bytes. – JB Nizet Aug 15 '18 at 09:29
  • surely you need to construct it as bytes. Check this question: https://stackoverflow.com/questions/16031620/how-can-i-generate-a-non-utf-8-character-set – Alan Deep Aug 15 '18 at 09:30
  • Strings and characters in Java itself are always UTF-16... Encoding is only relevant when transforming to or from bytes. What exactly do you want to achieve? – Mark Rotteveel Aug 15 '18 at 09:31
  • @JBNizet *since a String contains characters, not bytes*, well *internally though* is a different story since java-9 – Eugene Aug 15 '18 at 09:35
  • @MarkRotteveel *not* since java-9 – Eugene Aug 15 '18 at 09:36
  • @Eugene that is an implementation detail. The public API of a String allows getting characters, and all Unicode code points are supported. How these code points are stored internally doesn't really matter here. – JB Nizet Aug 15 '18 at 09:37
  • @JBNizet right, reason I said *internally* – Eugene Aug 15 '18 at 09:37
  • @Eugene Well, you are right that the in-memory encoding changed, but characters are still UTF-16 in their numbering and usage (eg surrogate pairs etc). – Mark Rotteveel Aug 15 '18 at 09:40
  • @MarkRotteveel well if there is a surrogate pair it is UTF-16 already - 2 bytes, if there isn't it's LATIN_1, thus a single byte – Eugene Aug 15 '18 at 09:42
  • @Eugene You're talking about the in-memory encoding again. I'm talking about what characters represent. – Mark Rotteveel Aug 15 '18 at 09:45
  • Do you want an invalid UTF-8 `byte[]` to test if you can convert it to `String`? – Rob Audenaerde Aug 15 '18 at 09:52
  • https://stackoverflow.com/questions/1301402/example-invalid-utf8-string – Rob Audenaerde Aug 15 '18 at 09:53
  • Thanks everyone. I've added some details about the flow I have. – Johnny Aug 15 '18 at 11:35
  • "jsonReader … should parse non-UTF correctly": Whatever do you mean? JSON documents exchanged between systems are required to be UTF-8-encoded. Isn't invalid simply invalid? Send it back. – Tom Blodget Aug 18 '18 at 22:46
  • @TomBlodget that's the reason I'm trying to test it - we received a non-UTF value from the JSON (it's a lot of files which hold multilingual translations for our product). – Johnny Aug 19 '18 at 07:46
  • So wouldn't the correct behavior for a file that fails to meet the expectation that it can be decoded to text using UTF-8 be an exception? Then wouldn't the correct behavior for text that fails to meet the expectation that it's valid JSON be an exception? – Tom Blodget Aug 20 '18 at 16:53
  • @TomBlodget exactly, but for that I need to create a file with invalid UTF-8 content and verify it in a test. That's exactly what my question is about. – Johnny Aug 20 '18 at 22:25

1 Answer

Java made the following design decision, after the catastrophes with encoding in C/C++ (at that point in history):

  • String, char, Reader, Writer are for handling Unicode text; a char is a two-byte UTF-16 code unit.
  • byte[], InputStream, OutputStream are for binary data, which, given some encoding/Charset, could be text.

So you can only abuse String/char to hold arbitrary bytes, and doing so is almost guaranteed to corrupt the data (some byte values have a special structural meaning in UTF-*). Invalid UTF-8 can only exist as bytes, not as a String.
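For the test described in the question, a minimal sketch of this idea follows. It assumes the invalid input is built as a `byte[]`; the class name `InvalidUtf8Demo` and its methods are hypothetical names chosen for illustration. The byte `0xFF` never occurs in well-formed UTF-8, so lenient decoding replaces it with U+FFFD while a strict `CharsetDecoder` rejects it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the original answer.
class InvalidUtf8Demo {

    // "Foo" followed by 0xFF, a byte that is never valid in UTF-8.
    static final byte[] INVALID_UTF8 = { 'F', 'o', 'o', (byte) 0xFF };

    // new String(bytes, charset) never throws; malformed input is
    // silently replaced with U+FFFD (the replacement character).
    static String decodeLeniently(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // A decoder configured with REPORT throws on malformed input instead.
    static String decodeStrictly(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    // True only if the bytes decode as UTF-8 without any error.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            decodeStrictly(bytes);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

The same byte array could be written to a fixture file with `Files.write(path, bytes)` and then handed to the mapper under test; whether the mapper should throw or substitute U+FFFD is a decision for the test's author.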

The solution is to encode the binary data as text, for instance in Base64.

import java.util.Base64;
byte[] b = ...
String s = Base64.getEncoder().encodeToString(b); // encode(b) would return byte[], not String

There are several encoders (basic, URL-safe, MIME), and you can set properties such as line wrapping and padding.
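As a rough sketch of those encoder variants, the following uses only `java.util.Base64` from the JDK. The class name `Base64Demo` and its method names are hypothetical and exist only for illustration.

```java
import java.util.Base64;

// Hypothetical helper for illustration; not part of the original answer.
class Base64Demo {

    // Basic encoder: no line breaks, '=' padding at the end.
    static String toBase64(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    // Same alphabet, but the trailing '=' padding is omitted.
    static String toBase64NoPadding(byte[] data) {
        return Base64.getEncoder().withoutPadding().encodeToString(data);
    }

    // MIME encoder: wraps the output into lines separated by "\r\n".
    static String toMimeBase64(byte[] data) {
        return Base64.getMimeEncoder().encodeToString(data);
    }

    // Decodes text produced by the basic encoder back into the original bytes.
    static byte[] fromBase64(String text) {
        return Base64.getDecoder().decode(text);
    }
}
```

Because the encoded form is plain ASCII, arbitrary bytes (including invalid UTF-8) can be stored in a String or a JSON value and decoded back without loss.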

Or you might be more comfortable with a hexadecimal representation.
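A hexadecimal round trip could look like the sketch below. It is hand-rolled so that it does not depend on a particular JDK version; `HexDemo`, `toHex` and `fromHex` are hypothetical names for illustration only.

```java
// Hypothetical helper for illustration; not part of the original answer.
class HexDemo {

    // Renders each byte as two lowercase hex digits.
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length * 2);
        for (byte b : data) {
            sb.append(Character.forDigit((b >> 4) & 0xF, 16));
            sb.append(Character.forDigit(b & 0xF, 16));
        }
        return sb.toString();
    }

    // Parses a string of hex digit pairs back into bytes.
    static byte[] fromHex(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("odd number of hex digits");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = Character.digit(hex.charAt(2 * i), 16);
            int lo = Character.digit(hex.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("not a hex digit at pair " + i);
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }
}
```

Hex doubles the size of the data, whereas Base64 grows it by about a third, but hex is easier to read when comparing individual bytes in a failing test.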

Joop Eggen