
I am integration testing a component. The component allows you to save and fetch strings.

I want to verify that the component is handling UTF-8 characters properly. What is the minimum test that is required to verify this?

I think that doing something like this is a good start:

// This is the ☺ character
String toSave = "\u263A";
int id = 123;

// Saves to Database
myComponent.save( id, toSave );

// Retrieve from Database
String fromComponent = myComponent.retrieve( id );

// Verify they are the same
org.junit.Assert.assertEquals( toSave, fromComponent );

One mistake I have made in the past is setting String toSave = "è". My test passed because the string was saved to and retrieved from the DB correctly. Unfortunately, the application was not actually working correctly, because the app was using ISO 8859-1 encoding. This meant that è worked but other characters like ☺ did not.
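
For illustration, here is a minimal sketch of why è could never catch that bug, simulating the lossy ISO 8859-1 round trip directly:

// è has an ISO 8859-1 mapping (0xE8), so a lossy ISO 8859-1 round trip preserves it.
// ☺ (U+263A) has no ISO 8859-1 mapping, so the same round trip silently turns it into "?".
java.nio.charset.Charset latin1 = java.nio.charset.StandardCharsets.ISO_8859_1;

String e = "è";
String smiley = "\u263A";

String eBack = new String(e.getBytes(latin1), latin1);
String smileyBack = new String(smiley.getBytes(latin1), latin1);

System.out.println(e.equals(eBack));           // true
System.out.println(smiley.equals(smileyBack)); // false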

Question restated: What is the minimum test (or tests) to verify that I can persist UTF-8 encoded strings?

sixtyfootersdude

3 Answers


A code and/or documentation review is probably your best option here. But you can probe if you want. It seems that a sufficient test is the goal and minimizing it is less important. It is hard to figure out what a sufficient test is based only on speculation about what the threat would be, but here's my suggestion: test all codepoints, including U+0000, and proper handling of "combining characters".

The method you want to test has a Java string as a parameter. Java doesn't have "UTF-8 encoded strings": Java's native text datatypes use the UTF-16 encoding of the Unicode character set. This is common for in-memory representations of text; it is used by Java, .NET, JavaScript, VB6, VBA, and others. UTF-8 is commonly used for streams and storage, so it makes sense to ask about it in the context of "saving and fetching". Databases typically offer one or more of UTF-8, 3-byte-limited UTF-8, or UTF-16 (NVARCHAR) datatypes and collations.
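
A small sketch of the distinction, using the ☺ character from the question plus a supplementary-plane character:

import java.nio.charset.StandardCharsets;

String smiley = "\u263A";      // ☺, one codepoint in the Basic Multilingual Plane
String emoji = "\uD83D\uDE00"; // 😀 (U+1F600), one codepoint above U+FFFF

System.out.println(smiley.length()); // 1: UTF-16 code units in the Java String
System.out.println(emoji.length());  // 2: a surrogate pair in UTF-16
System.out.println(smiley.getBytes(StandardCharsets.UTF_8).length); // 3 bytes as UTF-8
System.out.println(emoji.getBytes(StandardCharsets.UTF_8).length);  // 4 bytes as UTF-8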

The encoding is an implementation detail. If the component accepts a Java string, it should either throw an exception for data it is unwilling to handle or handle it properly.

"Characters" is a rather ill-defined term. Unicode codepoints range from 0x0 to 0x10FFFF—21 bits. Some codepoints are not assigned (aka "defined"), depending on the Unicode Standard revision. Java datatypes can handle any codepoint, but information about them is limited by version. For Java 8, "Character information is based on the Unicode Standard, version 6.2.0.". You can limit the test to "defined" codepoints or go all possible codepoints.

A codepoint is either a base "character" or a "combining character". Also, each codepoint is in exactly one Unicode category; two of those categories are for combining characters. To form a grapheme, a base character is followed by zero or more combining characters. It might be difficult to lay out graphemes graphically (see Zalgo text), but for text storage all that is needed is to not mangle the sequence of codepoints (and the byte order, if applicable).

So, here is a non-minimal, somewhat comprehensive test:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;

import static org.junit.Assert.assertEquals;

final Stream<Integer> codepoints = IntStream
    .rangeClosed(Character.MIN_CODE_POINT, Character.MAX_CODE_POINT)
    .filter(cp -> Character.isDefined(cp)) // optional filtering
    .boxed();
final int[] combiningCategories = {
    Character.COMBINING_SPACING_MARK,
    Character.ENCLOSING_MARK
};
Arrays.sort(combiningCategories); // binarySearch requires a sorted array
final Map<Boolean, List<Integer>> partitionedCodepoints = codepoints
    .collect(Collectors.partitioningBy(cp ->
        Arrays.binarySearch(combiningCategories, Character.getType(cp)) < 0));
final Integer[] baseCodepoints = partitionedCodepoints.get(true)
    .toArray(new Integer[0]);
final Integer[] combiningCodepoints = partitionedCodepoints.get(false)
    .toArray(new Integer[0]);
final int baseLength = baseCodepoints.length;
final int combiningLength = combiningCodepoints.length;
final StringBuilder graphemes = new StringBuilder();
for (int i = 0; i < baseLength; i++) {
    graphemes.append(Character.toChars(baseCodepoints[i]));
    graphemes.append(Character.toChars(combiningCodepoints[i % combiningLength]));
}
final String test = graphemes.toString();
final byte[] testUTF8 = StandardCharsets.UTF_8.encode(test).array();

// Java 8 counts for when filtering by Character.isDefined
assertEquals(736681, test.length());    // number of UTF-16 code units
assertEquals(3241399, testUTF8.length); // number of UTF-8 code units
Tom Blodget

If your component is only capable of storing and retrieving strings, then all you need to do is make sure that nothing gets lost in the conversion between the Unicode strings of Java and the UTF-8 strings that the component stores.

That would involve checking with at least one character from each UTF-8 code point length. So, I would suggest checking with:

  • One character from the US-ASCII set (a 1-byte code point), then

  • One character from Greek (a 2-byte code point), and

  • One character from Chinese (a 3-byte code point).

  • In theory you would also want to check with an emoji (a 4-byte code point), though these cannot be represented in Java's Unicode strings, so it's a moot point.

A useful extra test would be to try a string combining at least one character from each of the above cases, so as to make sure that characters of different code-point lengths can co-exist within the same string.
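
For example, a minimal sketch of such a test, reusing the save/retrieve calls from the question (the sample characters are just one possible choice):

// One character per UTF-8 encoded length listed above, plus a string mixing them.
String[] samples = {
    "A",             // US-ASCII: 1 byte in UTF-8
    "\u03B1",        // Greek small letter alpha (α): 2 bytes
    "\u4E2D",        // CJK ideograph 中: 3 bytes
    "A\u03B1\u4E2D"  // different code-point lengths co-existing in one string
};
int id = 123;
for (String toSave : samples) {
    myComponent.save(id, toSave);
    String fromComponent = myComponent.retrieve(id);
    org.junit.Assert.assertEquals(toSave, fromComponent);
    id++; // a fresh id per sample so earlier samples are not overwritten
}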

(If your component does anything more than storing and retrieving strings, like searching for strings, then things can get a bit more complicated, but it seems to me that you specifically avoided asking about that.)

I do believe that black box testing is the only kind of testing that makes sense, so I would not recommend polluting the interface of your component with methods that would expose knowledge of its internals. However, there are two things that you can do to increase the testability of the component without ruining its interface:

  1. Introduce additional functions to the interface that might help with testing, without disclosing anything about the internal implementation and without requiring the testing code to have knowledge of the component's internals.

  2. Introduce functionality useful for testing in the constructor of your component. The code that constructs the component knows precisely what component it is constructing, so it is intimately familiar with the nature of the component, and it is therefore okay to pass something implementation-specific there.

An example of what you could do with either of the above techniques would be to artificially (and severely) limit the number of bytes that the internal representation is allowed to occupy, so that you can make sure that a certain string you are planning to store will fit. For instance, you could limit the internal size to no more than 9 bytes, and then make sure that a Java Unicode string containing 3 Chinese characters gets properly stored and retrieved.
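
For example (a sketch only; the MyComponent constructor and its byte-cap parameter are hypothetical, not something your question's API provides):

// Hypothetical: the component is assumed to accept a cap on its internal storage, in bytes.
MyComponent component = new MyComponent(9 /* max internal bytes, assumed parameter */);

String threeChinese = "\u4E2D\u6587\u5B57"; // 中文字: 3 characters, 9 bytes in UTF-8
component.save(123, threeChinese);

// With the internal size capped at 9 bytes, the string survives only if the
// component neither inflates nor corrupts it on the way in.
org.junit.Assert.assertEquals(threeChinese, component.retrieve(123));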

Mike Nakis
  • `System.out.println("");` works just fine in my Oracle Java 8 (1.8). (I would think it would in any version.) – Tom Blodget May 01 '17 at 16:56
  • @TomBlodget that's interesting. Thanks for sharing. Either it is not in the unicode range above U+10000 or there is something fundamentally wrong in my understanding of how java stores strings. I will look into it when I get a chance. In the mean time, if it is easy for you, could you please post the representation of a string containing this character as a) `getBytes( StandardCharsets.UTF_8 )` and b) `getBytes( StandardCharsets.UTF_16 )`? – Mike Nakis May 01 '17 at 18:16
  • Also of interest: https://github.com/minimaxir/big-list-of-naughty-strings – Mike Nakis Nov 16 '17 at 12:59

String instances use a predefined and unchangeable internal encoding (16-bit code units).
So, returning only a String from your service is probably not enough to do this check.
You should try to return the byte representation of the persisted String (a byte array, for example) and compare the content of that array with the "\u263A" String encoded in bytes with the UTF-8 charset.

String toSave = "\u263A";  
int id = 123;

// Saves to Database
myComponent.save(id, toSave );

// Retrieve from Database
byte[] actualBytes = myComponent.retrieve(id );

// assertion
byte[] expectedBytes = toSave.getBytes(Charset.forName("UTF-8"));
Assert.assertTrue(Arrays.equals(expectedBytes, actualBytes));
davidxxx
  • I am doing black box testing so I can't add a new method to `myComponent`. – sixtyfootersdude Apr 21 '17 at 20:59
  • Yet, you should have a way to know how the field is stored (with a Java method, or even by querying the stored String content directly). Code that is not testable cannot produce tested code. – davidxxx Apr 21 '17 at 21:57