Really Good, Bad UTF-8 example test data

Question

So we have the XSS cheat sheet to test our XSS filtering - but other than an example benign page I can't find any evil or malformed test data to make sure that my UTF-8 code can handle misbehaving data.

Where can I find some good uh.. bad data to test with? Or what is a tricky sequence of chars?

http://www.columbia.edu/kermit/utf8.html is another good one — Xeoncross, Dec 06 '10 at 15:29
ăѣծềſģȟᎥǩľḿꞑȯȶψ1234567890!@#$%^&*()-_=+[{]};:'",<.>/?`~Ḇ٤ḞԍНǏƘԸⲘ০ΡɌȚЦѠƳȤѧᖯćễႹļṃŉоᵲꜱừŵź1234567890!@#$%^&*()-_=+[{]};:'",<.>/?`~АḂⲤꞠꓧȊꓡǬŖꓫŸảƀḋếᵮℊᎥкιṃդⱺŧṽẉყž1234567890!@#$%^&*()-_=+[{]};:'",<.>/?`~ѦƇᗞΣℱԍҤ١КƝȎṚṮṺƲᏔꓫᏏçძḧҝɭḿṛтúẃ⤬1234567890!@#$%^&*()-_=+[{]};:'",<.>/?`~ΒĢȞỈꓗʟℕ০ՀꓢṰǓⅤⲬ — Andrew, Feb 10 '19 at 14:41
Check out [Markus Kuhn’s UTF-8 decoder stress test](http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt) — zildjohn01, Aug 23 '09 at 19:33
I'd warn you his test is based on an outdated definition of UTF-8, when 5 and 6 byte sequences were allowed, before planes 17 and above were deleted. And it implies that codepoints U+FFFE and U+FFFF are invalid in UTF-8, when [per the Unicode consortium they are not](http://www.unicode.org/faq/private_use.html#nonchar8) — Simon Kissane, Feb 23 '14 at 10:41

score 41 · Answer 1 · edited May 23 '17 at 10:31

See also How does a file with Chinese characters know how many bytes to use per character? — no doubt, there are other SO questions that would also help.

In UTF-8, you get the following types of bytes:

Binary    Hex          Comments
0xxxxxxx  0x00..0x7F   Only byte of a 1-byte character encoding
10xxxxxx  0x80..0xBF   Continuation bytes (1-3 continuation bytes)
110xxxxx  0xC0..0xDF   First byte of a 2-byte character encoding
1110xxxx  0xE0..0xEF   First byte of a 3-byte character encoding
11110xxx  0xF0..0xF4   First byte of a 4-byte character encoding

(The last line looks as if it should read 0xF0..0xF7; however, the 21-bit range of Unicode (U+0000 - U+10FFFF) means that the maximum valid value is 0xF4; values 0xF5..0xF7 cannot occur in valid UTF-8.)
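As a rough illustration of the table above, here is a minimal Python sketch that sorts a single byte value into those classes. The function name `classify_byte` and the labels it returns are made up for this example. It reports 0xC0 and 0xC1 as invalid, for the reason this answer gives under non-minimal sequences.

```python
def classify_byte(b: int) -> str:
    """Classify one byte value (0..255) by the role it can play in UTF-8."""
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte value")
    if b <= 0x7F:
        return "ascii"          # 0xxxxxxx: complete 1-byte character
    if b <= 0xBF:
        return "continuation"   # 10xxxxxx
    if b in (0xC0, 0xC1):
        return "invalid"        # could only start a non-minimal 2-byte form
    if b <= 0xDF:
        return "lead2"          # 110xxxxx
    if b <= 0xEF:
        return "lead3"          # 1110xxxx
    if b <= 0xF4:
        return "lead4"          # 11110xxx, capped by U+10FFFF
    return "invalid"            # 0xF5..0xFF


if __name__ == "__main__":
    for value in (0x41, 0x80, 0xC3, 0xE2, 0xF0, 0xF5, 0xFF):
        print(f"0x{value:02X}", classify_byte(value))
```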

Looking at whether a particular sequence of bytes is valid UTF-8 means you need to think about:

Continuation bytes appearing where not expected
Non-continuation bytes appearing where a continuation byte is expected
Incomplete characters at end of string (variation of 'continuation byte expected')
Non-minimal sequences
UTF-16 surrogates

In valid UTF-8, the bytes 0xF5..0xFF cannot occur.
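The checks listed above can be sketched as a small hand-rolled validator. This is only an illustration, assuming the input is a Python `bytes` object; the function name `first_utf8_error` and its messages are invented here. In real code, `data.decode("utf-8")` already performs the same checks.

```python
def first_utf8_error(data: bytes):
    """Return (offset, reason) for the first problem found, or None if valid."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b <= 0x7F:                      # plain ASCII
            i += 1
            continue
        if b <= 0xBF:                      # 0x80..0xBF with no lead byte
            return i, "unexpected continuation byte"
        if b in (0xC0, 0xC1) or b >= 0xF5:
            return i, "byte can never appear in valid UTF-8"
        if b <= 0xDF:
            need, cp, minimum = 1, b & 0x1F, 0x80
        elif b <= 0xEF:
            need, cp, minimum = 2, b & 0x0F, 0x800
        else:
            need, cp, minimum = 3, b & 0x07, 0x10000
        for k in range(1, need + 1):
            if i + k >= n:
                return i, "incomplete character at end of input"
            c = data[i + k]
            if not 0x80 <= c <= 0xBF:
                return i + k, "continuation byte expected"
            cp = (cp << 6) | (c & 0x3F)    # accumulate 6 payload bits
        if cp < minimum:
            return i, "non-minimal sequence"
        if 0xD800 <= cp <= 0xDFFF:
            return i, "UTF-16 surrogate"
        if cp > 0x10FFFF:
            return i, "code point beyond U+10FFFF"
        i += need + 1
    return None


if __name__ == "__main__":
    for sample in (b"ok", b"\x80", b"\xc3(", b"\xe2\x82", b"\xed\xa0\x80"):
        print(sample, first_utf8_error(sample))
```

Returning the offset of the first problem, rather than a bare true/false, makes it easier to see which test case tripped which rule.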

Non-minimal sequences

There are multiple possible representations for some characters. For example, the Unicode character U+0000 (ASCII NUL) could be represented by:

0x00
0xC0 0x80
0xE0 0x80 0x80
0xF0 0x80 0x80 0x80

However, the Unicode standard clearly states that the last three alternatives are not acceptable because they are not minimal. It so happens that the bytes 0xC0 and 0xC1 can never appear in valid UTF-8 because the only characters that could be encoded by those are minimally encoded as single byte characters in the range 0x00..0x7F.
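To generate non-minimal test data for any code point, something like the following Python sketch should work. The helper `encode_with_length` is hypothetical (not a library function); it packs a code point into exactly `n` bytes whether or not that is the shortest form. The second helper just asks Python's strict decoder for a verdict.

```python
def encode_with_length(cp: int, n: int) -> bytes:
    """Encode code point cp in exactly n bytes (1..4), minimal or not."""
    if n == 1:
        if not 0 <= cp <= 0x7F:
            raise ValueError("code point does not fit in 1 byte")
        return bytes([cp])
    limits = {2: 0x7FF, 3: 0xFFFF, 4: 0x1FFFFF}    # payload capacity per length
    prefixes = {2: 0xC0, 3: 0xE0, 4: 0xF0}         # lead-byte marker bits
    if n not in limits or not 0 <= cp <= limits[n]:
        raise ValueError("code point does not fit in the requested length")
    out = []
    for _ in range(n - 1):
        out.append(0x80 | (cp & 0x3F))             # low 6 bits, continuation
        cp >>= 6
    out.append(prefixes[n] | cp)                   # remaining bits, lead byte
    return bytes(reversed(out))


def is_valid_utf8(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    for length in (1, 2, 3, 4):
        raw = encode_with_length(0x2F, length)     # '/' in four spellings
        print(length, raw.hex(" "), is_valid_utf8(raw))
```

Overlong encodings of characters such as NUL, `/` or `.` are the interesting ones for security testing, since a sloppy decoder may let them slip past an earlier check on the raw bytes.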

UTF-16 Surrogates

Within the Basic Multi-lingual Plane (BMP), the Unicode values U+D800 - U+DFFF are reserved for UTF-16 surrogates and cannot appear encoded in valid UTF-8. If they were valid in UTF-8 (which, I emphasize, they are not), then the surrogates would be encoded:

U+D800 — 0xED 0xA0 0x80 (smallest high surrogate)
U+DBFF — 0xED 0xAF 0xBF (largest high surrogate)
U+DC00 — 0xED 0xB0 0x80 (smallest low surrogate)
U+DFFF — 0xED 0xBF 0xBF (largest low surrogate)
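If you want to produce these byte sequences rather than type them in, Python can do it with the standard `surrogatepass` error handler, which tells the codec to emit surrogates instead of refusing. The sketch below assumes Python 3.8 or later (for `bytes.hex(" ")`); the helper names are invented for the example.

```python
SURROGATE_EDGES = [0xD800, 0xDBFF, 0xDC00, 0xDFFF]


def surrogate_bytes(cp: int) -> bytes:
    """Return the 3-byte form a surrogate would have if UTF-8 allowed it."""
    if not 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a surrogate code point")
    # A plain .encode("utf-8") raises UnicodeEncodeError for surrogates.
    return chr(cp).encode("utf-8", "surrogatepass")


def strict_decoder_rejects(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder refuses the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


if __name__ == "__main__":
    for cp in SURROGATE_EDGES:
        raw = surrogate_bytes(cp)
        print(f"U+{cp:04X}", raw.hex(" "), "rejected:", strict_decoder_rejects(raw))
```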

Bad Data

So, your BAD data should contain samples violating these various prescriptions.

Continuation byte not preceded by one of the initial byte values
Initial bytes of multi-byte characters not followed by enough continuation bytes
Non-minimal multi-byte characters
UTF-16 surrogates
Invalid bytes (0xC0, 0xC1, 0xF5..0xFF).
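One way to turn that list into a reusable corpus is sketched below in Python. The dictionary name `BAD_SAMPLES`, the category keys and the particular byte strings are my own picks for illustration; each one is meant to break exactly one of the rules above, and the helper checks them against Python's strict decoder.

```python
BAD_SAMPLES = {
    "stray continuation byte": [
        b"\x80",
        b"abc\xbf",
    ],
    "too few continuation bytes": [
        b"\xc3",                # 2-byte lead, input ends
        b"\xe2\x82",            # 3-byte lead, one continuation missing
        b"\xf0\x9f\x98",        # 4-byte lead, one continuation missing
        b"\xc3(",               # lead byte followed by ASCII
    ],
    "non-minimal sequence": [
        b"\xc0\x80",            # NUL in 2 bytes
        b"\xe0\x80\xaf",        # '/' in 3 bytes
        b"\xf0\x80\x80\xaf",    # '/' in 4 bytes
    ],
    "UTF-16 surrogate": [
        b"\xed\xa0\x80",        # U+D800
        b"\xed\xbf\xbf",        # U+DFFF
    ],
    "invalid byte": [
        b"\xc1\xbf",
        b"\xf5\x80\x80\x80",
        b"\xfe",
        b"\xff",
    ],
}


def decodes(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    for category, samples in BAD_SAMPLES.items():
        for raw in samples:
            print(f"{category:28} {raw.hex(' '):12} accepted: {decodes(raw)}")
```

To test your own decoder, swap `decodes` for a wrapper around it and check that it accepts none of the samples.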

Note that a byte-order mark (BOM) U+FEFF, aka zero-width no-break space (ZWNBSP), cannot appear unencoded in UTF-8 — the bytes 0xFF and 0xFE are not permitted in valid UTF-8. An encoded ZWNBSP can appear in a UTF-8 file as 0xEF 0xBB 0xBF, but the BOM is completely superfluous in UTF-8.
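For a concrete look at the BOM cases, here is a short Python sketch using only the standard library (`codecs.BOM_UTF8` and the `utf-8-sig` codec). The variable and function names are just for the example.

```python
import codecs

data = codecs.BOM_UTF8 + "hello".encode("utf-8")   # EF BB BF + text

kept = data.decode("utf-8")          # BOM survives as a leading U+FEFF
stripped = data.decode("utf-8-sig")  # this codec drops a leading BOM


def is_valid_utf8(raw: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    print(repr(kept), repr(stripped))
    # The raw UTF-16 BOM bytes are not UTF-8 at all.
    print(is_valid_utf8(b"\xff\xfe"), is_valid_utf8(b"\xfe\xff"))
```

A file that starts with 0xEF 0xBB 0xBF is therefore valid input that your code has to decide how to treat, while one that starts with 0xFF 0xFE or 0xFE 0xFF is simply not UTF-8.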

There are also some noncharacters in Unicode. U+FFFE and U+FFFF are two such noncharacters (and the last two code points in each plane, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, ... U+10FFFE, U+10FFFF are others). These should not normally appear in Unicode data for data exchange, but can appear in private use. See the Unicode FAQ link for lots of sordid details, including the rather complex history of noncharacters in Unicode. (Corrigendum #9: Clarification About Noncharacters, which was released in January 2013, does what its title suggests — clarifies the meaning of non-characters.)
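Since noncharacters are legal in UTF-8 itself, they make good "looks bad but must be accepted" test data. A minimal Python sketch follows; `is_noncharacter` and `roundtrips` are hypothetical helper names. Note that the check also covers U+FDD0..U+FDEF, a block of noncharacters the paragraph above does not mention.

```python
def is_noncharacter(cp: int) -> bool:
    """True for the Unicode noncharacter code points."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # Last two code points of every plane, plus the U+FDD0..U+FDEF block.
    return (cp & 0xFFFE) == 0xFFFE or 0xFDD0 <= cp <= 0xFDEF


def roundtrips(cp: int) -> bool:
    """True if the code point survives a strict UTF-8 encode and decode."""
    text = chr(cp)
    return text.encode("utf-8").decode("utf-8") == text


if __name__ == "__main__":
    for cp in (0xFFFE, 0xFFFF, 0x1FFFE, 0x10FFFF):
        raw = chr(cp).encode("utf-8")
        print(f"U+{cp:04X}", raw.hex(" "), "round-trips:", roundtrips(cp))
```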

Thanks for this great list. I plan on checking each of these out in more detail now. — Xeoncross, Aug 25 '09 at 02:54
The comment that non-characters "should not appear in UTF-8 encoded data" is misleading. Non-characters should not appear in UTF-8 encoded data _intended for open interchange_, but nonetheless [should be accepted by UTF-8 encoders/decoders](http://www.unicode.org/faq/private_use.html#nonchar8) — Simon Kissane, Feb 23 '14 at 12:03
@SimonKissane: Apparently, I was one of the many confused by the status quo ante [Corrigendum #9](http://www.unicode.org/versions/corrigendum9.html), which was released in January 2013, it seems. The whole section of the Unicode FAQ on [noncharacters](http://www.unicode.org/faq/private_use.html#noncharacters) is worth a read. Thanks for the info. (I'll also note that my comment says 'should', which agrees with what the Unicode standard said (but not 'says'); the intention is that they should not appear in 'open interchange' but can be used for 'internal use'.) — Jonathan Leffler, Feb 23 '14 at 16:04
@AdrianMaire: See table 3.6 in [Chapter 3](http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf) of the Unicode (9.0.0) standard (page number 125; p54 of the PDF file). I'm not sure which other sources you're consulting, but I think what I've said is covered in that table. — Jonathan Leffler, Mar 02 '17 at 07:12
@JonathanLeffler You are 100% correct. Thank you for the reference. — Adrian Maire, Mar 02 '17 at 07:15

score 17 · Answer 2 · edited Feb 10 '19 at 14:39

You can use this handy online tool from Jeffrey Bergamini to convert any text into a really weird UTF-8 string of homoglyphs.

A typical sentence such as

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.

becomes something like this:

Ḽơᶉëᶆ ȋṕšᶙṁ ḍỡḽǭᵳ ʂǐť ӓṁệẗ, ĉṓɲṩḙċťᶒțûɾ ấɖḯƥĭṩčįɳġ ḝłįʈ, șếᶑ ᶁⱺ ẽḭŭŝḿꝋď ṫĕᶆᶈṓɍ ỉñḉīḑȋᵭṵńť ṷŧ ḹẩḇőꝛế éȶ đꝍꞎôꝛȇ ᵯáꞡᶇā ąⱡîɋṹẵ.

answered Dec 15 '16 at 15:08 by Shebuka; last edited by Andrew

I suppose it is because this does not really help to test UTF-8: you do not get anything close to the full set of cases, there are no "bad" cases and the format is not really helpful for testing. It is only a way to get strange characters. – Adrian Maire Mar 02 '17 at 07:09
Have you tried it? That generator is not for fun. It gives you characters from the full UTF-8 range, and because they are strangely similar to actual characters you can 'see' which chars are giving you problems. In the example I've posted there are 6 chars that my iPhone renders as boxed question marks. – Shebuka Mar 02 '17 at 11:41
IMO, this wonderful tool could have been a very nice "Added value" to an explanation, but does not fit as an answer by itself in SO (also because the page may be discontinued). Anyway, I agree that a -1 without explanation is not very constructive. – Adrian Maire Mar 02 '17 at 11:51
So this is "good, good utf-8 example test data"... worth an upvote as it's related, IMO – Rondo Jun 04 '18 at 17:33

score 2 · Answer 3 · edited Aug 23 '09 at 17:32

Off the top of my head:

0xff and 0xfe

Single high-bit bytes

Multi-byte representation of low-byte characters - A good way of smuggling nulls past early checks

Byte-order marks - Are you going to ignore them?

NFC vs. NFD
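On the NFC vs. NFD point: both forms are perfectly valid UTF-8, which is what makes them good test data for code that compares or searches strings. Here is a quick Python illustration using the standard `unicodedata` module; the variable names are just for the example.

```python
import unicodedata

composed = "\u00e9"        # é as one code point (NFC form)
decomposed = "e\u0301"     # e followed by a combining acute accent (NFD form)

nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

if __name__ == "__main__":
    # Same visible character, different code points and different bytes.
    print(composed == decomposed)
    print(composed.encode("utf-8").hex(" "))
    print(decomposed.encode("utf-8").hex(" "))
    # After normalizing both sides to one form, they compare equal.
    print(nfc == composed, nfd == decomposed)
```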

answered Aug 23 '09 at 17:22 by Douglas Leeder
