For questions about Unicode noncharacters, the code points that are guaranteed never to be assigned to a character in the Unicode Standard. When using this tag, also tag the language you are coding in for context, if applicable. Where possible, include the more generic [unicode] tag too.
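As a rough illustration of which code points the tag covers, here is a minimal Python sketch. It assumes the usual definition of the 66 noncharacters: the block U+FDD0 to U+FDEF plus the last two code points of each of the 17 planes. The function name `is_noncharacter` is made up for this example and is not part of any library.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the 66 Unicode noncharacters (hypothetical helper)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {cp:#x}")
    # Contiguous block of 32 noncharacters in the Basic Multilingual Plane.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # Last two code points of every plane: U+xxFFFE and U+xxFFFF.
    return (cp & 0xFFFE) == 0xFFFE


# Enumerate every noncharacter by scanning the whole code space.
noncharacters = [cp for cp in range(0x110000) if is_noncharacter(cp)]
print(len(noncharacters))  # 66
print([f"U+{cp:04X}" for cp in noncharacters if cp > 0xFFFF][:4])
```

The mask test works because the low 16 bits of U+nFFFE and U+nFFFF differ only in the last bit, so one comparison covers both code points in every plane.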
Questions tagged [noncharacter]
14 questions
102
votes
3 answers
Really Good, Bad UTF-8 example test data
So we have the XSS cheat sheet to test our XSS filtering - but other than an example benign page I can't find any evil or malformed test data to make sure that my UTF-8 code can handle misbehaving data.
Where can I find some good uh.. bad data to…

Xeoncross
- 55,620
- 80
- 262
- 364
56
votes
2 answers
What's the purpose of the noncharacters U+FDD0 to U+FDEF?
U+FFFE needs to be a noncharacter in order to allow the Byte Order Mark to work.
U+FFFF is described in The Unicode Standard as "useful for internal purposes as sentinels". Makes sense.
But I can't figure out, and The Unicode Standard doesn't…

dan04
- 87,747
- 23
- 163
- 198
26
votes
4 answers
Can a valid Unicode string contain FFFF? Is Java/CharacterIterator broken?
Here's an excerpt from java.text.CharacterIterator documentation:
This interface defines a protocol for bidirectional iteration over text. The iterator iterates over a bounded sequence of characters. [...] The methods previous() and next() are…

polygenelubricants
- 376,812
- 128
- 561
- 623
3
votes
1 answer
What is a "noncharacter" in unicode?
I don't know what "noncharacter" characters are. They are forbidden Unicode characters, though I can copy and paste them, like U+FFFF. If a character has a fixed position in Unicode, and can be used to display something, then:
Why are those…

InfiniteUniverse
- 31
- 1
1
vote
4 answers
Unicode Noncharacters
Is there a good resource for finding the last two characters of each plane, particularly planes 3–13?
Obviously 0xFFFE and 0xFFFF are noncharacters, as well as 0x10FFFE and 0x10FFFF, but I can't find a complete list as to where the last characters…

Joe Caraccio
- 1,899
- 3
- 24
- 41
1
vote
0 answers
Detecting non-character Unicode characters
I'm working on an application that eventually reads and prints arbitrary and untrustable Unicode characters to the screen.
There are a number of ways to wreak havoc using Unicode strings, and I would like my program to behave correctly for…

zneak
- 134,922
- 42
- 253
- 328
1
vote
1 answer
Why are certain characters prohibited in the HTML5 spec?
According to the HTML5 spec (just after the table), the following characters are prohibited:
Otherwise, return a character token for the Unicode character whose code point is that number. Additionally, if the number is in the range 0x0001 to…

Daniel Fath
- 16,453
- 7
- 47
- 82
0
votes
1 answer
How can I get a 'Group Separator' (0x1D) character from a TextBox, RichTextBox, etc. in C#?
I use a USB 2D barcode scanner to scan a GS1 DataMatrix code, which keys the barcode text into a computer over USB like a keyboard. The text uses the 'Group Separator' character, 0x1D, as a delimiter.
When I put cursor in a Hex/Text editor then scan, the 'Group…

Kritsada Tattanon
- 111
- 2
- 5
0
votes
1 answer
How do I go about converting text with control characters to properly formatted text in IntelliJ
I'm trying to take some text in a format where all the spacing, tabs, and newlines (control characters, NPCs) are present, and have it output to a file in IntelliJ formatted as those control characters dictate.
I may be going about…

RatavaWen
- 147
- 1
- 8
0
votes
1 answer
Is this Google Closure UTF-8 string valid?
In the Google Closure UTF-8 to byte array tests is the string
\u0000\u007F\u0080\u07FF\u0800\uFFFF
which is supposed to be converted to the array
[0x00, 0x7F, 0xC2, 0x80, 0xDF, 0xBF, 0xE0, 0xA0, 0x80, 0xEF, 0xBF, 0xBF]
I've tried a few other…

James McLachlan
- 1,368
- 13
- 27
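The string and byte array quoted in the question above can be checked against a standard UTF-8 codec. The sketch below uses Python's built-in `str.encode` and `bytes.decode`; it only shows how one implementation treats the sequence, including the noncharacter U+FFFF, and is not the Google Closure test itself.

```python
# Test string and expected bytes copied from the question excerpt.
text = "\u0000\u007F\u0080\u07FF\u0800\uFFFF"
expected = [0x00, 0x7F, 0xC2, 0x80, 0xDF, 0xBF, 0xE0, 0xA0, 0x80, 0xEF, 0xBF, 0xBF]

# Encode to UTF-8 and compare byte by byte.
encoded = list(text.encode("utf-8"))
print(encoded == expected)  # True

# Decoding the byte array gives back the original string, U+FFFF included.
decoded = bytes(expected).decode("utf-8")
print(decoded == text)  # True
```

The six code points sit at the boundaries of the 1-, 2-, and 3-byte UTF-8 ranges, which is presumably why they were chosen as test data.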
0
votes
1 answer
Strip invalid characters and noncharacters from UTF-8
I'm loading some data, processing it, then sending data to an application which (fair enough) doesn't allow the invalid UTF-8 noncharacters U+FDD0 through U+FDEF, as well as the invalid U+FFFE and U+FFFF special characters.
My raw data is out of my…

Amedee d'Aboville
- 95
- 7
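One possible way to approach the stripping described in the question above, sketched in Python. The helper names `strip_noncharacters` and `clean_utf8` are hypothetical, and the choice to silently drop undecodable bytes with `errors="ignore"` is an assumption; replacing them with U+FFFD via `errors="replace"` may suit other applications better.

```python
import re

# Character class covering U+FDD0..U+FDEF plus the last two code points
# of each of the 17 planes (U+FFFE, U+FFFF, U+1FFFE, ... U+10FFFF).
_NONCHAR_RE = re.compile(
    "[\uFDD0-\uFDEF"
    + "".join(
        chr(plane << 16 | 0xFFFE) + chr(plane << 16 | 0xFFFF)
        for plane in range(17)
    )
    + "]"
)


def strip_noncharacters(text: str) -> str:
    """Remove all 66 Unicode noncharacters from an already decoded string."""
    return _NONCHAR_RE.sub("", text)


def clean_utf8(raw: bytes) -> str:
    """Decode raw bytes as UTF-8, dropping invalid sequences, then strip noncharacters."""
    return strip_noncharacters(raw.decode("utf-8", errors="ignore"))


print(strip_noncharacters("a\uFDD0b\uFFFFc\U0010FFFEd"))  # abcd
print(clean_utf8(b"a\xffb\xef\xbf\xbf"))  # ab
```

Decoding first and filtering code points afterwards keeps the two concerns separate: the codec handles malformed byte sequences, and the regular expression handles well-formed but unwanted noncharacters.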
0
votes
1 answer
Which Unicode code point can be used safely as a reserved value?
Background
I am writing a DFA-based regex parser. For performance reasons, I need to use a dictionary [Unicode.Scalar : State] to map the next states. Now I need a bunch of special Unicode values to represent special character expressions like .,…

dawnstar
- 507
- 5
- 10
0
votes
1 answer
Why are the two last points on supplemental PUAs excluded?
The supplemental PUAs (F0000-FFFFD and 100000-10FFFD) have explicitly excluded FFFFE, FFFFF, 10FFFE and 10FFFF by defining them as noncharacters. Why was this done? Without this they would be nice 65536-point blocks.

skyking
- 13,817
- 1
- 35
- 57
0
votes
1 answer
Which nonnegative integers aren't assigned a character in the UCS?
Coded character sets, as defined by the Unicode Character Encoding Model, map characters to nonnegative integers (e.g. LATIN SMALL LETTER A to 97, both by traditional ASCII and the UCS).
Note: There's a difference between characters and abstract…

djsp
- 2,174
- 2
- 19
- 40