Questions tagged [noncharacter]

For questions about Unicode non-characters - the code points that are guaranteed to never be used for a character in the Unicode Standard. When using this tag also tag the language that you are coding in for context if applicable. Where possible include the more generic [unicode] tag too.

14 questions
102
votes
3 answers

Really Good, Bad UTF-8 example test data

So we have the XSS cheat sheet to test our XSS filtering - but other than an example benign page I can't find any evil or malformed test data to make sure that my UTF-8 code can handle misbehaving data. Where can I find some good uh.. bad data to…
Xeoncross
  • 55,620
  • 80
  • 262
  • 364
56
votes
2 answers

What's the purpose of the noncharacters U+FDD0 to U+FDEF?

U+FFFE needs to be a noncharacter in order to allow the Byte Order Mark to work. U+FFFF is described in The Unicode Standard as "useful for internal purposes as sentinels". Makes sense. But I can't figure out, and The Unicode Standard doesn't…
dan04
  • 87,747
  • 23
  • 163
  • 198
26
votes
4 answers

Can a valid Unicode string contain FFFF? Is Java/CharacterIterator broken?

Here's an excerpt from java.text.CharacterIterator documentation: This interface defines a protocol for bidirectional iteration over text. The iterator iterates over a bounded sequence of characters. [...] The methods previous() and next() are…
polygenelubricants
  • 376,812
  • 128
  • 561
  • 623
3
votes
1 answer

What is a "noncharacter" in unicode?

I don't know what "noncharacter" characters are. They are forbidden Unicode characters, though I can copy and paste them, like U+FFFF (). If a character has a fixed position in Unicode, and can be used to display something, then: Why are those…
1
vote
4 answers

Unicode Noncharacters

Is there a good resource for finding the last two characters of each plane, particularly planes 3–13? Obviously 0xFFFE and 0xFFFF are noncharacters, as well as 0x10FFFE and 0x10FFFF, but I can't find a complete list as to where the last characters…
Joe Caraccio
  • 1,899
  • 3
  • 24
  • 41
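A minimal sketch of the complete list asked about above, assuming the standard definition of the 66 noncharacters: the 32 code points U+FDD0 to U+FDEF, plus the last two code points (U+nFFFE and U+nFFFF) of each of the 17 planes. The function name `noncharacters` is hypothetical and chosen only for illustration.

```python
def noncharacters():
    """Return all Unicode noncharacter code points in ascending order.

    Assumption: the set is U+FDD0..U+FDEF plus the last two code
    points of every plane (0 through 16), 66 code points in total.
    """
    # Contiguous block of 32 noncharacters inside the BMP.
    points = list(range(0xFDD0, 0xFDF0))
    # Last two code points of each plane: U+nFFFE and U+nFFFF.
    for plane in range(17):
        base = plane << 16
        points.extend((base | 0xFFFE, base | 0xFFFF))
    return points


if __name__ == "__main__":
    for cp in noncharacters():
        print(f"U+{cp:04X}")
```

For planes 3 through 13 this yields U+3FFFE, U+3FFFF, and so on up to U+DFFFE and U+DFFFF, so no lookup table is needed.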
1
vote
0 answers

Detecting non-character Unicode characters

I'm working on an application that eventually reads and prints arbitrary and untrusted Unicode characters to the screen. There are a number of ways to wreak havoc using Unicode strings, and I would like my program to behave correctly for…
zneak
  • 134,922
  • 42
  • 253
  • 328
1
vote
1 answer

Why are certain characters prohibited in the HTML5 spec?

According to the HTML5 spec (just after the table), the following characters are prohibited: Otherwise, return a character token for the Unicode character whose code point is that number. Additionally, if the number is in the range 0x0001 to…
Daniel Fath
  • 16,453
  • 7
  • 47
  • 82
0
votes
1 answer

How can I get a 'Group Separator' (0x1D) character from a TextBox, RichTextBox, etc. in C#?

I use a USB 2D barcode scanner to scan a GS1 DataMatrix and key in the barcode text via USB to a computer, like a keyboard. The text uses the 'Group Separator' character (0x1D) as a delimiter. When I put the cursor in a Hex/Text editor and then scan, the 'Group…
0
votes
1 answer

How do I go about converting text with control characters to properly formatted text in IntelliJ?

I'm trying to take some text in a format where all the spacing, tabs, and newlines (control characters, NPCs) are present, and have it output to a file in IntelliJ formatted as those control characters dictate. I may be going about…
RatavaWen
  • 147
  • 1
  • 8
0
votes
1 answer

Is this Google Closure UTF-8 string valid?

In the Google Closure UTF-8 to byte array tests is the string \u0000\u007F\u0080\u07FF\u0800\uFFFF which is supposed to be converted to the array [0x00, 0x7F, 0xC2, 0x80, 0xDF, 0xBF, 0xE0, 0xA0, 0x80, 0xEF, 0xBF, 0xBF] I've tried a few other…
James McLachlan
  • 1,368
  • 13
  • 27
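As a quick cross-check of the byte array quoted above, here is a sketch that encodes the same string with Python's built-in UTF-8 codec. It only shows what one encoder produces; it does not by itself settle whether a given consumer will accept U+FFFF. The variable names are illustrative.

```python
# The string from the excerpt: boundary code points of the 1-, 2-, and
# 3-byte UTF-8 forms, ending with the noncharacter U+FFFF.
text = "\u0000\u007F\u0080\u07FF\u0800\uFFFF"

# The byte array quoted in the excerpt.
expected = bytes([0x00, 0x7F, 0xC2, 0x80, 0xDF, 0xBF,
                  0xE0, 0xA0, 0x80, 0xEF, 0xBF, 0xBF])

encoded = text.encode("utf-8")

# Show the bytes in hex and compare against the quoted array.
print(" ".join(f"{b:02X}" for b in encoded))
print("matches quoted array:", encoded == expected)

# Decoding the bytes gives back the original string, U+FFFF included.
print("round-trips:", encoded.decode("utf-8") == text)
```

Python's codec rejects lone surrogates but encodes noncharacters such as U+FFFF (as EF BF BF), which is consistent with the array in the excerpt.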
0
votes
1 answer

Strip invalid and noncharacters from utf8

I'm loading some data, processing it, then sending data to an application which (fair enough) doesn't allow the invalid utf8 noncharacters U+FDD0 through U+FDEF, as well as the invalid U+FFFE and U+FFFF special characters. My raw data is out of my…
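One possible approach to the stripping described above, sketched under two assumptions: malformed UTF-8 byte sequences should be dropped silently rather than replaced, and the noncharacters at the end of every plane should be removed along with U+FDD0 to U+FDEF, U+FFFE, and U+FFFF. The names `clean_utf8` and `_NONCHAR_RE` are hypothetical.

```python
import re

# Character class covering U+FDD0..U+FDEF plus U+nFFFE and U+nFFFF
# for every plane 0..16 (assumption: all 66 noncharacters are unwanted).
_PLANE_ENDS = "".join(
    chr((plane << 16) | low)
    for plane in range(17)
    for low in (0xFFFE, 0xFFFF)
)
_NONCHAR_RE = re.compile("[\uFDD0-\uFDEF" + _PLANE_ENDS + "]")


def clean_utf8(raw: bytes) -> str:
    """Decode raw bytes as UTF-8, dropping malformed sequences,
    then remove every noncharacter code point."""
    # errors="ignore" discards bytes that are not well-formed UTF-8;
    # use errors="replace" instead to substitute U+FFFD.
    text = raw.decode("utf-8", errors="ignore")
    return _NONCHAR_RE.sub("", text)


if __name__ == "__main__":
    sample = b"ok\xff " + "a\uFDD0b\uFFFFc".encode("utf-8")
    print(clean_utf8(sample))
```

Dropping data silently can hide upstream problems, so logging or replacing with U+FFFD may be preferable depending on what the receiving application expects.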
0
votes
1 answer

Which Unicode code points can be used safely as reserved values?

Background: I am writing a DFA-based regex parser. For performance reasons, I need to use a dictionary [Unicode.Scalar : State] to map the next states. Now I need a bunch of special Unicode values to represent special character expressions like .,…
dawnstar
  • 507
  • 5
  • 10
0
votes
1 answer

Why are the two last points on supplemental PUAs excluded?

The supplemental PUAs (F0000-FFFFD and 100000-10FFFD) explicitly exclude FFFFE, FFFFF, 10FFFE and 10FFFF by defining them as noncharacters. Why was this done? Without this they would be nice 65536-point blocks.
skyking
  • 13,817
  • 1
  • 35
  • 57
0
votes
1 answer

Which nonnegative integers aren't assigned a character in the UCS?

Coded character sets, as defined by the Unicode Character Encoding Model, map characters to nonnegative integers (e.g. LATIN SMALL LETTER A to 97, both by traditional ASCII and the UCS). Note: There's a difference between characters and abstract…
djsp
  • 2,174
  • 2
  • 19
  • 40