Detecting non-character Unicode characters

Question

I'm working on an application that eventually reads arbitrary, untrusted Unicode characters and prints them to the screen.

There are a number of ways to wreak havoc using Unicode strings, and I would like my program to behave correctly for "dangerous" strings. For instance, the right-to-left override character (U+202E) makes strings look like they're backwards.

Since the audience is mostly programmers, my solution would be to first normalize the string to Normalization Form C (NFC), and then replace anything that's not a printable character on its own with its Unicode code point in the form \uXXXXXX. (The intent is not to have a perfectly accurate representation of the string, only a mostly good one. The full string data is still available.)

My problem, then, is determining what's an actual printable character and what's a non-printable character. Swift has a Character type, but unlike, say, Java's Character class, the Swift one doesn't seem to have any method to find out the classification of a character.

How could I carry out that plan? Is there anything else I should consider?
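For concreteness, the plan above could be sketched roughly as follows. This assumes Swift 5 or later, where `Unicode.Scalar.Properties.generalCategory` is available, and Foundation for `precomposedStringWithCanonicalMapping` (NFC). The names `isSuspicious` and `escapedForDisplay` are hypothetical, and the choice of categories to escape is an assumption, not an authoritative definition of "non-printable" (Unicode itself does not define one).

```swift
import Foundation

// Hypothetical helper: decides whether a scalar should be escaped.
// Assumption: controls (Cc), format characters (Cf, which includes the
// RTL override U+202E), private-use (Co), unassigned code points and
// noncharacters (Cn), and the line/paragraph separators are "suspicious".
func isSuspicious(_ scalar: Unicode.Scalar) -> Bool {
    switch scalar.properties.generalCategory {
    case .control, .format, .privateUse, .unassigned,
         .lineSeparator, .paragraphSeparator:
        return true
    default:
        return false
    }
}

// Hypothetical helper: normalizes to NFC, then replaces each suspicious
// scalar with \uXXXXXX (six uppercase hex digits, zero-padded).
func escapedForDisplay(_ input: String) -> String {
    let normalized = input.precomposedStringWithCanonicalMapping
    var output = ""
    for scalar in normalized.unicodeScalars {
        if isSuspicious(scalar) {
            let hex = String(scalar.value, radix: 16, uppercase: true)
            let padding = String(repeating: "0", count: max(0, 6 - hex.count))
            output += "\\u" + padding + hex
        } else {
            output.unicodeScalars.append(scalar)
        }
    }
    return output
}

// Example: the RTL override is escaped, ordinary letters are kept.
print(escapedForDisplay("abc\u{202E}def"))
```

Note that tab and newline are control characters and would be escaped by this sketch; they would need to be special-cased if they should pass through. The classification also depends on the Unicode version the Swift runtime ships with, so recently assigned characters may be reported as unassigned on older systems.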

Is "anything not ASCII" too broad? You can always parse the latest main Unicode data file (usually called `unicode.txt` if memory serves) and compile a few lists from it yourself. Oddities such as "not a valid character" and "not displayable" are clearly marked in it. — Jongware, Jun 22 '15 at 21:22
@Jongware, 'anything not ASCII' would probably work, but as a native French speaker, I always find the approach really crude and lazy. — zneak, Jun 22 '15 at 21:23
By the way, you cannot consider Arabic and Hebrew "safe to display" but the RTL marker *not*. (Perhaps then only escape it when it is not actually followed by an RTL language.) — Jongware, Jun 22 '15 at 21:23
Found it: http://www.unicode.org/Public/UNIDATA/UnicodeData.txt. I've used older versions for other purposes; it contains lots of abbreviations, but is really well worth looking into. — Jongware, Jun 22 '15 at 21:27
You should read [Chapter 23](http://www.unicode.org/versions/Unicode7.0.0/ch23.pdf) of the Unicode Standard, which describes some of the characters you may want to call "dangerous". Unicode does not have a concept for what is or is not "printable", but you may have some luck identifying problematic characters by looking at their rendered bounding box. — 一二三, Jun 23 '15 at 02:29
