2

I am writing a text editor which has an option to display a bullet in place of any invisible Unicode character. Unfortunately there appears to be no easy way to determine whether a Unicode character is invisible.

I need to find a text file containing every Unicode character so that I can look through it for invisible characters. Would anyone know where I can find such a file?

EDIT: I am writing this app in Cocoa for Mac OS X.

mdb
titaniumdecoy
  • By "invisible" do you mean a glyph that isn't available in the selected font? Or something else, like characters that are part of a composite? – Jason Coco Nov 20 '08 at 06:32
  • I mean characters that do not appear on the screen. I want to replace them with bullets so that users can tell that they are there. – titaniumdecoy Nov 20 '08 at 07:01
  • I added another answer which may help answer the other question... if not, let me know. – Jason Coco Nov 20 '08 at 09:11

7 Answers

3

Oh, I see... actual invisible characters ;) This FAQ will probably be useful:

http://www.unicode.org/faq/unsup_char.html

It lists the current invisible codepoints and has other information that you might find helpful.

EDIT: Added some Cocoa-specific information

Since you're using Cocoa, you can get the Unicode character set for control characters and compare against that:

NSCharacterSet* controlChars = [NSCharacterSet controlCharacterSet];

You might also want to take a look at the FAQ link I posted above and add any characters that you think you may need based on the information there to the character set returned by controlCharacterSet.

EDIT: Added an example of creating a Unicode string from a Unicode character

unichar theChar = 0x000D;
NSString* thestring = [NSString stringWithCharacters:&theChar length:1];
Jason Coco
  • Not all invisible characters are control characters. I think you would consider the zero-width characters invisible, but they aren't control characters. It also doesn't include the Unicode LINE SEPARATOR and PARAGRAPH SEPARATOR characters. – Peter Hosey Nov 20 '08 at 12:35
  • @Peter, Right, which is why I posted the FAQ first and suggested that the appropriate characters be added to the controlCharacterSet. – Jason Coco Nov 20 '08 at 19:01
1

Let me know if this code helps at all:

-(NSString*)stringByReplacingControlCharacters:(NSString*)originalString
{
    NSUInteger length = [originalString length];
    // Working buffer for the string's UTF-16 code units.
    unichar *strAsUnichar = (unichar*)malloc(length*sizeof(unichar));
    NSCharacterSet* controlChars = [NSCharacterSet controlCharacterSet];
    unichar bullet = 0x2022; // U+2022 BULLET

    // Use the ranged variant; plain getCharacters: can overrun the buffer.
    [originalString getCharacters:strAsUnichar range:NSMakeRange(0, length)];
    // Overwrite every control character with a bullet.
    for( NSUInteger i = 0; i < length; i++ ) {
        if( [controlChars characterIsMember:strAsUnichar[i]] )
            strAsUnichar[i] = bullet;
    }

    NSString* newString = [NSString stringWithCharacters:strAsUnichar length:length];
    free(strAsUnichar);

    return newString;
}

Important caveats:

This probably isn't the most efficient way of doing this, so you will have to decide how you want to optimize after you get it working. This only works with characters on the BMP; support for composed characters would have to be added if you have such a requirement. This does no error checking at all.

Jason Coco
  • I appreciate your posting that code. It may well come in handy. However, the problem is that I am quite certain that the controlCharacterSet is only a small subset of all invisible characters. – titaniumdecoy Nov 20 '08 at 20:26
  • You might be interested in the code I am currently using, which can be found here: http://stackoverflow.com/questions/300086/display-hidden-characters-in-nstextview – titaniumdecoy Nov 20 '08 at 20:27
  • It's actually not a small subset, but it's definitely not all the invisible characters. You can look at the Unicode site for the full list of various characters, but some invisible characters you probably *don't* want to turn into bullets, like joiners and such. – Jason Coco Nov 20 '08 at 23:11
0

A good place to start is the Unicode Consortium itself which provides a large body of data, some of which would be what you're looking for.

I'm also in the process of producing a DLL which you give a string and it gives back the UCNs of each character. But don't hold your breath.

bugmagnet
0

As of this writing, the current official Unicode version is 5.1.0, and text files describing all of its code points can be found at http://www.unicode.org/standard/versions/components-latest.html
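
One of those text files, UnicodeData.txt, has one semicolon-separated record per code point: the first field is the code point in hex, the second is the name, and the third is the general category. A minimal Java sketch of scanning such records for likely invisible characters follows. The class name `UnicodeDataScan`, the method name, and the choice of categories (Cc, Cf, Zl, Zp) are assumptions made for illustration, not an official definition of "invisible"; adjust the category list to taste.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: picks code points out of UnicodeData.txt-style records.
final class UnicodeDataScan {
    // Assumption: these general categories are the ones treated as "invisible".
    private static final List<String> INVISIBLE_CATEGORIES =
            Arrays.asList("Cc", "Cf", "Zl", "Zp");

    // Each line looks like "200D;ZERO WIDTH JOINER;Cf;0;BN;;;;;N;;;;;".
    static List<Integer> invisibleCodePoints(List<String> lines) {
        List<Integer> result = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.split(";", -1);
            if (fields.length < 3) {
                continue; // skip blank or malformed lines
            }
            if (INVISIBLE_CATEGORIES.contains(fields[2])) {
                result.add(Integer.parseInt(fields[0], 16));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
                "200D;ZERO WIDTH JOINER;Cf;0;BN;;;;;N;;;;;");
        for (int cp : invisibleCodePoints(sample)) {
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```

In practice the lines would come from the downloaded file (for example via `java.nio.file.Files.readAllLines`). Some large blocks appear in the file only as first/last range markers, so this sketch does not expand ranges.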

Alnitak
0

For Java, use java.lang.Character.getType. For C, use ICU's u_charType() or u_isgraph().
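
As a rough illustration of the Java route, the sketch below classifies a code point by its general category via `Character.getType` and swaps matching characters for a bullet (U+2022). The class and method names are hypothetical, and the set of categories counted as "invisible" (control, format, line separator, paragraph separator) is an assumption. It deliberately leaves ordinary spaces alone, and it does bullet joiners, which some editors may prefer to skip.

```java
// Hypothetical helper built on java.lang.Character.getType.
final class InvisibleCharacters {
    // Assumption: these general categories are the ones treated as "invisible".
    static boolean isInvisible(int codePoint) {
        switch (Character.getType(codePoint)) {
            case Character.CONTROL:             // Cc, e.g. tab, carriage return
            case Character.FORMAT:              // Cf, e.g. zero-width joiner
            case Character.LINE_SEPARATOR:      // Zl, U+2028
            case Character.PARAGRAPH_SEPARATOR: // Zp, U+2029
                return true;
            default:
                return false;
        }
    }

    // Walks the string by code point, so characters outside the BMP are kept intact.
    static String replaceInvisible(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints()
            .forEach(cp -> out.appendCodePoint(isInvisible(cp) ? 0x2022 : cp));
        return out.toString();
    }

    public static void main(String[] args) {
        // The sample contains a zero-width joiner between 'a' and 'b'.
        System.out.println(replaceInvisible("a\u200Db"));
    }
}
```

The results depend on the Unicode version bundled with the Java runtime, so a character reclassified in a newer Unicode release may be reported differently on older runtimes.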

Eugene Yokota
0

You might find this code to be of interest: http://gavingrover.blogspot.com/2008/11/unicode-for-grerlvy.html

Ray Tayek
-1

It's an impossible task; Unicode supports even Klingon, so it's not going to work. However, most text editors use the standard ANSI invisible characters. And if your Unicode library is good, it will support finding equivalent characters and/or categories; you can use these two features to do it as well as any editor out there.

Edit: Yes, I was being silly about Klingon support, but that doesn't make it not true... Of course Klingon is not supported by the Consortium; however, there is a movement to map the Klingon alphabet into Unicode's "Private Use Area" (U+F8D0 - U+F8FF). Link here for those interested :)

Note: Wonder what editor Klingon programmers use...

Robert Gould
  • The actual Unicode standard does not include fictional scripts - perhaps the Consortium will add them someday, but for now they have far more to worry about. But the parts of Unicode that are mapped are very well-defined, so there is a comprehensive list of invisible characters. – coppro Nov 20 '08 at 06:51
  • I think Robert was being facetious about support for Klingon. I realize that there are too many characters to make this approach feasible so I am looking for alternatives. – titaniumdecoy Nov 20 '08 at 07:10
  • Unicode doesn't support Klingon because everyone who writes Klingon does so w/ ASCII. The fancy Klingon characters aren't used in practice. But if I'm not mistaken, other imaginary scripts are supported. – Logomachist Nov 20 '08 at 07:12
  • Yeah like elvin (I don't remember the proper name for the script tho) – Jason Coco Nov 20 '08 at 07:24