Is there any reasonable way to access the contents of a CharacterSet?

Question

For a random string generator, I thought it would be nice to use CharacterSet as input type for the alphabet to use, since the pre-defined sets such as CharacterSet.lowercaseLetters are obviously useful (even if they may contain more diverse character sets than you'd expect).

However, apparently you can only query character sets for membership, but not enumerate, let alone index, them. All we get is _.bitmapRepresentation, an 8 KB chunk of data with an indicator bit for every (?) character. But even if you peel out individual bits by index i (which is less than nice, going through byte-oriented Data), Character(UnicodeScalar(i)) does not give the correct letter. Which means that the format is somewhat obscure -- and, of course, it's not documented.
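For reference, the NSCharacterSet documentation linked in the answers states that code point n is tested with `bitmapRep[n >> 3] & (1 << (n & 7))`, i.e. the least significant bit of each byte comes first. A minimal sketch of that lookup, limited to plane 0; `bitmapContains` is a made-up helper name, not a Foundation API:

```swift
import Foundation

// Hypothetical helper (not part of Foundation): tests membership of a
// plane-0 (BMP) code point by reading its bit from bitmapRepresentation.
func bitmapContains(_ set: CharacterSet, bmpCodePoint n: Int) -> Bool {
    precondition((0..<0x10000).contains(n), "only plane 0 is handled in this sketch")
    let bitmap = set.bitmapRepresentation
    // Eight code points per byte; be defensive about the length of the data.
    guard (n >> 3) < bitmap.count else { return false }
    let byte = bitmap[bitmap.startIndex + (n >> 3)]
    // Bit (n & 7) within that byte, least significant bit first.
    return byte & (1 << UInt8(n & 7)) != 0
}

let digits = CharacterSet(charactersIn: "0123456789")
print(bitmapContains(digits, bmpCodePoint: 0x30)) // code point of "0"
print(bitmapContains(digits, bmpCodePoint: 0x41)) // code point of "A"
```

Reading the bits in the opposite order within each byte would be one way to end up with the wrong letters.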

Of course we can iterate over all characters (per plane) but that is a bad idea, cost-wise: a 20-character set may require iterating over tens of thousands of characters. Speaking in CS terms: bit-vectors are a (very) bad implementation for sparse sets. Why they chose to make the trade-off in this way here, I have no idea.
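The brute-force approach described above could look roughly like the following sketch. `scalarsByBruteForce` is a name invented for illustration; it simply tests every Unicode scalar value for membership:

```swift
import Foundation

// Hypothetical helper: enumerates a set by testing every scalar value.
func scalarsByBruteForce(in set: CharacterSet) -> [Unicode.Scalar] {
    var result: [Unicode.Scalar] = []
    for value in UInt32(0)...0x10FFFF {
        // Unicode.Scalar(_: UInt32) returns nil for surrogate code points.
        if let scalar = Unicode.Scalar(value), set.contains(scalar) {
            result.append(scalar)
        }
    }
    return result
}

let vowels = scalarsByBruteForce(in: CharacterSet(charactersIn: "aeiou"))
print(vowels.map { String($0) })
```

Even for this five-member set, the loop body runs more than a million times, which is the cost concern raised in the question.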

Am I missing something here, or is CharacterSet just another dead end in the Foundation API?

This might be what you are looking for: [NSArray from NSCharacterset](http://stackoverflow.com/questions/15741631/nsarray-from-nscharacterset) – Despite the title, there is also Swift (2 + 3) code. — Martin R, Apr 10 '17 at 11:58
But note that `CharacterSet.lowercaseLetters` contains 1841 characters, not only from the Latin alphabet, but also Greek, Armenian, ..., as well as variants like double-strike letters ("") or ligatures ("ﬄ"). — Martin R, Apr 10 '17 at 12:08

Cœur · Answer 1 · 2018-09-02T12:57:27.640

Following the documentation, here is an improvement on Satachito's answer that supports non-contiguous planes by actually taking the plane index into account:

import Foundation

extension CharacterSet {
    func codePoints() -> [Int] {
        var result: [Int] = []
        var plane = 0
        // following documentation at https://developer.apple.com/documentation/foundation/nscharacterset/1417719-bitmaprepresentation
        for (i, w) in bitmapRepresentation.enumerated() {
            let k = i % 8193
            if k == 8192 {
                // plane index byte
                plane = Int(w) << 13
                continue
            }
            let base = (plane + k) << 3
            for j in 0 ..< 8 where w & 1 << j != 0 {
                result.append(base + j)
            }
        }
        return result
    }

    func printHexValues() {
        codePoints().forEach { print(String(format:"%02X", $0)) }
    }
}

Usage

print("whitespaces:")
CharacterSet.whitespaces.printHexValues()
print()
print("two characters from different planes:")
CharacterSet(charactersIn: "\u{1D6A8}\u{CC791}").printHexValues()

Results

whitespaces:
09
20
A0
1680
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
200A
200B
202F
205F
3000

two characters from different planes:
1D6A8
CC791

Performance

This is effectively 3 to 10 times faster than iterating over all characters; the comparison was made against the earlier answers at NSArray from NSCharacterset.

Satachito · Answer 2 · 2019-04-03T06:40:07.213

bitmapRepresentation has been documented.

https://developer.apple.com/documentation/foundation/nscharacterset/1417719-bitmaprepresentation

So you can iterate over that Data like this:

import Foundation

var offset = 0
for ( var i, w ) in CharacterSet.whitespaces.bitmapRepresentation.enumerated() {
    if i % 8193 == 8192 {
        offset += 1
        continue
    }
    i -= offset
    if w != 0 {
        for j in 0 ..< 8 {
            if w & ( 1 << j ) != 0 {
                print( String( format:"%02X", i * 8 + j ) )
            }
        }
    }
}

Result:

edited Apr 03 '19 at 06:40

answered Jul 05 '18 at 23:06


I've fixed the algorithm to partially account for the byte 0x01 for the plane index: it was previously failing for `CharacterSet.uppercaseLetters`. Yet, it's still not good for discontinuous planes. See my answer. – Cœur Sep 02 '18 at 02:41

score 2 · Accepted Answer · edited Apr 11 '17 at 19:16

By your definition, no, there is no "reasonable" way. That's just how NSCharacterSet stores it. It's optimized for testing membership, not enumerating all members.

Your loop can increment a counter over the codepoints, or it can shift the bits (one per codepoint), but either way you have to loop and test. The highest "Ll" character on my Mac is U+1D7CB (#120,779), so if you want to compute this list of characters at runtime, your code will have to loop at least that many times. See the Objective-C version of the documentation for details on how the bit vector is organized.
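A minimal sketch of such a loop-and-test approach, assuming `hasMember(inPlane:)` behaves as documented so that planes without any members can be skipped. The function name `scalars(in:)` is invented for this example:

```swift
import Foundation

// Hypothetical helper: loop over code points and test membership,
// skipping whole planes that the set reports as empty.
func scalars(in set: CharacterSet) -> [Unicode.Scalar] {
    var result: [Unicode.Scalar] = []
    for plane in UInt8(0)...16 where set.hasMember(inPlane: plane) {
        let start = UInt32(plane) << 16   // first code point of this plane
        for value in start...(start + 0xFFFF) {
            // Surrogate code points yield nil and are skipped.
            if let scalar = Unicode.Scalar(value), set.contains(scalar) {
                result.append(scalar)
            }
        }
    }
    return result
}

let lowercase = scalars(in: .lowercaseLetters)
print(lowercase.count)   // depends on the Unicode data shipped with the OS
```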

The good news is that this is fast. With unoptimized code on my 10-year-old Mac, it takes less than 1/10th of a second to find all 1,841 lowercaseLetters. If that's still not fast enough, it's easy to hide the cost by doing it once, in the background, at startup time.
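One way to pay that cost only once is a lazily initialized static property, sketched below together with the random string generator the question has in mind. `Alphabet` and `randomString(length:from:)` are hypothetical names, and `randomElement()` assumes Swift 4.2 or later:

```swift
import Foundation

// Hypothetical type: the alphabet is computed once, on first access
// (static stored properties are initialized lazily and thread-safely).
enum Alphabet {
    static let lowercase: [Character] = {
        let set = CharacterSet.lowercaseLetters
        var result: [Character] = []
        for value in UInt32(0)...0x10FFFF {
            if let scalar = Unicode.Scalar(value), set.contains(scalar) {
                result.append(Character(scalar))
            }
        }
        return result
    }()
}

// Hypothetical helper: draws `length` random characters from an alphabet.
func randomString(length: Int, from alphabet: [Character]) -> String {
    precondition(!alphabet.isEmpty, "alphabet must not be empty")
    return String((0..<length).map { _ in alphabet.randomElement()! })
}

print(randomString(length: 12, from: Alphabet.lowercase))
```

To hide the delay entirely, the first access to `Alphabet.lowercase` could be triggered from a background queue at startup.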

If your set is predefined, it would be even faster to hardcode the 1,841 values at compile time. — Cœur, Nov 30 '18 at 01:52
