NSArray from NSCharacterSet

Question

Currently I can build an array of the letters of the alphabet like this:

[[NSArray alloc]initWithObjects:@"A",@"B",@"C",@"D",@"E",@"F",@"G",@"H",@"I",@"J",@"K",@"L",@"M",@"N",@"O",@"P",@"Q",@"R",@"S",@"T",@"U",@"V",@"W",@"X",@"Y",@"Z",nil];

I know these letters are also available via

[NSCharacterSet uppercaseLetterCharacterSet]

How can I make an array out of it?

Why you need this. Or just for fun? If you can tell why you need it in array then it would be good. — Anoop Vaidya, Apr 01 '13 at 10:29
The uppercaseLetterCharacterSet contains a lot more than just A...Z. — CommaToast, Aug 12 '16 at 18:17

Martin R · Accepted Answer · 2016-09-03T11:00:45.847

55

The following code creates an array containing all characters of a given character set. It also works for characters outside of the "basic multilingual plane" (characters > U+FFFF, e.g. U+10400 DESERET CAPITAL LETTER LONG I).

NSCharacterSet *charset = [NSCharacterSet uppercaseLetterCharacterSet];
NSMutableArray *array = [NSMutableArray array];
for (int plane = 0; plane <= 16; plane++) {
    if ([charset hasMemberInPlane:plane]) {
        UTF32Char c;
        for (c = plane << 16; c < (plane+1) << 16; c++) {
            if ([charset longCharacterIsMember:c]) {
                UTF32Char c1 = OSSwapHostToLittleInt32(c); // To make it byte-order safe
                NSString *s = [[NSString alloc] initWithBytes:&c1 length:4 encoding:NSUTF32LittleEndianStringEncoding];
                [array addObject:s];
            }
        }
    }
}

For uppercaseLetterCharacterSet this gives an array of 1467 elements. Note, however, that characters > U+FFFF are stored as a UTF-16 surrogate pair in NSString, so for example U+10400 is actually stored in NSString as 2 characters, "\uD801\uDC00".
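The surrogate-pair remark can be checked with a minimal sketch, assuming Swift 5 with Foundation. The variable names are illustrative only:

```swift
import Foundation

// U+10400 DESERET CAPITAL LETTER LONG I, written with a scalar escape.
let deseret = "\u{10400}"

// Swift counts user-perceived characters; NSString counts UTF-16 code units.
let characterCount = deseret.count
let utf16Units = Array(deseret.utf16)
let nsLength = NSString(string: deseret).length

print(characterCount)  // 1
print(utf16Units)      // [55297, 56320], i.e. 0xD801 0xDC00
print(nsLength)        // 2
```

So an array element built from such a character has an NSString length of 2 even though it holds a single character.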

Swift 2 code can be found in other answers to this question. Here is a Swift 3 version, written as an extension method:

extension CharacterSet {
    func allCharacters() -> [Character] {
        var result: [Character] = []
        for plane: UInt8 in 0...16 where self.hasMember(inPlane: plane) {
            for unicode in UInt32(plane) << 16 ..< UInt32(plane + 1) << 16 {
                if let uniChar = UnicodeScalar(unicode), self.contains(uniChar) {
                    result.append(Character(uniChar))
                }
            }
        }
        return result
    }
}

Example:

let charset = CharacterSet.uppercaseLetters
let chars = charset.allCharacters()
print(chars.count) // 1521
print(chars) // ["A", "B", "C", ...]

(Note that some characters may not be present in the font used to display the result.)

edited Sep 03 '16 at 11:00

answered Apr 01 '13 at 11:32

Martin R


Thanks @Martin R I was having trouble with contains (unicodeScalar) bit. this is awesome :) – the Reverend Sep 03 '16 at 17:17
@MartinR I know that the question is about NSArray but I think you should post one for converting `CharacterSet` into a `Set` as well. It looks to me that it is around 10x faster to generate a set of characters instead of an array. Can you confirm that? – Leo Dabus Jan 20 '22 at 14:38
I will test in a real project as well to make sure but seems much faster when using a playground – Leo Dabus Jan 20 '22 at 14:43
@MartinR also if the order doesn't matter creating an array from the set seems the best choice. – Leo Dabus Jan 20 '22 at 16:14
@LeoDabus: Cannot confirm: `CharacterSet.letters.allCharacters()` takes approx 0.0064 seconds on my computer, that set has 127391 elements. With a set as result type, the time is 0.072 seconds. And that is what I would expect: appending to an array is faster than inserting into a set. – Tested with Release configuration in a compiled Xcode project. – Martin R Jan 20 '22 at 16:21
@MartinR Thats crazy. Do you have any idea why it takes so long in a playground? – Leo Dabus Jan 20 '22 at 16:25
@LeoDabus: I do not know the details, but it is known that Playgrounds are sloooow. The code is not compiled with optimization (AFAIK) and executed in a special way, so that all intermediate results can be displayed in the results bar. I cannot tell why using a set makes it faster in a Playground. – Martin R Jan 20 '22 at 19:41
@MartinR OK thanks – Leo Dabus Jan 20 '22 at 19:46

Cœur · Answer 2 · 2022-01-21T02:05:07.107

Inspired by Satachito's answer, here is a fast way to build an Array from a CharacterSet using bitmapRepresentation:

extension CharacterSet {
    func characters() -> [Character] {
        // A Unicode scalar is any Unicode code point in the range U+0000 to U+D7FF inclusive or U+E000 to U+10FFFF inclusive.
        return codePoints().compactMap { UnicodeScalar($0) }.map { Character($0) }
    }
    
    func codePoints() -> [Int] {
        var result: [Int] = []
        var plane = 0
        // following documentation at https://developer.apple.com/documentation/foundation/nscharacterset/1417719-bitmaprepresentation
        for (i, w) in bitmapRepresentation.enumerated() {
            let k = i % 0x2001
            if k == 0x2000 {
                // plane index byte
                plane = Int(w) << 13
                continue
            }
            let base = (plane + k) << 3
            for j in 0 ..< 8 where w & 1 << j != 0 {
                result.append(base + j)
            }
        }
        return result
    }
}

Example for uppercaseLetters

let charset = CharacterSet.uppercaseLetters
let chars = charset.characters()
print(chars.count) // 1733
print(chars) // ["A", "B", "C", ...]

Example for discontinuous planes

let charset = CharacterSet(charactersIn: "\u{1D6A8}\u{CC791}") // U+1D6A8 (plane 1) and U+CC791 (plane 12)
let codePoints = charset.codePoints()
print(codePoints) // [120488, 837521]

Performance

Very good, depending on the data and usage: built in Release mode, this bitmapRepresentation solution seems 2 to 10 times faster than Martin R's solution with contains or Oliver Atkinson's solution with longCharacterIsMember.

Be sure to benchmark against your own needs. Performance is best compared in a non-debug build, so avoid comparing in a Playground.

Sorry to inform you that Martin R answer is much faster than this approach which is about 30-35% slower. — Leo Dabus, Jan 20 '22 at 13:29
@LeoDabus Maybe it depends on the data. I'll edit to be more neutral regarding performances. — Cœur, Jan 20 '22 at 14:12
@Cœur: I can confirm that your function is significantly faster than mine in a compiled command line project in Release mode, about twice as fast for `CharacterSet.letters.characters()`. — Martin R, Jan 20 '22 at 19:46

score 10 · Answer 3 · edited Sep 02 '18 at 06:22

10

Since characters have a limited, finite (and not too wide) range, you can just test which characters are members of a given character set (brute force):

#include <limits.h> // CHAR_BIT

// unichar has no predefined maximum, so define one past its largest value
#define UNICHAR_MAX (1ull << (CHAR_BIT * sizeof(unichar)))

NSData *data = [[NSCharacterSet uppercaseLetterCharacterSet] bitmapRepresentation];
const uint8_t *ptr = [data bytes];
NSMutableArray *allCharsInSet = [NSMutableArray array];
// The bit test follows Apple's sample code. The counter is wider than unichar:
// a unichar counter would wrap from 0xFFFF to 0 and the loop would never end.
for (uint32_t i = 0; i < UNICHAR_MAX; i++) {
    if (ptr[i >> 3] & (1u << (i & 7))) {
        unichar ch = (unichar)i;
        [allCharsInSet addObject:[NSString stringWithCharacters:&ch length:1]];
    }
}

Remark: Due to the size of a unichar and the structure of the additional segments in bitmapRepresentation, this solution only works for characters <= 0xFFFF and is not suitable for higher planes.
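For comparison, the same BMP-only bit test can be sketched in Swift. This is a minimal illustration, assuming Swift 5 with Foundation; the helper name bmpBitmapContains is hypothetical and, like the code above, it only looks at the first 8 KB of the bitmap:

```swift
import Foundation

/// Tests one BMP code unit (0...0xFFFF) against the first 8 KB of a
/// character set's bitmap representation. Hypothetical helper name.
func bmpBitmapContains(_ bitmap: Data, codeUnit: UInt16) -> Bool {
    let index = Int(codeUnit >> 3)                 // byte holding the bit
    guard index < bitmap.count else { return false }
    let byte = bitmap[bitmap.startIndex + index]
    return byte & (1 << UInt8(codeUnit & 7)) != 0  // bit within that byte
}

let bitmap = CharacterSet.uppercaseLetters.bitmapRepresentation
print(bmpBitmapContains(bitmap, codeUnit: 0x41)) // "A": true
print(bmpBitmapContains(bitmap, codeUnit: 0x61)) // "a": false
```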

edited Sep 02 '18 at 06:22

Cœur


answered Apr 01 '13 at 10:29

oooppppssssss. To understand this code, we need 50K+ reputations. People will get scared by this code. – Anoop Vaidya Apr 01 '13 at 10:33
@H2CO3, I thought i am just not knowing an existence of a method to call on NSCharacterSet or NSString to do this job with a one line statement. Looks like it is truly not exists. Good to see the possibility from your response. Thanks. – Saran Apr 01 '13 at 11:56
Remark: This works only for characters <= 0xFFFF. The `uppercaseLetterCharacterSet` contains 1467 characters, this method gives only the first 871 characters. – Martin R Apr 01 '13 at 12:02
@MartinR Right, at least as long as `unichar` is two ~~bytes~~ octets long (which it is on iOS and OS X). – Apr 01 '13 at 12:04
@H2CO3: `NSCharacterSet` works also with characters outside the BMP, even if `NSString` uses `unichar` internally. – Martin R Apr 01 '13 at 12:09
Personally I think that the OP doesn't even know what's he' asking because getting a list of all the `Lu` and `Lt` characters doesn't have a real use. – Sulthan Apr 02 '13 at 09:12
@Sulthan Yes, that's quite possible. But anyways, he got what he asked for :) Better be technically correct than make wrong assumptions. – Apr 02 '13 at 09:14
Can't you use `[charSet characterIsMember:]` to check if `unichar` is in the set? – Arc676 Dec 29 '14 at 05:29
Needs a lot of memory! My device runs out of memory! – Abdurrahman Mubeen Ali May 28 '15 at 07:21

felipou · Answer 4 · 2016-01-14T19:18:30.320

4

I created a Swift (v2.1) version of Martin R's algorithm:

let charset = NSCharacterSet.URLPathAllowedCharacterSet()

for plane: UInt8 in 0...16 {
    if charset.hasMemberInPlane(plane) {
        // C-style loop, as allowed in Swift 2.1
        for var c: UInt32 = UInt32(plane) << 16; c < (UInt32(plane) + 1) << 16; c++ {
            if charset.longCharacterIsMember(c) {
                var c1 = c.littleEndian // To make it byte-order safe
                let s = NSString(bytes: &c1, length: 4, encoding: NSUTF32LittleEndianStringEncoding)
                NSLog("Char: \(s)")
            }
        }
    }
}

edited Jan 14 '16 at 19:18

answered Nov 25 '15 at 13:24

felipou


`c1` is unlikely to work as `let` because of in-out `&`, should probably be `var` – Desmond Hume Jan 13 '16 at 18:13
You're right, I fixed it. But I was sure I had tested this before... Well, anyway, it's correct as of Swift 2.1.1, just tested it (`Apple Swift version 2.1.1 (swiftlang-700.1.101.15 clang-700.1.81)`) – felipou Jan 14 '16 at 19:20
Now that explains it! How could I not see that? Well, thanks for pointing it out @DesmondHume :) – felipou Jan 16 '16 at 13:37
@felipou: I apologize for the confusion. I wanted to add the (Swift equivalent of) OSSwapHostToLittleInt32 and then made some errors. Everything should be correct now. – Martin R Jan 18 '16 at 17:52
No problem @MartinR, I understand, it's much better this way. Thanks for the contribution :) – felipou Jan 19 '16 at 18:41
longCharacterIsMember appears to be gone for Swift 3 – David James Aug 23 '16 at 16:14
Nevermind, just use `(characterSet as NSCharacterSet).longCharacterIsMember(c)` in Swift 3 (Xcode 8 Beta 6) – David James Aug 23 '16 at 16:26

Oliver Atkinson · Answer 5 · 2017-03-20T13:01:01.387

This version uses somewhat more idiomatic Swift.

let characters = NSCharacterSet.uppercaseLetterCharacterSet()
var array      = [String]()

for plane: UInt8 in 0...16 where characters.hasMemberInPlane(plane) {

  for character: UTF32Char in UInt32(plane) << 16..<(UInt32(plane) + 1) << 16 where characters.longCharacterIsMember(character) {

    var endian = character.littleEndian
    let string = NSString(bytes: &endian, length: 4, encoding: NSUTF32LittleEndianStringEncoding) as! String

    array.append(string)

  }

}

print(array)

score 2 · Answer 6 · answered Mar 09 '18 at 13:33

I found Martin R's solution to be too slow for my purposes, so I solved it another way using CharacterSet's bitmapRepresentation property.

This is significantly faster according to my benchmarks:

// Wrapped in a function so that `return` is valid; the function name is illustrative.
func codePointRanges(in characterSet: CharacterSet) -> [CountableClosedRange<UInt32>] {
    var ranges = [CountableClosedRange<UInt32>]()
    let bitmap: Data = characterSet.bitmapRepresentation
    var first: UInt32?, last: UInt32?
    // Plane index bytes are counted, not read, so planes are assumed contiguous.
    var plane = 0, nextPlane = 8192
    for (j, byte) in bitmap.enumerated() where byte != 0 {
        if j == nextPlane {
            plane += 1
            nextPlane += 8193
            continue
        }
        for i in 0 ..< 8 where byte & 1 << i != 0 {
            let codePoint = UInt32(j - plane) * 8 + UInt32(i)
            if let _last = last, codePoint == _last + 1 {
                last = codePoint
            } else {
                if let first = first, let last = last {
                    ranges.append(first ... last)
                }
                first = codePoint
                last = codePoint
            }
        }
    }
    if let first = first, let last = last {
        ranges.append(first ... last)
    }
    return ranges
}

This solution returns an array of codePoint ranges, but you can easily adapt it to return individual characters or strings, etc.
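One way to adapt an array of code point ranges to individual characters is sketched below. This is a minimal illustration assuming Swift 5; the function name characters(in:) and the sample ranges are hypothetical, and code points that are not valid Unicode scalars (the surrogate range) are skipped:

```swift
/// Expands code point ranges into characters, skipping values that are not
/// valid Unicode scalars. Hypothetical helper name.
func characters(in ranges: [ClosedRange<UInt32>]) -> [Character] {
    var result: [Character] = []
    for range in ranges {
        for value in range {
            if let scalar = Unicode.Scalar(value) {
                result.append(Character(scalar))
            }
        }
    }
    return result
}

// Sample input standing in for the ranges computed above.
let letters = characters(in: [0x41...0x43, 0x61...0x62])
print(letters) // ["A", "B", "C", "a", "b"]
```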

Actually, there is a significant error in your algorithm: it will not support `CharacterSet(charactersIn: "")` because you do not read the value of the plane index byte (you wrongly assumed they were continous). See https://stackoverflow.com/a/52133647/1033581 for how I did it. — Cœur, Sep 02 '18 at 06:42

score 1 · Answer 7 · answered Nov 02 '17 at 15:45

You should not; this is not the purpose of a character set. A NSCharacterSet is a possibly-infinite set of characters, possibly in not-yet-invented code points. All you want to know is "Is this character or collection of characters in this set?", and to that end it is useful.

Imagine this Swift code:

let asciiCodepoints = Unicode.Scalar(0x00)...Unicode.Scalar(0x7F)
let asciiCharacterSet = CharacterSet(charactersIn: asciiCodepoints)
let nonAsciiCharacterSet = asciiCharacterSet.inverted

Which is analogous to this Objective-C code:

NSRange asciiCodepoints = NSMakeRange(0x00, 0x80); // location 0, length 128: U+0000 through U+007F
NSCharacterSet * asciiCharacterSet = [NSCharacterSet characterSetWithRange:asciiCodepoints];
NSCharacterSet * nonAsciiCharacterSet = asciiCharacterSet.invertedSet;

It's easy to say "loop over all the characters in asciiCharacterSet"; that would just loop over all characters from U+0000 through U+007F. But what does it mean to loop over all the characters in nonAsciiCharacterSet? Do you start at U+0080? Who's to say there won't be negative codepoints in the future? Where do you end? Do you skip non-printable characters? What about extended grapheme clusters? Since it's a set (where order doesn't matter), can your code handle out-of-order codepoints in this loop?

These are questions you don't want to answer here; functionally nonAsciiCharacterSet is infinite, and all you want to use it for is to tell if any given character lies outside the set of ASCII characters.
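A minimal sketch of that usage, assuming Swift 5 with Foundation (the helper name isNonASCII is illustrative): the inverted set is only ever queried, never enumerated.

```swift
import Foundation

// Build the ASCII set from a scalar range, then invert it.
let asciiRange: ClosedRange<Unicode.Scalar> = "\u{0}"..."\u{7F}"
let nonASCII = CharacterSet(charactersIn: asciiRange).inverted

/// Membership is the only question asked of the inverted set.
func isNonASCII(_ scalar: Unicode.Scalar) -> Bool {
    return nonASCII.contains(scalar)
}

print(isNonASCII("A"))      // false
print(isNonASCII("\u{80}")) // true
```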

The question you should really be asking yourself is: "What do I want to accomplish with this array of capital letters?" If (and likely only if) you really need to iterate over it in order, putting the ones you care about into an Array or String (perhaps one read in from a resource file) is probably the best way. If you want to check whether a character is part of the set of uppercase letters, then you don't care about order or even how many characters are in the set, and should use CharacterSet.uppercaseLetters.contains(foo) (in Objective-C: [NSCharacterSet.uppercaseLetterCharacterSet characterIsMember:foo], or longCharacterIsMember: for characters outside the BMP).

Think, too, about non-Latin characters. CharacterSet.uppercaseLetters covers Unicode General Categories Lu and Lt, which contain A through Z and also things like ǅ and Խ. You don't want to have to think about this. You definitely don't want to issue an update to your app when the Unicode Consortium adds new characters to this list. If what you want to do is decide whether something is upper-case, don't bother hard-coding anything.
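As a minimal sketch of the membership check recommended here, assuming Swift 5 with Foundation (the helper name and the sample string are illustrative):

```swift
import Foundation

/// Asks the set directly instead of enumerating it.
func isUppercaseLetter(_ scalar: Unicode.Scalar) -> Bool {
    return CharacterSet.uppercaseLetters.contains(scalar)
}

let sample = "\u{C9}t\u{E9}" // "Été", written with escapes to pin the scalars
let flags = sample.unicodeScalars.map { isUppercaseLetter($0) }
print(flags) // [true, false, false]
```

No list of letters is hard-coded, so the check keeps following whatever the system's Unicode data classifies as uppercase.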

A CharacterSet, by its struct definition, is finite: it has at most 17 planes of 8192 endpoints. — Cœur, Sep 02 '18 at 10:24
@Cœur has it always been like that? Will it always be like that? What if an 18th plane is needed? Can you provide official documentation promising all this? — Ky -, Sep 04 '18 at 17:46

score 0 · Answer 8 · answered Aug 12 '16 at 18:20

For just A-Z of the Latin alphabet (nothing with Greek, or diacritical marks, or other things that were not what the guy asked for):

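let characters = NSCharacterSet.uppercaseLetterCharacterSet()
var array = [String]()

for plane: UInt8 in 0...16 where characters.hasMemberInPlane(plane) {
    // Half-open range: the closed range would also visit the first code point of the next plane
    for character: UTF32Char in UInt32(plane) << 16..<(UInt32(plane) + 1) << 16 where characters.longCharacterIsMember(character) {
        var endian = character.littleEndian
        let string = NSString(bytes: &endian, length: 4, encoding: NSUTF32LittleEndianStringEncoding) as! String
        array.append(string)
        // The first 26 members of the set are A through Z
        if array.count == 26 {
            break
        }
    }
    if array.count == 26 {
        break
    }
}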

If you know there is going to be 26 characters, then you're not working with an arbitrary character set, which means you can optimize it in speed and in length with just `return ["A","B","C","D","E","F","G","H","I","J","K","L","M","N","O","P","Q","R","S","T","U","V","W","X","Y","Z"]` — Cœur, Sep 02 '18 at 10:20

Paul B · Answer 9 · 2019-09-20T11:25:21.513

You can of course create sets of characters and alphabets using CharacterSet like this:

var smallEmojiCharacterSet = CharacterSet(charactersIn:  Unicode.Scalar("")...Unicode.Scalar(""))

The problem is that CharacterSet is NOT a Set (though it conforms to SetAlgebra); it is rather a Unicode character set. That makes it hard to get a sequence of all its characters in order to convert it to an Array, a Set or a String. I have found a solution, but a better one exists. Actually, what you want is to stride from character to character, to have a range "a"..."z". It is not hard to do at the scalar level. At the Character level there are more caveats to consider.

extension Unicode.Scalar: Strideable {
    public typealias Stride = Int

    public func distance(to other: Unicode.Scalar) -> Int {
        return Int(other.value) - Int(self.value)
    }

    public func advanced(by n: Int) -> Unicode.Scalar {
        return Unicode.Scalar(UInt32(Int(value) + n))!
    }
}


let alphabetScalarRange = (Unicode.Scalar("a")...Unicode.Scalar("z"))// ClosedRange<Unicode.Scalar>

let alphabetCharactersArr = Array(alphabetScalarRange.map(Character.init)) // Array of Characters from range
let alphabetStringsArr = Array(alphabetScalarRange.map(String.init)) // Array of Strings from range
let alphabetString = alphabetStringsArr.joined() // String (collection of characters) from range
// or simply
let uppercasedAlphabetString =  (("A" as Unicode.Scalar)..."Z").reduce("") { (r, us) -> String in
    r + String(us)
}

If you think making an extension is overkill:

let alphabetScalarValueRange = (Unicode.Scalar("a").value...Unicode.Scalar("z").value)
let alphabetStringsArr2 = Array(alphabetScalarValueRange.compactMap{ Unicode.Scalar($0)?.escaped(asASCII: false) })
let alphabetString2 = alphabetScalarValueRange.compactMap({ Unicode.Scalar($0)?.escaped(asASCII: false) }).joined(separator: ", ")

But be careful: Characters can consist of several scalars.
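A minimal sketch of that caveat, assuming Swift 5 (the variable names are illustrative): the same visible character can be one scalar or two, so striding over scalars never produces the two-scalar form.

```swift
let precomposed = "\u{E9}"  // "é" as a single scalar
let decomposed = "e\u{301}" // "e" followed by a combining acute accent

print(precomposed.count, precomposed.unicodeScalars.count) // 1 1
print(decomposed.count, decomposed.unicodeScalars.count)   // 1 2
print(precomposed == decomposed)                           // true
```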
