56

How can I extract the Unicode code point(s) of a given Character without first converting it to a String? I know that I can use the following:

let ch: Character = "A"
let s = String(ch).unicodeScalars
s[s.startIndex].value // returns 65

but it seems like there should be a more direct way to accomplish this using just Swift's standard library. The Language Guide sections "Working with Characters" and "Unicode" only discuss iterating through the characters in a String, not working directly with Characters.

nathan
  • A `Character` in Swift is not necessarily a single Unicode code point. E.g. `let ch: Character = "e\u{0308}"`. In general code points and characters are different concepts, and you shouldn't confuse one for the other. – bames53 Aug 08 '14 at 15:02
  • 1
    @bames53 I am aware of that; however, there should be a way to extract the list of code points from a `Character` without first converting it to a `String`. – nathan Aug 09 '14 at 15:50
  • Check out [this online conversion tool by Richard Ishida](http://r12a.github.io/apps/conversion/) and investigate how he did it. – WoodrowShigeru Jun 27 '16 at 20:15

6 Answers

35

From what I can gather from the documentation, they want you to get Character values from a String because it gives context. Is this Character encoded with UTF-8, UTF-16, or as 21-bit code points (scalars)?

If you look at how a Character is defined in the Swift framework, it is actually an enum value. This is probably done due to the various representations from String.utf8, String.utf16, and String.unicodeScalars.

It seems they do not expect you to work with Character values directly, but rather with Strings; you as the programmer decide how to extract these from the String itself, which preserves the encoding.
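
For instance, here is a minimal sketch (in current Swift syntax) of the three views the standard library exposes on a String:

let s = "A"
for codeUnit in s.utf8 { print(codeUnit) }             // 65 (UTF-8 code units, UInt8)
for codeUnit in s.utf16 { print(codeUnit) }            // 65 (UTF-16 code units, UInt16)
for scalar in s.unicodeScalars { print(scalar.value) } // 65 (21-bit scalars as UInt32)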

That said, if you need to get the code points in a concise manner, I would recommend an extension like this:

extension Character {
    // Returns the value of the first Unicode scalar of this character.
    func unicodeScalarCodePoint() -> UInt32 {
        let characterString = String(self)
        let scalars = characterString.unicodeScalars

        return scalars[scalars.startIndex].value
    }
}

Then you can use it like so:

let char: Character = "A"
char.unicodeScalarCodePoint()
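
Note that this returns only the first scalar; as the comments point out, a Character can be built from several code points. If you need all of them, here is a sketch along the same lines (the property name unicodeScalarCodePoints is my own, not a standard library API):

extension Character {
    // Every Unicode scalar value that makes up this grapheme cluster.
    var unicodeScalarCodePoints: [UInt32] {
        return String(self).unicodeScalars.map { $0.value }
    }
}

let e: Character = "e\u{0308}" // "ë" built from two scalars
e.unicodeScalarCodePoints      // [101, 776]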

In summary, string and character encoding is tricky when you factor in all the possibilities. They went with this scheme so that each possibility can be represented.

Also remember this is a 1.0 release; I'm sure they will expand Swift's syntactic sugar soon.

chasew
Erik
  • 3
    `Character` values definitely have enough context to determine the code point: for example, they can be printed and concatenated to `String`s and other `Character`s, and the solution via `String` wouldn't work otherwise. Am I right in concluding that this is just missing from the standard library? – nathan Jun 08 '14 at 03:04
  • It is entirely possible that this was something they trimmed out of the 1.0 release for the sake of time. I can see it being "working enough for most developers" for the time being. – Erik Jun 08 '14 at 03:06
  • 1
    Not sure what Integer is but it's pretty hard to work with. I'd return Int instead. – Daniel Schlaug Jun 16 '14 at 11:41
  • 4
    Beta 4 added full Character support; a Character can now hold full grapheme clusters. See: [Strings in Swift](http://oleb.net/blog/2014/07/swift-strings/) by Ole Begemann – zaph Aug 08 '14 at 12:49
  • 2
    @Erik_at_Digit Your code doesn't work anymore; please see this question: http://stackoverflow.com/questions/30334653/strange-behavior-with-swift-compiler. Either way, you have to return a `UInt32` instead of an `Int` – Victor Sigler May 19 '15 at 19:52
23

I think there are some misunderstandings about Unicode. Unicode itself is NOT an encoding; it does not transform any grapheme cluster (or "character" in the human-reading sense) into any sort of binary sequence. Unicode is just a big table that collects all the grapheme clusters used by all languages on Earth (unofficially, it also includes Klingon). Those grapheme clusters are organized and indexed by code points (a 21-bit number in Swift, written like U+0041). You can find the character you are looking for in the big Unicode table by its code point.

Meanwhile, UTF-8, UTF-16, and UTF-32 are actually encodings. Yes, there is more than one way to encode Unicode characters into binary sequences. Which encoding to use depends on the project you are working on, but most web pages are encoded in UTF-8 (you can actually check this one right now).

Concept 1: A Unicode code point is called a Unicode scalar in Swift

A Unicode scalar is any Unicode code point in the range U+0000 to U+D7FF inclusive or U+E000 to U+10FFFF inclusive. Unicode scalars do not include the Unicode surrogate pair code points, which are the code points in the range U+D800 to U+DFFF inclusive.

Concept 2: A code unit is the abstract representation of the encoding.

Consider the following code snippet:

let theCat = "Cat!"

for codeUnit in theCat.utf8 {
    print("\(codeUnit) ", terminator: "") // each UTF-8 code unit, in decimal
}
print("")
for codeUnit in theCat.utf8 {
    print("\(String(codeUnit, radix: 2)) ", terminator: "") // each UTF-8 code unit, in binary
}
print("")


for codeUnit in theCat.utf16 {
    print("\(codeUnit) ", terminator: "") // each UTF-16 code unit, in decimal
}
print("")
for codeUnit in theCat.utf16 {
    print("\(String(codeUnit, radix: 2)) ", terminator: "") // each UTF-16 code unit, in binary
}
print("")

for scalar in theCat.unicodeScalars {
    print("\(scalar.value) ", terminator: "") // each Unicode scalar value (same as a UTF-32 code unit), in decimal
}
print("")
for scalar in theCat.unicodeScalars {
    print("\(String(scalar.value, radix: 2)) ", terminator: "") // each Unicode scalar value, in binary
}

Abstract representation means: a code unit written as a base-10 (decimal) number is equal to its base-2 encoding (binary sequence). The binary encoding is made for machines; the decimal code unit is for humans, as it is easier to read than a binary sequence.

Concept 3: A character may be represented by different Unicode code point(s), depending on which grapheme clusters it is composed of (this is why I said "characters" in the human-reading sense at the beginning).

Consider the following code snippet:

let precomposed: String = "\u{D55C}"
let decomposed: String = "\u{1112}\u{1161}\u{11AB}"
print(precomposed.characters.count) // prints "1"
print(decomposed.characters.count) // prints "1" => a Character is a whole grapheme cluster, not a single code point
print(precomposed) // prints "한"
print(decomposed) // prints "한"

The precomposed and decomposed strings are visually and linguistically equal, but they consist of different Unicode code points and therefore produce different code units when encoded with the same encoding (see the following example):

for preCha in precomposed.utf16 {
    print("\(preCha) ", terminator: "") // prints 54620
}

print("")

for deCha in decomposed.utf16 {
    print("\(deCha) ", terminator: "") // prints 4370 4449 4523
}

Extra example

var word = "cafe"
print("the number of characters in \(word) is \(word.characters.count)")
// prints "the number of characters in cafe is 4"

word += "\u{301}" // COMBINING ACUTE ACCENT, U+0301

print("the number of characters in \(word) is \(word.characters.count)")
// prints "the number of characters in café is 4"

Summary: code points, a.k.a. the position indexes of characters in Unicode, have nothing to do with the UTF-8, UTF-16, and UTF-32 encoding schemes.
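
As a quick demonstration of this independence (a sketch in current Swift; Unicode.Scalar is the standard library's code point type):

let scalar: Unicode.Scalar = "\u{E9}" // é, code point 233
print(Array(String(scalar).utf8))  // [195, 169], two UTF-8 code units
print(Array(String(scalar).utf16)) // [233], one UTF-16 code unit
print(scalar.value)                // 233, the code point itself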

Further Readings:

http://www.joelonsoftware.com/articles/Unicode.html

http://kunststube.net/encoding/

https://www.mikeash.com/pyblog/friday-qa-2015-11-06-why-is-swifts-string-api-so-hard.html

0xF
SLN
8

I think the issue is that Character doesn't represent a Unicode code point. It represents a "Unicode grapheme cluster", which can consist of multiple code points.

Instead, UnicodeScalar represents a Unicode code point.
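
For example, a minimal sketch of the distinction:

let scalar: Unicode.Scalar = "A"
print(scalar.value) // 65, a single code point

// A Character, by contrast, can hold several scalars:
let ch: Character = "\u{1112}\u{1161}\u{11AB}" // 한 as one grapheme cluster
print(String(ch).unicodeScalars.count) // 3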

newacct
  • 3
    I edited the question to use "code point(s)" instead of "code point." I was mostly thinking of characters that did represent a single code point when I wrote it, but that's not really the issue here--a `Character` is still a container for Unicode scalars, which you should be able to extract directly, i.e., without first converting it to a `String`. – nathan Aug 09 '14 at 16:08
7

I agree with you; there should be a way to get the code directly from a Character. But all I can offer is a shorthand:

let ch: Character = "A"
for code in String(ch).utf8 { println(code) }
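
In current Swift (a sketch assuming a modern toolchain, where println was replaced by print), the same shorthand reads:

let ch: Character = "A"
for code in String(ch).utf8 { print(code) } // 65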
evpozdniakov
  • I got here looking for how to get ASCII values from a Swift `Character` (specifically for the Latin alphabet, A-Z and a-z). This works wonderfully in Swift 3.0.1, iOS 10 with the minor change of using `print(code)` rather than `println(code)`. – leanne Dec 04 '16 at 15:50
3

#1. Using Unicode.Scalar's value property

With Swift 5, Unicode.Scalar has a value property that has the following declaration:

A numeric representation of the Unicode scalar.

var value: UInt32 { get }

The following Playground sample code shows how to iterate over the unicodeScalars property of a Character and print the value of each Unicode scalar that composes it:

let character: Character = "A"
for scalar in character.unicodeScalars {
    print(scalar.value)
}

/*
 prints: 65
 */

As an alternative, you can use the sample code below if you only want to print the value of the first Unicode scalar of a Character:

let character: Character = "A"
let scalars = character.unicodeScalars
let firstScalar = scalars[scalars.startIndex]
print(firstScalar.value)

/*
 prints: 65
 */

#2. Using Character's asciiValue property

If what you really want is the ASCII encoding value of a character, you can use Character's asciiValue property. asciiValue has the following declaration:

Returns the ASCII encoding value of this Character, if ASCII.

var asciiValue: UInt8? { get }

The Playground sample code below shows how to use asciiValue:

let character: Character = "A"
print(String(describing: character.asciiValue))

/*
 prints: Optional(65)
 */

let anotherCharacter: Character = "П"
print(String(describing: anotherCharacter.asciiValue))

/*
 prints: nil
 */
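
Since asciiValue is an Optional, here is a short unwrap sketch (the variable names are just for illustration):

let letter: Character = "A"
if let ascii = letter.asciiValue {
    print(ascii) // prints: 65
}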
Imanou Petit
0

Have you tried:

import Foundation

let characterString: String = "abc"
var numbers: [Int] = Array<Int>()
for character in characterString.utf8 {
    let stringSegment: String = "\(character)"
    let anInt: Int = stringSegment.toInt()!
    numbers.append(anInt)
}

numbers

Output:

[97, 98, 99]

The string may also contain only a single Character.
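
In current Swift, where toInt() no longer exists, the same result can be had without any string round-trip (a sketch):

let characterString = "abc"
let numbers = characterString.utf8.map { Int($0) } // [97, 98, 99]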

Binarian
  • utf8 gives a collection of CodeUnits not Ints. CodeUnits might print as Ints but they are not convertible to Ints. So unfortunately this won't work - pity. – Howard Lovatt Aug 15 '14 at 04:12
  • 1
    @HowardLovatt You can convert the String with the numbers to Int. Split the String with space as separator and then use `.toInt()` on each substring. Then you have [Int]. – Binarian Aug 15 '14 at 09:24
  • Updated the answer, to get an Array of `Int` s – Binarian Aug 15 '14 at 09:35
  • Doesn't work as of version 6.1.1 (6A2008a) because toInt() tries to convert the string content to an integer if it represents a number. It does **not** return the Unicode code point. – Mike Lischke Dec 06 '14 at 16:51