25

We know we can print each character of a string as UTF-8 code units. Then, if we have the code units of those characters, how can we create a String from them?
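
For context, printing a string's UTF-8 code units is straightforward; a minimal sketch:

```swift
let string = "Café"
for codeUnit in string.utf8 {
    print(codeUnit, terminator: " ")
}
// prints: 67 97 102 195 169
```

The question is how to go the other way, from those code units back to a String.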

Imanou Petit
jxwho

10 Answers

18

With Swift 5, you can choose one of the following ways in order to convert a collection of UTF-8 code units into a string.


#1. Using String's init(_:) initializer

If you have a String.UTF8View instance (i.e. a collection of UTF-8 code units) and want to convert it to a string, you can use the init(_:) initializer, which has the following declaration:

init(_ utf8: String.UTF8View)

Creates a string corresponding to the given sequence of UTF-8 code units.

The Playground sample code below shows how to use init(_:):

let string = "Café "
let utf8View: String.UTF8View = string.utf8

let newString = String(utf8View)
print(newString) // prints: Café 

#2. Using Swift's init(decoding:as:) initializer

init(decoding:as:) creates a string from the given Unicode code units collection in the specified encoding:

let string = "Café "
let codeUnits: [Unicode.UTF8.CodeUnit] = Array(string.utf8)

let newString = String(decoding: codeUnits, as: UTF8.self)
print(newString) // prints: Café 

Note that init(decoding:as:) also works with a String.UTF8View parameter:

let string = "Café "
let utf8View: String.UTF8View = string.utf8

let newString = String(decoding: utf8View, as: UTF8.self)
print(newString) // prints: Café 
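
Note that init(decoding:as:) never fails: ill-formed input is repaired by substituting the Unicode replacement character (U+FFFD). A small sketch (the truncated byte sequence is a made-up example):

```swift
let illFormed: [UInt8] = [67, 97, 102, 0xC3] // "Caf" plus a truncated "é"
let repaired = String(decoding: illFormed, as: UTF8.self)
print(repaired) // prints: Caf� ("Caf" followed by U+FFFD)
```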

#3. Using transcode(_:from:to:stoppingOnError:into:) function

The following example transcodes the UTF-8 representation of an initial string into Unicode scalar values (UTF-32 code units) that can be used to build a new string:

let string = "Café "
let bytes = Array(string.utf8)

var newString = ""
_ = transcode(bytes.makeIterator(), from: UTF8.self, to: UTF32.self, stoppingOnError: true, into: {
    newString.append(String(Unicode.Scalar($0)!))
})
print(newString) // prints: Café 
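
Rather than discarding it with _, you can also use the Bool that transcode(_:from:to:stoppingOnError:into:) returns, which indicates whether an encoding error was detected. A sketch with a deliberately ill-formed input:

```swift
let illFormed: [UInt8] = [67, 97, 102, 0xC3] // "Caf" plus a truncated "é"
var result = ""
let hadError = transcode(illFormed.makeIterator(), from: UTF8.self, to: UTF32.self,
                         stoppingOnError: true, into: {
    result.append(String(Unicode.Scalar($0)!))
})
print(hadError) // prints: true
print(result)   // prints: Caf
```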

#4. Using Array's withUnsafeBufferPointer(_:) method and String's init(cString:) initializer

init(cString:) has the following declaration:

init(cString: UnsafePointer<CChar>)

Creates a new string by copying the null-terminated UTF-8 data referenced by the given pointer.

The following example shows how to use init(cString:) with a pointer to the content of a CChar array (i.e. a well-formed UTF-8 code unit sequence) in order to create a string from it:

let bytes: [CChar] = [67, 97, 102, -61, -87, 32, -16, -97, -121, -85, -16, -97, -121, -73, 0]

let newString = bytes.withUnsafeBufferPointer({ (bufferPointer: UnsafeBufferPointer<CChar>) in
    return String(cString: bufferPointer.baseAddress!)
})
print(newString) // prints: Café 

#5. Using Unicode.UTF8's decode(_:) method

To decode a code unit sequence, call decode(_:) repeatedly until it returns UnicodeDecodingResult.emptyInput:

let string = "Café "
let codeUnits = Array(string.utf8)

var codeUnitIterator = codeUnits.makeIterator()
var utf8Decoder = Unicode.UTF8()
var newString = ""

Decode: while true {
    switch utf8Decoder.decode(&codeUnitIterator) {
    case .scalarValue(let value):
        newString.append(Character(Unicode.Scalar(value)))
    case .emptyInput:
        break Decode
    case .error:
        print("Decoding error")
        break Decode
    }
}

print(newString) // prints: Café 

#6. Using String's init(bytes:encoding:) initializer

Foundation gives String an init(bytes:encoding:) initializer that you can use as indicated in the Playground sample code below:

import Foundation

let string = "Café "
let bytes: [Unicode.UTF8.CodeUnit] = Array(string.utf8)

let newString = String(bytes: bytes, encoding: String.Encoding.utf8)
print(String(describing: newString)) // prints: Optional("Café ")
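
Unlike init(decoding:as:), which repairs ill-formed input with U+FFFD, init(bytes:encoding:) validates its input and returns nil when the bytes are not valid in the given encoding, so it is handy when you need to detect bad data. A sketch with a made-up ill-formed sequence:

```swift
import Foundation

let illFormed: [UInt8] = [67, 97, 102, 0xC3] // truncated UTF-8
let maybeString = String(bytes: illFormed, encoding: .utf8)
print(maybeString as Any) // prints: nil
```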
Imanou Petit
15

It's possible to convert UTF-8 code units to a Swift String idiomatically using Swift's UTF8 codec type. (Going the other way, from String to UTF-8, is much easier!)

import Foundation

public class UTF8Encoding {
  public static func encode(bytes: Array<UInt8>) -> String {
    var encodedString = ""
    var decoder = UTF8()
    var generator = bytes.generate()
    var finished: Bool = false
    do {
      let decodingResult = decoder.decode(&generator)
      switch decodingResult {
      case .Result(let char):
        encodedString.append(char)
      case .EmptyInput:
        finished = true
      /* ignore errors and unexpected values */
      case .Error:
        finished = true
      default:
        finished = true
      }
    } while (!finished)
    return encodedString
  }

  public static func decode(str: String) -> Array<UInt8> {
    var decodedBytes = Array<UInt8>()
    for b in str.utf8 {
      decodedBytes.append(b)
    }
    return decodedBytes
  }
}

func testUTF8Encoding() {
  let testString = "A UTF8 String With Special Characters: "
  let decodedArray = UTF8Encoding.decode(testString)
  let encodedString = UTF8Encoding.encode(decodedArray)
  XCTAssert(encodedString == testString, "UTF8Encoding is lossless: \(encodedString) != \(testString)")
}

Of the other alternatives suggested:

  • Using NSString invokes the Objective-C bridge;

  • Using UnicodeScalar is error-prone because it converts UnicodeScalars directly to Characters, ignoring complex grapheme clusters; and

  • Using String.fromCString is potentially unsafe as it uses pointers.

Tim WB
  • Thank you for decoding UTF8 encoding! You can remove `import Foundation` from the top, that's the whole reason I want to use this.. – ephemer Sep 28 '15 at 23:25
  • Thanks! Very helpful. Here is a link to the Sandbox with this working with a couple updates and made decode a bit easier. http://swiftlang.ng.bluemix.net/#/repl/2dde62756a95d6d1c7bb88068cb35ebfe4b13ffc3ec891856992166caa8a291d – Pat Apr 13 '16 at 22:45
  • Your use of the words "encode" and "decode" are the opposite of how I think about the conversions between strings and UTF-8 data. – RenniePet Jan 10 '17 at 13:20
5

To improve on Martin R's answer:

import AppKit

let utf8 : CChar[] = [65, 66, 67, 0]
let str = NSString(bytes: utf8, length: utf8.count, encoding: NSUTF8StringEncoding)
println(str) // Output: ABC

import AppKit

let utf8 : UInt8[] = [0xE2, 0x82, 0xAC, 0]
let str = NSString(bytes: utf8, length: utf8.count, encoding: NSUTF8StringEncoding)
println(str) // Output: €

What happens is that an Array can be automatically converted to a CConstVoidPointer, which can be used to create a string with NSString(bytes: CConstVoidPointer, length len: Int, encoding: UInt).

Bryan Chen
4

Swift 3

let s = String(bytes: arr, encoding: .utf8)
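
This assumes an existing byte array arr; a complete sketch, with a made-up byte array, including the optional this initializer returns:

```swift
import Foundation

let arr: [UInt8] = [67, 97, 102, 195, 169] // "Café" in UTF-8
if let s = String(bytes: arr, encoding: .utf8) {
    print(s) // prints: Café
}
```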

Alex Shubin
2

I've been looking for a comprehensive answer regarding string manipulation in Swift myself. Relying on casts to and from NSString and other unsafe pointer magic just wasn't doing it for me. Here's a safe alternative:

First, we'll want to extend UInt8. This is the primitive type behind CodeUnit.

extension UInt8 {
    var character: Character {
        return Character(UnicodeScalar(self))
    }
}

This will allow us to do something like this:

let codeUnits: [UInt8] = [
    72, 69, 76, 76, 79
]

let characters = codeUnits.map { $0.character }
let string     = String(characters)

// string prints "HELLO"

Equipped with this extension, we can now begin modifying strings.

let string = "ABCDEFGHIJKLMONP"

var modifiedCharacters = [Character]()
for (index, utf8unit) in string.utf8.enumerate() {

    // Insert a "-" every 4 characters
    if index > 0 && index % 4 == 0 {
        let separator: UInt8 = 45 // "-" in ASCII
        modifiedCharacters.append(separator.character)
    }
    modifiedCharacters.append(utf8unit.character)
}

let modifiedString = String(modifiedCharacters)

// modified string == "ABCD-EFGH-IJKL-MONP"
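
As the comments below note, mapping each UTF-8 code unit to a Character only works for single-byte (ASCII) input. A variant that is safe for multi-byte characters iterates over Characters instead of bytes (a sketch in current Swift syntax):

```swift
let input = "ABCDÆFGH" // "Æ" is two bytes in UTF-8, but one Character
var chars = [Character]()
for (index, char) in input.enumerated() {
    // Insert a "-" every 4 characters
    if index > 0 && index % 4 == 0 {
        chars.append("-")
    }
    chars.append(char)
}
print(String(chars)) // prints: ABCD-ÆFGH
```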
dbart
  • Am I correct in assuming that this will only work with ASCII character strings? I.e., it will mess things up if there are Danish letters Æ Ø Å æ ø å in the string? Or accented letters? Not to mention other alphabets like Russian cyrillic and the Greek alphabet and Chinese and ... – RenniePet Dec 14 '16 at 06:55
  • Yes, that assumption is correct. This solution will only work for single byte (ASCII) characters only and will quickly break on anything like emoji or international characters. – dbart Dec 19 '16 at 19:51
2
// Swift4
var units = [UTF8.CodeUnit]()
//
// update units
//
let str = String(decoding: units, as: UTF8.self)
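
For instance, filling units with made-up example bytes:

```swift
var units = [UTF8.CodeUnit]()
units.append(contentsOf: [72, 105, 33]) // UTF-8 for "Hi!"
let str = String(decoding: units, as: UTF8.self)
print(str) // prints: Hi!
```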
Qinghua
  • While this code snippet may be the solution, [including an explanation](https://meta.stackexchange.com/questions/114762/explaining-entirely-%E2%80%8C%E2%80%8Bcode-based-answers) really helps to improve the quality of your post. Remember that you are answering the question for readers in the future, and those people might not know the reasons for your code suggestion. – Narendra Jadhav Jun 17 '18 at 07:40
1

This is a possible solution (now updated for Swift 2):

let utf8 : [CChar] = [65, 66, 67, 0]
if let str = utf8.withUnsafeBufferPointer( { String.fromCString($0.baseAddress) }) {
    print(str) // Output: ABC
} else {
    print("Not a valid UTF-8 string") 
}

Within the closure, $0 is a UnsafeBufferPointer<CChar> pointing to the array's contiguous storage. From that a Swift String can be created.

Alternatively, if you prefer the input as unsigned bytes:

let utf8 : [UInt8] = [0xE2, 0x82, 0xAC, 0]
if let str = utf8.withUnsafeBufferPointer( { String.fromCString(UnsafePointer($0.baseAddress)) }) {
    print(str) // Output: €
} else {
    print("Not a valid UTF-8 string")
}
Martin R
1

I would do something like this. It may not be as elegant as working with 'pointers', but it does the job well; it's essentially a bunch of new += operators for String, like:

@infix func += (inout lhs: String, rhs: (unit1: UInt8)) {
    lhs += Character(UnicodeScalar(UInt32(rhs.unit1)))
}

@infix func += (inout lhs: String, rhs: (unit1: UInt8, unit2: UInt8)) {
    lhs += Character(UnicodeScalar(UInt32(rhs.unit1) << 8 | UInt32(rhs.unit2)))
}

@infix func += (inout lhs: String, rhs: (unit1: UInt8, unit2: UInt8, unit3: UInt8, unit4: UInt8)) {
    lhs += Character(UnicodeScalar(UInt32(rhs.unit1) << 24 | UInt32(rhs.unit2) << 16 | UInt32(rhs.unit3) << 8 | UInt32(rhs.unit4)))
}

NOTE: you can extend the list of supported operators by overloading the + operator as well, defining a list of fully commutative operators for String.


and now you are able to append a String with a Unicode (UTF-8, UTF-16 or UTF-32) character, e.g.:

var string: String = "signs of the Zodiac: "
string += (0x0, 0x0, 0x26, 0x4b)
string += (38)
string += (0x26, 76)
holex
  • Just a remark: Your code creates a String from UTF-32 input (if I understand it correctly) and mine from UTF-8 input. Reading the question again I am not 100% sure what is requested here. OP mentions both "UTF-8" and "Code point" ... – Martin R Jun 28 '14 at 11:19
  • @MartinR, you are right, to be fair, I'm not sure about the real question either, the reason is just the same as you just said... – holex Jun 28 '14 at 11:40
  • Note that the UTF-8 sequence for a Unicode code point has 1, 2, 3, or 4 bytes. – Martin R Jun 28 '14 at 12:39
1

If you're starting with a raw buffer, such as from the Data object returned from a file handle (in this case, taken from a Pipe object):

let data = pipe.fileHandleForReading.readDataToEndOfFile()

// Allocate one extra byte so the buffer can be null-terminated,
// since init(cString:) expects a null-terminated UTF-8 sequence.
let unsafePointer = UnsafeMutablePointer<UInt8>.allocate(capacity: data.count + 1)
defer { unsafePointer.deallocate() }

data.copyBytes(to: unsafePointer, count: data.count)
unsafePointer[data.count] = 0

let output = String(cString: unsafePointer)
johnkzin
0

Here is a Swift 3.0 version of the UTF8Encoding answer above:

public class UTF8Encoding {
  public static func encode(bytes: Array<UInt8>) -> String {
    var encodedString = ""
    var decoder = UTF8()
    var generator = bytes.makeIterator()
    var finished: Bool = false
    repeat {
      let decodingResult = decoder.decode(&generator)
      switch decodingResult {
      case .scalarValue(let char):
        encodedString += "\(char)"
      case .emptyInput:
        finished = true
      case .error:
        finished = true
      }
    } while (!finished)
    return encodedString
  }
  public static func decode(str: String) -> Array<UInt8> {
    var decodedBytes = Array<UInt8>()
    for b in str.utf8 {
      decodedBytes.append(b)
    }
    return decodedBytes
  }
}

If you want to show an emoji from a UTF-8 string, just use the convertEmojiCodesToString method below. It works properly for strings like "U+1F52B" (emoji) or "U+1F1E6 U+1F1F1" (country flag emoji).

import Foundation

class EmojiConverter {
  static func convertEmojiCodesToString(_ emojiCodesString: String) -> String {
    let emojies = emojiCodesString.components(separatedBy: " ")
    var resultString = ""
    for emoji in emojies {
      var formattedCode = emoji
      // Drop the "U+" prefix, leaving the bare hexadecimal scalar value.
      formattedCode = String(formattedCode.dropFirst(2)).lowercased()
      if let charCode = UInt32(formattedCode, radix: 16),
        let unicode = UnicodeScalar(charCode) {
        let str = String(unicode)
        resultString += "\(str)"
      }
    }
    return resultString
  }
}
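
A usage sketch (assuming any custom String helpers the class relies on are defined); U+1F1E6 and U+1F1F1 are the regional indicators for "A" and "L", which combine into a flag:

```swift
let flag = EmojiConverter.convertEmojiCodesToString("U+1F1E6 U+1F1F1")
print(flag) // prints the 🇦🇱 flag emoji
```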
Community