componentsSeparatedByString by multiple separators in Swift

Question

So here is the string s:

"Hi! How are you? I'm fine. It is 6 p.m. Thank you! That's it."

I want it to be split into an array like this:

["Hi", "How are you", "I'm fine", "It is 6 p.m", "Thank you", "That's it"]

Which means the separators should be ". " + "? " + "! "

I've tried:

let charSet = NSCharacterSet(charactersInString: ".?!")
let array = s.componentsSeparatedByCharactersInSet(charSet)

But that also splits p.m. into two elements. Result:

["Hi", " How are you", " I'm fine", " It is 6 p", "m", " Thank you", " That's it"]

I've also tried

let array = s.componentsSeparatedByString(". ")

It works well for separating on ". ", but if I also want to separate on "? " and "! ", it becomes messy.

Is there any way I can do it? Thanks!

The easy way is to use `componentsSeparatedByString`, but it will still fail, since not every sentence ends with a space. — R Menke, Dec 13 '15 at 02:29
@RMenke I've tried `let array = s.componentsSeparatedByString(". ")`. It works well for separating on `". "`, but if I also want to separate on `"? "` and `"! "`, it becomes messy. — He Yifei 何一非, Dec 13 '15 at 02:31
@RMenke Just ignore the last sentence in the string, which does not end with a space. How can I separate the previous sentences? :) — He Yifei 何一非, Dec 13 '15 at 02:33

score 6 · Accepted Answer · answered Dec 13 '15 at 03:43

There is a method provided that lets you enumerate a string. You can do so by words or sentences or other options. No need for regular expressions.

let s = "Hi! How are you? I'm fine. It is 6 p.m. Thank you! That's it."
var sentences = [String]()
s.enumerateSubstringsInRange(s.startIndex..<s.endIndex, options: .BySentences) { 
    substring, substringRange, enclosingRange, stop in
    sentences.append(substring!)
}
print(sentences)

The result is:

["Hi! ", "How are you? ", "I\'m fine. ", "It is 6 p.m. ", "Thank you! ", "That\'s it."]
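The substrings above still carry their terminator and trailing space, while the question asks for bare text. A minimal sketch of one way to close that gap in current Swift syntax follows. The names `trimmedSentence` and `systemSentences` are hypothetical helpers introduced here, not part of the answer, and the trimming step is an assumption about what the asker wants.

```swift
import Foundation

// Hypothetical helper: strips whitespace and the terminators . ! ? from both ends.
// Caveat: it also removes such characters from the start of the string.
func trimmedSentence(_ raw: String) -> String {
    let unwanted = CharacterSet(charactersIn: ".!?").union(.whitespacesAndNewlines)
    return raw.trimmingCharacters(in: unwanted)
}

// Hypothetical wrapper around the sentence enumeration shown in the answer,
// written with the Swift 3+ method names.
func systemSentences(in text: String) -> [String] {
    var sentences = [String]()
    text.enumerateSubstrings(in: text.startIndex..<text.endIndex,
                             options: .bySentences) { substring, _, _, _ in
        // substring is optional; skip it if the system reports none.
        if let substring = substring {
            sentences.append(trimmedSentence(substring))
        }
    }
    return sentences
}
```

Applied to the substrings listed in the result above, the trimming step turns "It is 6 p.m. " into "It is 6 p.m", which is the form the question asked for.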

rmaddy

so `BySentences` is a function included in system itself? – He Yifei 何一非 Dec 13 '15 at 03:49
Yes. See the docs for `NSString enumerateSubstringsInRange:options:usingBlock:`. – rmaddy Dec 13 '15 at 03:50

Rob · Answer 2 · 2016-10-28T17:05:00.847

rmaddy's answer is correct (+1). A Swift 3 implementation is:

var sentences = [String]()

string.enumerateSubstrings(in: string.startIndex ..< string.endIndex, options: .bySentences) { substring, substringRange, enclosingRange, stop in
    sentences.append(substring!)
}

You can also use a regular expression (NSRegularExpression), though it's much hairier than rmaddy's .bySentences solution. In Swift 3:

var sentences = [String]()

let regex = try! NSRegularExpression(pattern: "(^|\\s+)(\\w.*?[.!?]+)(?=(\\s+|$))")
// NSRange counts UTF-16 code units, so use utf16.count rather than characters.count.
regex.enumerateMatches(in: string, range: NSMakeRange(0, string.utf16.count)) { match, flags, stop in
    sentences.append((string as NSString).substring(with: match!.rangeAt(2)))
}

Or Swift 2:

let regex = try! NSRegularExpression(pattern: "(^|\\s+)(\\w.*?[.!?]+)(?=(\\s+|$))", options: [])
var sentences = [String]()
// NSRange counts UTF-16 code units, so use utf16.count rather than characters.count.
regex.enumerateMatchesInString(string, options: [], range: NSMakeRange(0, string.utf16.count)) { match, flags, stop in
    sentences.append((string as NSString).substringWithRange(match!.rangeAtIndex(2)))
}

The [.!?] syntax matches any of those three characters. The | means "or". The ^ matches the start of the string. The $ matches the end of the string. The \\s matches a whitespace character. The \\w matches a "word" character. The . matches any character except a line break. The * matches zero or more of the preceding item, and the ? after it makes that match lazy, so it stops at the first terminator that is followed by whitespace or the end of the string. The + matches one or more of the preceding item. The (?=) is a look-ahead assertion (i.e. it checks that something is there, but doesn't advance through that match).

I've tried to simplify this a bit, and it's still pretty complicated. Regular expressions offer rich text pattern matching, but, admittedly, they are a little dense when you first use them. But this rendition matches (a) repeated punctuation (e.g. "Thank you!!!"), (b) leading spaces, and (c) trailing spaces, too.
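For later Swift versions, the same pattern can be applied without bridging to NSString. This is only a sketch: the function name `regexSentences` is hypothetical, and the range conversions (`NSRange(_:in:)` and `Range(_:in:)`) are a substitution for the original `NSMakeRange` call, not something the answer itself uses.

```swift
import Foundation

// Hypothetical helper: returns capture group 2 (the sentence text) of every match.
func regexSentences(in string: String) -> [String] {
    let pattern = "(^|\\s+)(\\w.*?[.!?]+)(?=(\\s+|$))"
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return [] }
    // Build the NSRange from String indices so non-ASCII text is measured correctly.
    let whole = NSRange(string.startIndex..<string.endIndex, in: string)
    return regex.matches(in: string, range: whole).compactMap { match in
        // Convert the UTF-16 based NSRange of group 2 back to String indices.
        Range(match.range(at: 2), in: string).map { String(string[$0]) }
    }
}

let sample = "Hi! How are you? I'm fine. It is 6 p.m. Thank you! That's it."
print(regexSentences(in: sample))
// ["Hi!", "How are you?", "I'm fine.", "It is 6 p.m.", "Thank you!", "That's it."]
```

Note that "p.m." survives intact because the first "." is followed by "m" rather than whitespace, so the look-ahead rejects it and the lazy match keeps extending.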

score 2 · Answer 3 · answered Mar 31 '19 at 22:46

If the splitting basis is something a little more esoteric than sentences, this extension could work.

extension String {
    public func components(separatedBy separators: [String]) -> [String] {
        var output: [String] = [self]
        for separator in separators {
            output = output.flatMap { $0.components(separatedBy: separator) }
        }
        return output.map { $0.trimmingCharacters(in: .whitespaces)}
    }
}

let artists = "Rihanna, featuring Calvin Harris".components(separatedBy: [", with", ", featuring"])
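To see what this successive-split approach does with the separators from the question, here is a small sketch. It uses a free function with the hypothetical name `splitSuccessively` instead of the extension, purely to keep the example self-contained; the logic is the same.

```swift
import Foundation

// Hypothetical free-function version of the extension above:
// split by each separator in turn, then trim surrounding whitespace.
func splitSuccessively(_ text: String, by separators: [String]) -> [String] {
    var output: [String] = [text]
    for separator in separators {
        // Every piece produced so far is split again by the next separator.
        output = output.flatMap { $0.components(separatedBy: separator) }
    }
    return output.map { $0.trimmingCharacters(in: .whitespaces) }
}

let s = "Hi! How are you? I'm fine. It is 6 p.m. Thank you! That's it."
print(splitSuccessively(s, by: [". ", "? ", "! "]))
// ["Hi", "How are you", "I'm fine", "It is 6 p.m", "Thank you", "That's it."]

print(splitSuccessively("Rihanna, featuring Calvin Harris", by: [", with", ", featuring"]))
// ["Rihanna", "Calvin Harris"]
```

The last element keeps its period because the final sentence is not followed by a space, which is the limitation raised in the comments on the question.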

This is a very good solution! But I need only the first occurrence of each set of characters. Is it possible to achieve? — Diego Jiménez, Jan 11 '21 at 20:38

score 0 · Answer 4 · answered Dec 13 '15 at 03:36

I tried to find a regex to solve this too: (([^.!?]+\s)*\S+(\.|!|\?)) Here is the explanation from regexper, and an example.

mt81

score 0 · Answer 5 · edited May 23 '17 at 12:19

Well I've found a regex too from here

var pattern = "(?<=[.?!;…])\\s+(?=[\\p{Lu}\\p{N}])"

let s = "Hi! How are you? I'm fine. It is 6 p.m. Thank you! That's it."

let sReplaced = s.stringByReplacingOccurrencesOfString(pattern, withString:"[*-SENTENCE-*]" as String, options:NSStringCompareOptions.RegularExpressionSearch, range:nil)

let array = sReplaced.componentsSeparatedByString("[*-SENTENCE-*]")

Perhaps it's not a good way, as it has to first replace and then separate the string. :)
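The snippet above uses Swift 2 method names. A sketch of the same replace-then-split idea in Swift 3 and later might look like the following; the constant name `marker` is introduced here for illustration, and the placeholder text is assumed not to occur in the input.

```swift
import Foundation

// Split point: whitespace that follows a terminator and precedes
// an uppercase letter or a digit.
let pattern = "(?<=[.?!;…])\\s+(?=[\\p{Lu}\\p{N}])"
let marker = "[*-SENTENCE-*]"   // assumed to be absent from the input

let s = "Hi! How are you? I'm fine. It is 6 p.m. Thank you! That's it."

// Step 1: replace every split point with the placeholder.
let marked = s.replacingOccurrences(of: pattern, with: marker, options: .regularExpression)
// Step 2: split on the placeholder.
let array = marked.components(separatedBy: marker)

print(array)
// ["Hi!", "How are you?", "I'm fine.", "It is 6 p.m.", "Thank you!", "That's it."]
```

Because only the whitespace is matched, each sentence keeps its terminator, and "p.m." is not split since the "." inside it is followed by a lowercase letter rather than whitespace.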

UPDATE:

For the regex part, if you also want to match Chinese/Japanese punctuation (where a space after each punctuation mark is not required), you can use the following one:

((?<=[.?!;…])\\s+|(?<=[。！？；…])\\s*)(?=[\\p{L}\\p{N}])
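Since this pattern can match a zero-width position (no space after full-width punctuation), one way to apply it is to cut the string at each match directly instead of inserting a placeholder. The following is a sketch under that assumption; `splitAtBoundaries` is a hypothetical name, and this is not the method used in the answer.

```swift
import Foundation

// Hypothetical helper: cuts the text at every match of the boundary pattern,
// dropping the matched whitespace (if any).
func splitAtBoundaries(_ text: String, pattern: String) -> [String] {
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return [text] }
    let whole = NSRange(text.startIndex..<text.endIndex, in: text)
    var pieces = [String]()
    var cursor = text.startIndex
    for match in regex.matches(in: text, range: whole) {
        guard let gap = Range(match.range, in: text) else { continue }
        pieces.append(String(text[cursor..<gap.lowerBound]))
        cursor = gap.upperBound   // equals lowerBound for a zero-width match
    }
    pieces.append(String(text[cursor...]))
    return pieces
}

let cjkPattern = "((?<=[.?!;…])\\s+|(?<=[。！？；…])\\s*)(?=[\\p{L}\\p{N}])"
print(splitAtBoundaries("你好！我很好。How are you? Fine.", pattern: cjkPattern))
```

Note that this variant looks ahead for any letter (\p{L}) rather than only uppercase ones, so Western sentences that start in lowercase are split as well.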
