Number of words in a Swift String for word count calculation

Question

I want to write a procedure that finds how many words there are in a string, where words are separated by a space, a comma, or some other character, and then add up the total later.

I'm making an average calculator, so I want the total count of data and then add up all the words.

Leo Dabus · Accepted Answer · 2019-06-05T14:29:46.893

update: Xcode 10.2.x • Swift 5 or later

Using the Foundation method enumerateSubstrings(in:options:) and passing .byWords as the options:

let sentence = "I want to an algorithm that could help find out how many words are there in a string separated by space or comma or some character. And then append each word separated by a character to an array which could be added up later I'm making an average calculator so I want the total count of data and then add up all the words. By words I mean the numbers separated by a character, preferably space Thanks in advance"

import Foundation

var words: [Substring] = []
sentence.enumerateSubstrings(in: sentence.startIndex..., options: .byWords) { _, range, _, _ in
    words.append(sentence[range])
}
print(words) // ["I", "want", "to", "an", "algorithm", "that", "could", "help", "find", "out", "how", "many", "words", "are", "there", "in", "a", "string", "separated", "by", "space", "or", "comma", "or", "some", "character", "And", "then", "append", "each", "word", "separated", "by", "a", "character", "to", "an", "array", "which", "could", "be", "added", "up", "later", "I'm", "making", "an", "average", "calculator", "so", "I", "want", "the", "total", "count", "of", "data", "and", "then", "add", "up", "all", "the", "words", "By", "words", "I", "mean", "the", "numbers", "separated", "by", "a", "character", "preferably", "space", "Thanks", "in", "advance"]
print(words.count)  // 79

Or using the native Character property isLetter, new in Swift 5, and the split method:

let words = sentence.split { !$0.isLetter }

print(words) // ["I", "want", "to", "an", "algorithm", "that", "could", "help", "find", "out", "how", "many", "words", "are", "there", "in", "a", "string", "separated", "by", "space", "or", "comma", "or", "some", "character", "And", "then", "append", "each", "word", "separated", "by", "a", "character", "to", "an", "array", "which", "could", "be", "added", "up", "later", "I", "m", "making", "an", "average", "calculator", "so", "I", "want", "the", "total", "count", "of", "data", "and", "then", "add", "up", "all", "the", "words", "By", "words", "I", "mean", "the", "numbers", "separated", "by", "a", "character", "preferably", "space", "Thanks", "in", "advance"]

print(words.count)  // 80

Extending StringProtocol to support Substrings as well:

extension StringProtocol {
    var words: [SubSequence] { 
        return split { !$0.isLetter } 
    }
    var byWords: [SubSequence] {
        var byWords: [SubSequence] = []
        enumerateSubstrings(in: startIndex..., options: .byWords) { _, range, _, _ in
            byWords.append(self[range])
        }
        return byWords
    }
}

sentence.words  // ["I", "want", "to", "an", "algorithm", "that", "could", "help", "find", "out", "how", "many", "words", "are", "there", "in", "a", "string", "separated", "by", "space", "or", "comma", "or", "some", "character", "And", "then", "append", "each", "word", "separated", "by", "a", "character", "to", "an", "array", "which", "could", "be", "added", "up", "later", "I", "m", "making", "an", "average", "calculator", "so", "I", "want", "the", "total", "count", "of", "data", "and", "then", "add", "up", "all", "the", "words", "By", "words", "I", "mean", "the", "numbers", "separated", "by", "a", "character", "preferably", "space", "Thanks", "in", "advance"]

This also removes apostrophes so the word I'm is reduced to Im — Ian, Jun 06 '15 at 13:26
+1 for the `enumerateSubstrings` solution, because it also works with languages that don't use space frequently, like **Japanese or Chinese**. background info: https://medium.com/@sorenlind/three-ways-to-enumerate-the-words-in-a-string-using-swift-7da5504f0062 — Daniel, Jun 10 '17 at 00:36
The `split { !$0.isLetter }` approach is not too friendly to strings with numbers and special symbols, such as `let sentence = "I need some 100% algorithm that could split at least 20 words"`. The `enumerateSubstrings(in: )` method doesn't split "I'm" into the pronoun "I" and the verb "'m", as we can see right from the output. — Paul B, Jun 01 '20 at 16:06
@PaulB Anyway there is no magic behind the methods. If you want to split those as well you would have to run your own custom method replacing those occurrences before enumerating. — Leo Dabus, Jun 01 '20 at 16:25
You can also try `split { $0.isPunctuation || $0.isWhitespace }`, but I don't know what your purpose is. Are you going to show `m` as a word? Or, for `wouldn't`, are you going to show `wouldn` and `t` separately? I think that showing them as a single word is correct. — Leo Dabus, Jun 01 '20 at 16:32
I was trying to say that splitting into words is a somewhat fuzzy task, @Leo. I tried to express my purpose (and approach) [here](https://stackoverflow.com/questions/30679564/number-of-words-in-a-swift-string-for-word-count-calculation/62140217#62140217). — Paul B, Jun 01 '20 at 20:25
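
The trade-offs discussed in these comments can be seen side by side in a small sketch. It reuses the sample sentence from the comment above; the third predicate (letters and digits count as word characters) is a hypothetical variant for illustration and does not appear in the answer.

```swift
// Sample sentence taken from the comment above.
let sample = "I need some 100% algorithm that could split at least 20 words"

// Splitting on every non-letter drops "100%" and "20" entirely.
let lettersOnly = sample.split { !$0.isLetter }

// Splitting on punctuation or whitespace keeps the numbers,
// because "%" is classified as punctuation.
let punctuationOrSpace = sample.split { $0.isPunctuation || $0.isWhitespace }

// Hypothetical variant: treat both letters and digits as word characters.
let lettersAndDigits = sample.split { !($0.isLetter || $0.isNumber) }

print(lettersOnly.count)        // 10
print(punctuationOrSpace.count) // 12
print(lettersAndDigits.count)   // 12
```

Note that the apostrophe is also punctuation, so the second and third predicates would still split "I'm" into "I" and "m".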

score 5 · Answer 2 · edited Jun 02 '20 at 13:31

let sentences = "Let there be light!"
let separatedCount = sentences.split(whereSeparator: { ",.! ".contains($0) }).count

print(separatedCount) // prints out 4 (if you just want the array, you can omit ".count")

If you have a specific set of punctuation characters you want to split on, you could use this code. It also fits if you prefer to use pure Swift only :).

edited Jun 02 '20 at 13:31 by Paul B · answered Jun 06 '15 at 07:14 by Eendje

While this might look better, performance-wise it's not as good as the answer Leo provided, although it shouldn't matter if the strings aren't astronomically long. – Eendje Jun 06 '15 at 08:04
no need to use a closure. just `split(whereSeparator: ",.! ".contains)` is enough – Leo Dabus Mar 10 '21 at 16:33
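
As a rough sketch of the point-free form suggested in the last comment, the separators can also be kept in a `Set<Character>` and its `contains` method passed directly as the predicate. The choice of a `Set` and the particular separator characters are assumptions for illustration, not part of the answer.

```swift
// Separator characters chosen for illustration; adjust them to your data.
let separators: Set<Character> = [",", ".", "!", " "]

let sentence = "Let there be light!"

// Set.contains is passed directly as the separator predicate, no closure needed.
// split omits empty subsequences by default, so adjacent separators are harmless.
let words = sentence.split(whereSeparator: separators.contains)

print(words)       // ["Let", "there", "be", "light"]
print(words.count) // 4
```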

MirekE · Answer 3 · 2015-06-06T20:34:58.513

score 2

You may want to try componentsSeparatedByCharactersInSet:

let s = "Let there be light"

let c = NSCharacterSet(charactersInString: " ,.")
let a = s.componentsSeparatedByCharactersInSet(c).filter({!$0.isEmpty})

// a = ["Let", "there", "be", "light"]

answered Jun 06 '15 at 06:32 by MirekE · edited Jun 06 '15 at 20:34

Thanks! This might work! And how do I add up all the numbers in the array? – Dreamjar Jun 06 '15 at 06:34
This would not work for a period followed by a space. It would create an extra empty string for each occurrence – Leo Dabus Jun 06 '15 at 07:05
Needs to be updated like this: `let c = CharacterSet(charactersIn: " ,."); let a = s.components(separatedBy: c).filter({!$0.isEmpty})`. Otherwise it won't work anymore. I tried to suggest the appropriate correction, but it was rejected. – Paul B Jun 02 '20 at 10:28
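
For the follow-up question about adding up the numbers, a minimal sketch using the current `CharacterSet` API might look like this. The input string and the variable names are hypothetical, and tokens that are not valid numbers are simply skipped, which may or may not be what an average calculator wants.

```swift
import Foundation

// Hypothetical input: numbers separated by spaces, commas, or both.
let input = "10, 20 30,40"

let separators = CharacterSet(charactersIn: " ,")

// components(separatedBy:) yields empty strings between adjacent separators,
// so they are filtered out before conversion.
let tokens = input.components(separatedBy: separators).filter { !$0.isEmpty }

// compactMap silently drops tokens that cannot be parsed as Double.
let numbers = tokens.compactMap { Double($0) }

let total = numbers.reduce(0, +)
// Guard against division by zero for empty input.
let average = numbers.isEmpty ? 0 : total / Double(numbers.count)

print(numbers.count) // 4
print(total)         // 100.0
print(average)       // 25.0
```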

score 1 · Answer 4 · answered Jul 16 '19 at 06:31

You can use a regular expression and an extension to simplify your code like this:

import Foundation

extension String {
    var wordCount: Int {
        let regex = try? NSRegularExpression(pattern: "\\w+")
        return regex?.numberOfMatches(in: self, range: NSRange(location: 0, length: self.utf16.count)) ?? 0
    }
}

let text = "I live in iran and i love Here"
print(text.wordCount) // 8
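
One thing to keep in mind with `\w+` is that it also matches digits and underscores, and it treats an apostrophe as a word break. The sketch below illustrates this with a hypothetical helper `matchCount(of:in:)` and a second pattern that keeps apostrophes inside a word; both are assumptions added for illustration, not part of the answer.

```swift
import Foundation

// Hypothetical helper: counts matches of `pattern` in `text`.
// Returns 0 if the pattern cannot be compiled.
func matchCount(of pattern: String, in text: String) -> Int {
    guard let regex = try? NSRegularExpression(pattern: pattern) else { return 0 }
    let range = NSRange(text.startIndex..., in: text)
    return regex.numberOfMatches(in: text, range: range)
}

let sample = "I'm 4 years_old"

// Plain \w+ splits at the apostrophe: "I", "m", "4", "years_old".
print(matchCount(of: "\\w+", in: sample))           // 4

// Allowing apostrophes between word characters: "I'm", "4", "years_old".
print(matchCount(of: "\\w+(?:'\\w+)*", in: sample)) // 3
```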

Paul B · Answer 5 · 2021-11-02T05:24:54.967

If you are targeting recent operating systems (such as iOS 13), there is no need to reinvent the wheel by counting words yourself. You can benefit from a powerful API dedicated to this purpose. It can split text into words for many languages you don't even know about, and it can classify parts of speech, show lemmas, detect script and more. Try this in a playground.

import NaturalLanguage
let taggerLexical = NLTagger(tagSchemes: [.lexicalClass, .lemma])
let txt = "I'm an architector ‍ by 90%. My family ‍‍‍ and I live in ."
taggerLexical.string = txt
let lexicalTags = NSCountedSet()
taggerLexical.enumerateTags(in: txt.startIndex..<txt.endIndex, unit: .word, scheme: .lexicalClass, options: [.omitPunctuation, .omitWhitespace]) { tag, tokenRange in
    if let tag = tag {
        lexicalTags.add(tag)
        let lemma = taggerLexical.tag(at: tokenRange.lowerBound, unit: .word, scheme: .lemma).0?.rawValue ?? ""
        let word = String(txt[tokenRange])
        print("\(word): \(tag.rawValue)\(word == lemma ? "" : " | Lemma: \(lemma) " )")
    }
    return true
}
let sortedLexicalTagCount = lexicalTags.allObjects.map({ (($0 as! NLTag), lexicalTags.count(for: $0))}).sorted(by: {$0.1 > $1.1})
print("Total word count: \(sortedLexicalTagCount.map({ $0.1}).reduce(0, +)) \nTotal word count without grapheme clusters: \(sortedLexicalTagCount.compactMap({ $0.0 == NLTag.otherWord ? nil : $0.1 }).reduce(0, +)) \nDetails: \(sortedLexicalTagCount.map {($0.0.rawValue, $0.1)})")

// Output:
I: Pronoun
'm: Verb | Lemma: be 
an: Determiner
architector: Adjective | Lemma:  
‍: OtherWord | Lemma:  
by: Preposition
90: Number | Lemma:  
My: Determiner | Lemma: I 
family: Noun
‍‍‍: OtherWord | Lemma:  
and: Conjunction
I: Pronoun
live: Verb
in: Preposition
: OtherWord | Lemma:  
Total word count: 15 
Total word count without grapheme clusters: 12 
Details: [("OtherWord", 3), ("Pronoun", 2), ("Determiner", 2), ("Verb", 2), ("Preposition", 2), ("Number", 1), ("Noun", 1), ("Conjunction", 1), ("Adjective", 1)]
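
If only the words themselves are needed, without lexical classes or lemmas, the same framework also offers `NLTokenizer`. The following is a minimal sketch under that assumption; the helper name `tokenizedWords(in:)` is hypothetical, the block only compiles where the NaturalLanguage framework is available, and the exact tokens depend on the system's language models, so no output is shown.

```swift
#if canImport(NaturalLanguage)
import NaturalLanguage

// Hypothetical helper: returns the word tokens NLTokenizer finds in `text`.
func tokenizedWords(in text: String) -> [Substring] {
    let tokenizer = NLTokenizer(unit: .word)
    tokenizer.string = text
    // tokens(for:) returns the ranges of all tokens in the given range.
    return tokenizer.tokens(for: text.startIndex..<text.endIndex).map { text[$0] }
}

let sampleText = "Let there be light"
print(tokenizedWords(in: sampleText).count)
#endif
```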

For older Apple operating systems, using the earlier linguisticTags API is an option.

import Foundation
let linguisticTags = txt.linguisticTags(in: txt.startIndex..., scheme: NSLinguisticTagScheme.tokenType.rawValue)
print("Total word count: \(linguisticTags.filter({ [NSLinguisticTag.word.rawValue, NSLinguisticTag.other.rawValue].contains($0) }).count)\nTotal word count without grapheme clusters: \(linguisticTags.filter({ [NSLinguisticTag.word.rawValue].contains($0) }).count)")
// Output:
Total word count: 15
Total word count without grapheme clusters: 12

Another option is to use NSRegularExpression. It knows how to match word boundaries (\\b), word (\\w) and non-word (\\W) symbols. Using .numberOfMatches(in: , range:..) is more efficient, since it returns only the number of matches and not the matches themselves. However, this approach has issues with strings that contain emoji.

extension String {
    private var regexMatchWords: NSRegularExpression? { try? NSRegularExpression(pattern: "\\w+") }
    var aproxWordCount: Int {
        guard let regex = regexMatchWords else { return 0 }
        return regex.numberOfMatches(in: self, range: NSRange(self.startIndex..., in: self))
    }
    var wordCount: Int {
        guard let regex = regexMatchWords else { return 0 }
        // Count only the matches whose NSRange maps back to a valid String range.
        return regex.matches(in: self, range: NSRange(self.startIndex..., in: self)).reduce(0) { (r, match) in
            r + (Range(match.range, in: self) == nil ? 0 : 1)
        }
    }
    var words: [String] {
        var w = [String]()
        guard let regex = regexMatchWords else { return [] }
        regex.enumerateMatches(in: self, range: NSRange(self.startIndex..., in: self)) { (match, _, _) in
            guard let match = match else { return }
            guard let range = Range(match.range, in: self) else { return }
            // self[range] is a Substring, so convert it before appending to [String].
            w.append(String(self[range]))
        }
        return w
    }
}
let text = "We're a family ‍‍‍ of 4. Next week we'll go to ."
print("Approximate word count: \(text.aproxWordCount)\nWord count: \(text.wordCount)\nWords:\(text.words)")
// Output:
Approximate word count: 15
Word count: 12
Words:["We", "re", "a", "family", "of", "4", "Next", "week", "we", "ll", "go", "to"]

score 0 · Answer 6 · edited Jun 02 '20 at 12:56

You may try some of these options:

let name = "some name with, space # inbetween -- and more"
let wordsSeparatedBySpaces = name.components(separatedBy: .whitespacesAndNewlines) // CharacterSet
let wordsSeparatedByPunctuations = name.components(separatedBy: .punctuationCharacters) // CharacterSet
// can also be separated by some string
let wordsSeparatedByHashChar = name.components(separatedBy: "#") // String protocol
let wordsSeparatedByComma = name.components(separatedBy: ",") // String protocol
let wordsSeparatedBySomeString = name.components(separatedBy: " -- ") // String protocol

let total = wordsSeparatedBySpaces.count + wordsSeparatedByPunctuations.count + wordsSeparatedByHashChar.count + wordsSeparatedByComma.count
print("Total number of separators = \(total)")

That's not helping! Adding up the separators? 13 is *not* the answer he's looking for... — Grimxn, Jun 06 '15 at 06:42

score 0 · Answer 7 · edited Nov 25 '19 at 12:35

This works for me:

let spaces = CharacterSet.whitespacesAndNewlines.union(.punctuationCharacters)

let words = YourString.components(separatedBy: spaces)

if words.count > 8 { return 110 } else { return 90 }
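
One caveat with this approach: `components(separatedBy:)` keeps an empty string wherever two separators are adjacent, so the raw count can be higher than the number of words. A small sketch of the effect follows; the sample text is hypothetical, and filtering out the empty strings is one possible fix rather than the answer's own code.

```swift
import Foundation

let separators = CharacterSet.whitespacesAndNewlines.union(.punctuationCharacters)
let text = "Hello, world"

// The comma and the space are adjacent separators,
// so an empty string appears between them.
let raw = text.components(separatedBy: separators)

// Dropping the empty strings gives the actual words.
let words = raw.filter { !$0.isEmpty }

print(raw)         // ["Hello", "", "world"]
print(words.count) // 2
```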

edited Nov 25 '19 at 12:35 by Mohammad Reza Shahrestani · answered Nov 25 '19 at 11:39 by gihan kosala
