
In a mobile app I use an API that can only handle about 300 words. How can I trim a string in Swift so that it doesn't contain more words than that?

The native .trimmingCharacters(in: CharacterSet) does not seem to be able to do this, as it is intended to trim certain characters.

  • Words or characters? For characters, you can use `prefix(_:)` (https://developer.apple.com/documentation/swift/string/2894830-prefix). For words, you could split the string into words (separated by what, a space?), call `prefix(_:)` on the result, and recompose your string. – Larme Jul 23 '21 at 11:36

1 Answer

There is no off-the-shelf way to limit the number of words in a string.

If you look at this post, it documents using the method enumerateSubstrings(in: Range) with the option .byWords. The method does not return anything itself; it calls a closure once for each word and passes in that word's range, which you can collect into an array of Range values.

You could use that to create an extension on String that would return the first X words of that string:

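import Foundation

extension String {
    /// Returns the prefix of the string that ends after its first `wordCount` words.
    func firstXWords(_ wordCount: Int) -> Substring {
        // A non-positive count would otherwise index ranges[-1] and crash.
        guard wordCount > 0 else { return self[startIndex..<startIndex] }
        // Collect the range of every word in the string.
        var ranges: [Range<String.Index>] = []
        self.enumerateSubstrings(in: self.startIndex..., options: .byWords) { _, range, _, _ in
            ranges.append(range)
        }
        if ranges.count >= wordCount {
            // Cut at the end of the last word we want to keep.
            return self[self.startIndex..<ranges[wordCount - 1].upperBound]
        } else {
            // Fewer words than requested: return the whole string.
            return self[self.startIndex..<self.endIndex]
        }
    }
}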

If we then run the code:

let sentence = "I want to an algorithm that could help find out how many words are there in a string separated by space or comma or some character. And then append each word separated by a character to an array which could be added up later I'm making an average calculator so I want the total count of data and then add up all the words. By words I mean the numbers separated by a character, preferably space Thanks in advance"

print(sentence.firstXWords(10))

The output is:

I want to an algorithm that could help find out

Using enumerateSubstrings(in: Range) is going to give much better results than splitting your string on spaces, since normal text contains many more separators than just spaces (newlines, commas, colons, em spaces, etc.). It will also work for languages like Japanese and Chinese that often don't have spaces between words.
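To illustrate the difference, here is a minimal sketch comparing the two approaches on a short sample string. The helper name `wordsByEnumeration` and the sample text are made up for this illustration and are not part of the answer's code.

```swift
import Foundation

// Hypothetical helper: collects the words that .byWords enumeration reports.
func wordsByEnumeration(_ text: String) -> [String] {
    var words: [String] = []
    text.enumerateSubstrings(in: text.startIndex..., options: .byWords) { word, _, _, _ in
        if let word = word { words.append(word) }
    }
    return words
}

// Sample text with a comma, a newline and a colon between words.
let sample = "one, two\nthree: four"

// Splitting on spaces keeps punctuation attached and misses the newline.
let bySpaces = sample.split(separator: " ").map { String($0) }

// Word enumeration reports the words themselves, without separators.
let byWords = wordsByEnumeration(sample)

print(bySpaces.count) // 3
print(byWords)        // ["one", "two", "three", "four"]
```

Splitting on spaces finds only three pieces here because "two" and "three:" are separated by a newline rather than a space.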

The function above enumerates every word in the string even when you only need the first few, so it should run in O(n) time (I haven't dug deeply enough to be sure of that). Stopping the enumeration as soon as the desired number of words is reached would make it significantly faster when you want a small fraction of the words in a very long string. I couldn't figure out how to terminate enumerateSubstrings() early, although I didn't try that hard.

Leo Dabus provided an improved version of my function. It extends StringProtocol rather than String, which means it can work on substrings. Plus, it stops once it hits your desired word count, so it will be much faster for finding the first few words of very long strings:

extension StringProtocol {
    func firstXWords(_ n: Int) -> SubSequence {
        var endIndex = self.endIndex
        var words = 0
        enumerateSubstrings(in: startIndex..., options: .byWords) { _, range, _, stop in
            words += 1
            if words == n {
                stop = true
                endIndex = range.upperBound
            }
        }
        return self[..<endIndex]
    }
}
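As a usage sketch for the original problem (an API with a word limit), the variant below shows the same early-exit technique called on both a String and a Substring. The name `truncated(toWordLimit:)` and the sample text are hypothetical and not from the answer. It also assumes that a limit of zero or less should yield an empty result, whereas the version above would return the whole string in that case.

```swift
import Foundation

extension StringProtocol {
    // Hypothetical helper for illustration; same idea as firstXWords(_:) above.
    func truncated(toWordLimit limit: Int) -> SubSequence {
        // Assumption: a non-positive limit means "keep no words".
        guard limit > 0 else { return self[..<startIndex] }
        var end = endIndex
        var count = 0
        enumerateSubstrings(in: startIndex..., options: .byWords) { _, range, _, stop in
            count += 1
            if count == limit {
                end = range.upperBound // cut right after the last kept word
                stop = true            // no need to look at the remaining words
            }
        }
        return self[..<end]
    }
}

let text = "The quick brown fox jumps over the lazy dog"
let firstThree = text.truncated(toWordLimit: 3)      // called on a String
let firstTwo = firstThree.truncated(toWordLimit: 2)  // called on a Substring
print(firstThree) // The quick brown
print(firstTwo)   // The quick
```

For the 300-word API, the call would be `text.truncated(toWordLimit: 300)`; wrap the result in `String(...)` if the API needs a String rather than a Substring.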
Duncan C
  • You are using the sentence startIndex instead of the string index. Note that there is no need to continue enumerating the words once you reach the desired number of words. You can simply set the stop (4th) parameter which you are ignoring to true. You should also extend StringProtocol instead to support substrings as well – Leo Dabus Jul 23 '21 at 17:38
  • `extension StringProtocol {` `func firstWords(limitedTo n: Int) -> SubSequence {` `var endIndex = self.endIndex` `var words = 0` `enumerateSubstrings(in: startIndex..., options: .byWords) { _, range, _, stop in` `words += 1` `if words == n {` `stop = true` `endIndex = range.upperBound` `}` `}` `return self[.. – Leo Dabus Jul 23 '21 at 17:47
  • Leo, thanks for the feedback. I figured there was a way to terminate the evaluation, but that was a quick answer and I didn't dig that deeply. I also wasn't able to figure out how to return StringProtocol rather than String. Extending StringProtocol makes sense. Do you want to submit your own answer? – Duncan C Jul 23 '21 at 20:21
  • Good catch on using sentence.startIndex. (Fixed.) I started out writing one-off code, and didn't review the edits to convert it to an extension carefully enough. – Duncan C Jul 23 '21 at 20:23
  • @LeoDabus if you extend StringProtocol, how do you call the function? When I try to use your version it says "value of type 'String' has no member 'firstXWords'". – Duncan C Jul 24 '21 at 17:10
  • Which function? It all depends which function you need. Most of them are declared on `RangeReplaceableCollection`. So if that's the case change your extension declaration to `extension StringProtocol where Self: RangeReplaceableCollection`. Note that I have named my method `firstWords(limitedTo:)` if you are just trying to call it. – Leo Dabus Jul 24 '21 at 17:33
  • Oh, never mind. I tried to use your version and had a typo. – Duncan C Jul 24 '21 at 18:41
  • @LeoDabus, how did you figure out the parameters to the closure passed to `enumerateSubstrings()`? The docs are... nonexistent. – Duncan C Jul 25 '21 at 17:26
  • option click the method – Leo Dabus Jul 25 '21 at 17:30
  • `body` The closure executed for each substring in the enumeration. The closure takes four arguments: **• The enumerated substring. If substringNotRequired is included in opts, this parameter is nil for every execution of the closure. • The range of the enumerated substring in the string that enumerate(in:options:_:) was called on.** – Leo Dabus Jul 25 '21 at 17:30
  • **• The range that includes the substring as well as any separator or filler characters that follow. For instance, for lines, enclosingRange contains the line terminators. The enclosing range for the first string enumerated also contains any characters that occur before the string. Consecutive enclosing ranges are guaranteed not to overlap, and every single character in the enumerated range is included in one and only one enclosing range. • An inout Boolean value that the closure can use to stop the enumeration by setting stop = true.** – Leo Dabus Jul 25 '21 at 17:30
  • When I option-click the method, all it says is `func enumerateSubstrings<R>(in range: R, options opts: String.EnumerationOptions = [], _ body: @escaping (String?, Range<String.Index>, Range<String.Index>, inout Bool) -> Void) where R : RangeExpression, R.Bound == String.Index` and "no overview available." – Duncan C Jul 25 '21 at 19:56
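The four closure arguments described in the comments above can be seen in a small sketch like the following. The parameter names, the sample text, and the `seen` array are chosen for this illustration only.

```swift
import Foundation

let text = "Hello, world! Bye."

// Each entry records the word and the text covered by its enclosing range.
var seen: [(word: String, enclosing: String)] = []

text.enumerateSubstrings(in: text.startIndex..., options: .byWords) {
    substring, substringRange, enclosingRange, stop in
    // substring: the word itself (nil if .substringNotRequired is passed).
    // substringRange: where that word sits in `text`.
    // enclosingRange: the word plus adjacent separator characters.
    let word = substring ?? String(text[substringRange])
    seen.append((word: word, enclosing: String(text[enclosingRange])))
    // stop: set to true to end the enumeration early, here after two words.
    if seen.count == 2 { stop = true }
}

for entry in seen {
    print("word: \(entry.word), enclosing: \"\(entry.enclosing)\"")
}
```

Because `stop` is set after the second word, the third word in the sample is never visited.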