This is slightly different from what you asked for, but depending upon why you’re doing this, you might consider the NaturalLanguage framework. E.g.:
import NaturalLanguage
let text = "I am trying to create an array from a sentence. But I don't understand, Why isn't the last fullstop removed, from the last word."
var words: [String] = []
let tagger = NLTagger(tagSchemes: [.lexicalClass])
tagger.string = text
let options: NLTagger.Options = [.omitWhitespace, .joinContractions]
tagger.enumerateTags(in: text.startIndex..<text.endIndex, unit: .word, scheme: .lexicalClass, options: options) { tag, range in
    // Only keep tokens that the tagger was able to classify
    if tag != nil {
        words.append(String(text[range]))
    }
    return true // continue enumerating
}
print(words)
["I", "am", "trying", "to", "create", "an", "array", "from", "a", "sentence", ".", "But", "I", "don\'t", "understand", ",", "Why", "isn\'t", "the", "last", "fullstop", "removed", ",", "from", "the", "last", "word", "."]
What’s interesting about this is that the tag property will tell you the parts of speech, what’s a sentence terminator, etc., e.g.:
tagger.enumerateTags(in: text.startIndex..<text.endIndex, unit: .word, scheme: .lexicalClass, options: options) { tag, range in
    if let tag = tag {
        print(text[range], tag.rawValue)
    }
    return true
}
Producing:
I Pronoun
am Verb
trying Verb
to Particle
create Verb
an Determiner
array Noun
from Preposition
a Determiner
sentence Noun
. SentenceTerminator
But Conjunction
I Pronoun
don't Verb
understand Verb
, Punctuation
Why Pronoun
isn't Verb
the Determiner
last Adjective
fullstop Noun
removed Verb
, Punctuation
from Preposition
the Determiner
last Adjective
word Noun
. SentenceTerminator
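Because each token comes with a tag, you can also filter on it. As a minimal sketch (the `nouns` array and the shortened sample sentence are just for illustration, and the exact tags depend on the language model on your OS version), this keeps only the tokens tagged as nouns:

```swift
import NaturalLanguage

// Shortened sample text, assumed here for illustration
let text = "I am trying to create an array from a sentence."

let tagger = NLTagger(tagSchemes: [.lexicalClass])
tagger.string = text

let options: NLTagger.Options = [.omitWhitespace, .joinContractions]
var nouns: [String] = []

tagger.enumerateTags(in: text.startIndex..<text.endIndex, unit: .word, scheme: .lexicalClass, options: options) { tag, range in
    // Keep the token only if the tagger classified it as a noun
    if tag == NLTag.noun {
        nouns.append(String(text[range]))
    }
    return true // continue enumerating
}

print(nouns)
```

Given the tags listed above, I would expect this to pick up words like "array" and "sentence", but the classification is ultimately up to the tagger.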
Or, perhaps you don’t really care about the punctuation and simply want to have this broken up into sentences and the sentences broken up into words:
var sentences: [[String]] = []
let sentenceTokenizer = NLTokenizer(unit: .sentence)
sentenceTokenizer.string = text
sentenceTokenizer.enumerateTokens(in: text.startIndex ..< text.endIndex) { range, _ in
    let sentence = String(text[range])
    let wordTokenizer = NLTokenizer(unit: .word)
    wordTokenizer.string = sentence
    let words = wordTokenizer.tokens(for: sentence.startIndex ..< sentence.endIndex)
        .map { String(sentence[$0]) }
    sentences.append(words)
    return true
}
print(sentences)
[
    ["I", "am", "trying", "to", "create", "an", "array", "from", "a", "sentence"],
    ["But", "I", "don\'t", "understand", "Why", "isn\'t", "the", "last", "fullstop", "removed", "from", "the", "last", "word"]
]
There are lots of options here between NLTagger and NLTokenizer. Depending upon what problem you’re really trying to solve, these might be better than manipulating strings yourself.
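For instance, if all you want is a flat array of words with no punctuation, NLTagger has an `.omitPunctuation` option. A minimal sketch, assuming a shortened sample sentence (I have left out the expected output, since tokenization details are up to the framework):

```swift
import NaturalLanguage

// Shortened sample text, assumed here for illustration
let text = "I am trying to create an array from a sentence. But I don't understand, why?"

let tagger = NLTagger(tagSchemes: [.lexicalClass])
tagger.string = text

// .omitPunctuation asks the tagger to skip punctuation tokens entirely
let options: NLTagger.Options = [.omitWhitespace, .omitPunctuation, .joinContractions]
var words: [String] = []

tagger.enumerateTags(in: text.startIndex..<text.endIndex, unit: .word, scheme: .lexicalClass, options: options) { _, range in
    words.append(String(text[range]))
    return true // continue enumerating
}

print(words)
```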
As Sultan said, you can obviously just insert spaces and then split the string, though I might suggest adding other punctuation symbols and including + to match one or more characters in the case of consecutive punctuation marks (notably ellipses, ...), e.g.:
let words = text.replacingOccurrences(of: "[,.:;!?]+", with: " $0", options: .regularExpression)
    .split(separator: " ")
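To see what the + buys you, here is a small sketch that runs the same replacement with and without it. The `tokenize` helper and the sample string are hypothetical, introduced only for this comparison:

```swift
import Foundation

// Hypothetical helper: put a space before each regex match, then split on spaces
func tokenize(_ text: String, pattern: String) -> [String] {
    return text
        .replacingOccurrences(of: pattern, with: " $0", options: .regularExpression)
        .split(separator: " ")
        .map(String.init)
}

let sample = "Wait... what?! Fine, then."

// With +, a run of consecutive punctuation marks stays together as one token
let grouped = tokenize(sample, pattern: "[,.:;!?]+")
print(grouped)
// ["Wait", "...", "what", "?!", "Fine", ",", "then", "."]

// Without +, every punctuation character becomes its own token
let separate = tokenize(sample, pattern: "[,.:;!?]")
print(separate)
// ["Wait", ".", ".", ".", "what", "?", "!", "Fine", ",", "then", "."]
```

If you would rather drop the punctuation than keep it as separate tokens, replacing with a plain " " instead of " $0" should do that.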