This is not quite what you’re looking for, but a regular expression can match letters that do not belong to a particular script, e.g.:
import Foundation

let string = "he said こんにちは"
// Set difference: any letter that is not in the Latin script
let regex = try NSRegularExpression(pattern: #"[\p{Letter}--\p{script=latin}]+"#)
if
    let match = regex.firstMatch(in: string, options: [], range: NSRange(string.startIndex..., in: string)),
    let range = Range(match.range, in: string)
{
    print(string[range]) // こんにちは
}
Or if you use [\p{Letter}--\p{script=arabic}]+ with “كلمةabc”, it will return “abc”.
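The snippet above only returns the first match. As a minimal sketch, assuming you want every run of letters outside a given script, the same pattern can be wrapped in a helper that uses matches(in:options:range:). The function name runs(in:excludingScript:) and the sample text are hypothetical, not part of any framework:

```swift
import Foundation

/// Hypothetical helper: returns every run of letters in `text` that is not in `script`.
/// `script` is interpolated into the pattern, so pass a Unicode script name such as "latin".
func runs(in text: String, excludingScript script: String) throws -> [String] {
    let pattern = "[\\p{Letter}--\\p{script=\(script)}]+"
    let regex = try NSRegularExpression(pattern: pattern)
    let fullRange = NSRange(text.startIndex..., in: text)
    // Convert each NSRange back to a Swift range before slicing the string
    return regex.matches(in: text, options: [], range: fullRange).compactMap { match in
        Range(match.range, in: text).map { String(text[$0]) }
    }
}

let found = try runs(in: "he said こんにちは and 你好", excludingScript: "latin")
print(found)
```

Because the script name is inserted into the pattern unescaped, this sketch is only suitable for trusted, hard-coded script names.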
Again, this is likely not quite what you are looking for, but you can use the NaturalLanguage framework to parse text:
import NaturalLanguage

let text = "he said こんにちは"
let tagger = NLTagger(tagSchemes: [.language, .script])
tagger.string = text
let range = text.startIndex..<text.endIndex
let options: NLTagger.Options = [.omitWhitespace, .joinContractions]
tagger.enumerateTags(in: range, unit: .word, scheme: .language, options: options) { tag, range in
    guard let tag = tag else { return true }
    print(tag, String(text[range]))
    return true
}
This prints:
NLTag(_rawValue: en) he
NLTag(_rawValue: en) said
NLTag(_rawValue: ja) こんにちは
Or if you use .script as the scheme in enumerateTags:
NLTag(_rawValue: Latn) he
NLTag(_rawValue: Latn) said
NLTag(_rawValue: Jpan) こんにちは
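If the goal is to pull out the words that are not in a given script, the .script tags can be filtered rather than printed. This is only a sketch under that assumption; the function name words(in:notInScript:) is hypothetical, and the script codes are the raw values shown in the output above (e.g. "Latn"):

```swift
import NaturalLanguage

/// Hypothetical helper: returns the words in `text` whose script tag differs from `scriptCode`.
func words(in text: String, notInScript scriptCode: String) -> [String] {
    let tagger = NLTagger(tagSchemes: [.script])
    tagger.string = text
    var result: [String] = []
    let options: NLTagger.Options = [.omitWhitespace, .omitPunctuation]
    tagger.enumerateTags(in: text.startIndex..<text.endIndex, unit: .word, scheme: .script, options: options) { tag, range in
        // Keep only words that were tagged with some other script
        if let tag = tag, tag.rawValue != scriptCode {
            result.append(String(text[range]))
        }
        return true
    }
    return result
}

let nonLatin = words(in: "he said こんにちは", notInScript: "Latn")
print(nonLatin)
```

The tagger's results depend on Apple's language models, so treat the tags as a best guess rather than a strict Unicode script classification.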