Find non-alphabet words in any language with Swift

Question

I have strings in several languages (English, Italian, Arabic, French, etc.). For each string, I want to detect words that contain characters outside that language's alphabet.

For example, for English:

"thisŞĞstring" -> return false

"corect string format" -> return true

For example, for Arabic:

"كلمةabc" -> return false

"كلمة" -> return true

I don't want to enter the alphabet of every language by hand. Is there a way to do this?

You probably want to start with a Unicode character database, which you can get from the Unicode web site. - matt, Oct 03 '21 at 14:41
Get an array of each language's alphabet, then compare it with your string. - Yogesh Patel, Oct 03 '21 at 14:53
How about for English, “His name is José” or, “She provided her résumé.” True or false? - Rob, Oct 03 '21 at 15:26
Create a list of alphabetic letters and use it with `NSCharacterSet`. - El Tomato, Oct 03 '21 at 22:13

Rob · Answer 1 · 2021-10-03T16:11:15.973

It is not quite what you’re looking for, but a regular expression can find letters that do not belong to a particular script, e.g.:

import Foundation

let string = "he said こんにちは"
// ICU set difference: any run of letters that are not in the Latin script.
let regex = try NSRegularExpression(pattern: #"[\p{Letter}--\p{script=latin}]+"#)
if
    let match = regex.firstMatch(in: string, options: [], range: NSRange(string.startIndex..., in: string)),
    let range = Range(match.range, in: string)
{
    print(string[range])  // こんにちは
}

Or if you use [\p{Letter}--\p{script=arabic}]+ with “كلمةabc”, it will return “abc”.
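
To get the true/false result the question asks for, the same pattern could be wrapped in a small helper. This is only a sketch: `lettersAllBelong(to:in:)` is a hypothetical name, not an existing API, and it assumes the script name is passed as a string that ICU recognizes (such as "latin" or "arabic"). It checks scripts, not language alphabets, so "thisŞĞstring" passes the Latin check because Ş and Ğ are Latin-script letters.

```swift
import Foundation

// Hypothetical helper (not part of Foundation): returns true when every
// letter in `text` belongs to the given Unicode script.
func lettersAllBelong(to script: String, in text: String) -> Bool {
    // Same idea as above: match any run of letters outside the script.
    let pattern = "[\\p{Letter}--\\p{script=\(script)}]+"
    guard let regex = try? NSRegularExpression(pattern: pattern) else {
        return false  // assumption: an unrecognized script name counts as failure
    }
    let fullRange = NSRange(text.startIndex..., in: text)
    // No match means no letter falls outside the script.
    return regex.firstMatch(in: text, options: [], range: fullRange) == nil
}

print(lettersAllBelong(to: "arabic", in: "كلمة"))         // true
print(lettersAllBelong(to: "arabic", in: "كلمةabc"))      // false
print(lettersAllBelong(to: "latin", in: "thisŞĞstring"))  // true: Ş and Ğ are Latin script
```

Digits, spaces, and punctuation are ignored because only `\p{Letter}` characters are tested. Rejecting Ş and Ğ for English specifically would still need a per-language character list.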

Again, this is likely not quite what you are looking for, but you can use the NaturalLanguage framework to parse text:

import NaturalLanguage

let text = "he said こんにちは"

// Request both schemes up front so either can be used when enumerating.
let tagger = NLTagger(tagSchemes: [.language, .script])
tagger.string = text
let range = text.startIndex..<text.endIndex
let options: NLTagger.Options = [.omitWhitespace, .joinContractions]
// Visit each word and print the language tag the tagger assigned to it.
tagger.enumerateTags(in: range, unit: .word, scheme: .language, options: options) { tag, range in
    guard let tag = tag else { return true }  // skip untagged tokens, keep going

    print(tag, String(text[range]))
    return true  // continue enumerating
}

This prints:

NLTag(_rawValue: en) he
NLTag(_rawValue: en) said
NLTag(_rawValue: ja) こんにちは

Or, if you pass .script as the scheme to enumerateTags:

NLTag(_rawValue: Latn) he
NLTag(_rawValue: Latn) said
NLTag(_rawValue: Jpan) こんにちは
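
Building on the script tags above, the words whose script differs from an expected one could be collected into a list. This is a rough sketch for Apple platforms only: `wordsOutsideScript(_:in:)` is a hypothetical helper, and it assumes the expected script is given as the raw tag value the tagger reports (for example "Latn").

```swift
import NaturalLanguage

// Hypothetical helper: collects the words whose script tag differs from
// `expected`, e.g. "Latn" for Latin-script text.
func wordsOutsideScript(_ expected: String, in text: String) -> [String] {
    let tagger = NLTagger(tagSchemes: [.script])
    tagger.string = text
    var outsiders: [String] = []
    let options: NLTagger.Options = [.omitWhitespace, .omitPunctuation]
    tagger.enumerateTags(in: text.startIndex..<text.endIndex,
                         unit: .word,
                         scheme: .script,
                         options: options) { tag, range in
        // Keep only words that were tagged with some other script.
        if let tag = tag, tag.rawValue != expected {
            outsiders.append(String(text[range]))
        }
        return true  // continue enumerating
    }
    return outsiders
}

print(wordsOutsideScript("Latn", in: "he said こんにちは"))
```

Given the tags shown above, this should report only the Japanese word for the sample text, though the tagger's tokenization and tagging can vary with the input.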
