I have multiple strings in several languages (English, Italian, Arabic, French, etc.). For each string, I want to find the words that contain characters outside that language's alphabet.

For example for English:

"thisŞĞstring" -> return false

"corect string format" -> return true

For example for Arabic:

"كلمةabc" -> return false

"كلمة" -> return true

I don't want to enter the alphabet of all languages one by one. Is there a way to do what I want?

Joakim Danielson
ursan526

1 Answer

It is not quite what you’re looking for, but a regular expression can find letters that do not belong to a particular script, e.g.:

import Foundation

let string = "he said こんにちは"
// Match runs of letters that are not in the Latin script (`--` is set difference).
let regex = try NSRegularExpression(pattern: #"[\p{Letter}--\p{script=latin}]+"#)
if
    let match = regex.firstMatch(in: string, options: [], range: NSRange(string.startIndex..., in: string)),
    let range = Range(match.range, in: string)
{
    print(string[range])  // こんにちは
}

Or if you use [\p{Letter}--\p{script=arabic}]+ with “كلمةabc”, it will return “abc”.
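To get closer to the true/false checks in the question, the same pattern can be wrapped in a small helper. This is only a sketch: the names lettersOutside(script:in:) and usesOnly(script:in:) are hypothetical, and the script argument is assumed to be a Unicode script name that the regex engine accepts, such as "latin" or "arabic".

```swift
import Foundation

/// Returns every run of letters in `text` that is outside `script`.
/// `script` is assumed to be a Unicode script name, e.g. "latin" or "arabic".
/// Throws if the resulting pattern is invalid (for example, an unknown script name).
func lettersOutside(script: String, in text: String) throws -> [String] {
    // Set difference: all letters minus the letters of the given script.
    let pattern = "[\\p{Letter}--\\p{script=\(script)}]+"
    let regex = try NSRegularExpression(pattern: pattern)
    let fullRange = NSRange(text.startIndex..., in: text)
    return regex.matches(in: text, options: [], range: fullRange).compactMap { match in
        Range(match.range, in: text).map { String(text[$0]) }
    }
}

/// True when no letter in `text` comes from another script.
/// Spaces, digits and punctuation are ignored because they are not letters.
func usesOnly(script: String, in text: String) throws -> Bool {
    try lettersOutside(script: script, in: text).isEmpty
}

print(try lettersOutside(script: "arabic", in: "كلمةabc"))
print(try usesOnly(script: "arabic", in: "كلمة"))
```

Note that this checks scripts, not language alphabets: Ş and Ğ are Latin-script letters, so "thisŞĞstring" still passes a "latin" check. Rejecting them for English would need a narrower character set than the script property provides.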


Again, likely not quite what you are looking for, but you can use the NaturalLanguage framework to parse text:

import NaturalLanguage

let text = "he said こんにちは"

let tagger = NLTagger(tagSchemes: [.language, .script])
tagger.string = text
let range = text.startIndex..<text.endIndex
let options: NLTagger.Options = [.omitWhitespace, .joinContractions]
tagger.enumerateTags(in: range, unit: .word, scheme: .language, options: options) { tag, range in
    guard let tag = tag else { return true }
    
    print(tag, String(text[range]))
    return true
}

Returning:

NLTag(_rawValue: en) he
NLTag(_rawValue: en) said
NLTag(_rawValue: ja) こんにちは

Or, if you pass .script as the scheme in enumerateTags:

NLTag(_rawValue: Latn) he
NLTag(_rawValue: Latn) said
NLTag(_rawValue: Jpan) こんにちは
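
The script tags can also be filtered to list the words that are not in an expected script. The following is a rough sketch, not a tested recipe: the function name words(notIn:in:) is hypothetical, the expected script is assumed to be given as the same four-letter code that appears in the tags above (e.g. "Latn"), and it only builds on Apple platforms where NaturalLanguage is available.

```swift
import NaturalLanguage

/// Returns the words whose detected script differs from `expectedScript`.
/// `expectedScript` is assumed to use the same codes as the tags above, e.g. "Latn".
func words(notIn expectedScript: String, in text: String) -> [String] {
    let tagger = NLTagger(tagSchemes: [.script])
    tagger.string = text
    let options: NLTagger.Options = [.omitWhitespace, .omitPunctuation]
    let tags = tagger.tags(in: text.startIndex..<text.endIndex,
                           unit: .word,
                           scheme: .script,
                           options: options)
    // Keep only the words that were tagged with some other script.
    return tags.compactMap { tag, range -> String? in
        guard let tag = tag, tag.rawValue != expectedScript else { return nil }
        return String(text[range])
    }
}

print(words(notIn: "Latn", in: "he said こんにちは"))
```

Given the tags listed above, this should leave only こんにちは. How the tagger splits a single word that mixes scripts, such as “كلمةabc”, is not something this sketch controls, so the regex approach may be the safer choice for that case.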
Rob