
I need to detect the language of a string read from a PDF file. The text is in English, but "NLLanguageRecognizer" reports it as Romanian.

The function I am using is:

class func detectedLangaugeFormat(for string: String) -> String {
    if #available(iOS 12.0, *) {
        let recognizer = NLLanguageRecognizer()
        recognizer.processString(string)
        guard let languageCode = recognizer.dominantLanguage?.rawValue else { return "rtl" }
        let detectedLangauge = Locale.current.localizedString(forIdentifier: languageCode)
        print("lan")
        let currentLocale = NSLocale.current as NSLocale
        let direction: NSLocale.LanguageDirection = NSLocale.characterDirection(forLanguage: languageCode)
        if direction == .rightToLeft {
            return "rtl"
        } else if direction == .leftToRight {
            return "ltr"
        }
    } else {
        // Fallback on earlier versions
    }

    return "rtl"
}

The string passed to this method is:

"\r\n                A Simple PDF File \r\n                   This is a small demonstration .pdf file - \r\n                   just for use in the Virtual Mechanics tutorials. More text. And more \r\n                   text. And more text. And more text. And more text. \r\n                   And more text. And more text. And more text. And more text. And more \r\n                   text. And more text. Boring, zzzzz. And more text. And more text. And \r\n                   more text. And more text. And more text. And more text. And more text. \r\n                   And more text. And more text. \r\n                   And more text. And more text. And more text. And more text. And more \r\n                   text. And more text. And more text. Even more. Continued on page 2 ...\r\n                Simple PDF File 2 \r\n                   ...continued from page 1. Yet more text. And more text. And more text. \r\n                   And more text. And more text. And more text. And more text. And more \r\n                   text. Oh, how boring typing this stuff. But not as boring as watching \r\n                   paint dry. And more text. And more text. And more text. And more text. \r\n                   Boring.  More, a little more text. The end, and just as well. "
Awais Mobeen

3 Answers


One possible solution is to collapse each run of multiple spaces in the string into a single space.

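// Match runs of two or more spaces.
let regex = try? NSRegularExpression(pattern: "  +", options: [])
// NSRange counts UTF-16 units, so build it from the string's indices rather than from str.count.
str = regex?.stringByReplacingMatches(in: str, options: [], range: NSRange(str.startIndex..., in: str), withTemplate: " ") ?? str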

I tried your string with this regex and it worked: the language recognizer returned the "en" language code.
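As an alternative to the regex, here is a minimal sketch that collapses every run of whitespace and newlines (including the "\r\n" sequences in the PDF text) into a single space, using only Foundation. The helper name normalizedWhitespace is hypothetical and not part of any Apple API, and I have not checked how the recognizer responds to this variant.

```swift
import Foundation

/// Hypothetical helper (not from the original post): collapses each run of
/// whitespace and newlines into one space and drops leading/trailing whitespace.
func normalizedWhitespace(_ text: String) -> String {
    return text
        .components(separatedBy: .whitespacesAndNewlines) // split on every whitespace scalar
        .filter { !$0.isEmpty }                           // discard the empty pieces between adjacent separators
        .joined(separator: " ")
}

let sample = "\r\n                A Simple PDF File \r\n                   This is a small demonstration .pdf file - "
let cleaned = normalizedWhitespace(sample)
// cleaned == "A Simple PDF File This is a small demonstration .pdf file -"
```

The cleaned string can then be passed to detectedLangaugeFormat(for:) in place of the raw PDF text.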

s3cretshadow

For some reason, whitespace and newlines make the result of processString(_:) unreliable. You should trim them from both ends of the string before passing it to your method:

let givenString = "\r\n                A Simple PDF File \r\n                   This is a small demonstration .pdf file - \r\n                   just for use in the Virtual Mechanics tutorials. More text. And more \r\n                   text. And more text. And more text. And more text. \r\n                   And more text. And more text. And more text. And more text. And more \r\n                   text. And more text. Boring, zzzzz. And more text. And more text. And \r\n                   more text. And more text. And more text. And more text. And more text. \r\n                   And more text. And more text. \r\n                   And more text. And more text. And more text. And more text. And more \r\n                   text. And more text. And more text. Even more. Continued on page 2 ...\r\n                Simple PDF File 2 \r\n                   ...continued from page 1. Yet more text. And more text. And more text. \r\n                   And more text. And more text. And more text. And more text. And more \r\n                   text. Oh, how boring typing this stuff. But not as boring as watching \r\n                   paint dry. And more text. And more text. And more text. And more text. \r\n                   Boring.  More, a little more text. The end, and just as well. "
let trimmedString = givenString.trimmingCharacters(in: .whitespacesAndNewlines)

let result = detectedLangaugeFormat(for: trimmedString)
print(result) // ltr

At this point, it should be recognized as English (if you print detectedLangauge inside your method instead of "lan", you will see "English").

let detectedLangauge = Locale.current.localizedString(forIdentifier: languageCode)
print(detectedLangauge) // Optional("English")
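To check how much the cleanup actually changes the result, it can help to look at more than the single dominant language. The sketch below uses NLLanguageRecognizer's languageHypotheses(withMaximum:) to list the top candidates with their probabilities. The function name languageCandidates is hypothetical, and the probabilities you get for the PDF text are not something I have verified.

```swift
#if canImport(NaturalLanguage)
import NaturalLanguage

/// Hypothetical diagnostic helper (not from the original post): returns the
/// recognizer's top candidate language codes, most probable first.
@available(iOS 12.0, macOS 10.14, tvOS 12.0, watchOS 5.0, *)
func languageCandidates(for text: String, maximum: Int = 3) -> [(code: String, probability: Double)] {
    let recognizer = NLLanguageRecognizer()
    recognizer.processString(text)
    // languageHypotheses(withMaximum:) returns a dictionary of NLLanguage to probability.
    return recognizer.languageHypotheses(withMaximum: maximum)
        .sorted { $0.value > $1.value }
        .map { (code: $0.key.rawValue, probability: $0.value) }
}
#endif
```

Calling it once with the raw string and once with trimmedString shows whether "ro" and "en" were close competitors before the cleanup.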
Ahmad F

Remove the leading non-alphanumeric characters (whitespace, !, @, #, etc.) from the string, then try to detect the language.

extension String {
    func findFirstAlphabetic() -> String.Index? {
        for index in self.indices {
            if String(self[index]).isAlphanumeric {
                return index
            }
        }
        return nil
    }

    var isAlphanumeric: Bool {
        return !isEmpty && range(of: "[^a-zA-Z0-9]", options: .regularExpression) == nil
    }

    func alphabetic_Leading_SubString() -> String? {
        if let startIndex = self.findFirstAlphabetic() {
            let newSubString = self[startIndex..<self.endIndex]
            return String(newSubString)
        }
        return nil
    }
}

Usage:

let string = "\r\n                A Simple PDF File \r\n                   This is a small demonstration .pdf file - \r\n                   just for use in the Virtual Mechanics tutorials. More text. And more \r\n                   text. And more text. And more text. And more text. \r\n                   And more text. And more text. And more text. And more text. And more \r\n                   text. And more text. Boring, zzzzz. And more text. And more text. And \r\n                   more text. And more text. And more text. And more text. And more text. \r\n                   And more text. And more text. \r\n                   And more text. And more text. And more text. And more text. And more \r\n                   text. And more text. And more text. Even more. Continued on page 2 ...\r\n                Simple PDF File 2 \r\n                   ...continued from page 1. Yet more text. And more text. And more text. \r\n                   And more text. And more text. And more text. And more text. And more \r\n                   text. Oh, how boring typing this stuff. But not as boring as watching \r\n                   paint dry. And more text. And more text. And more text. And more text. \r\n                   Boring.  More, a little more text. The end, and just as well. "
detectedLangaugeFormat(for: string.alphabetic_Leading_SubString()!)
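The same idea can be sketched more compactly with the standard library's drop(while:). Note that this is a variant, not the extension above: it uses Character.isLetter and Character.isNumber, so it also accepts non-ASCII letters and digits, and it returns an empty string instead of nil when nothing alphanumeric is found. The function name is hypothetical.

```swift
/// Hypothetical helper (not from the original answer): removes everything
/// before the first letter or digit in the string.
func droppingLeadingNonAlphanumerics(_ text: String) -> String {
    // drop(while:) skips characters until the first letter or digit is reached.
    return String(text.drop(while: { !($0.isLetter || $0.isNumber) }))
}

let raw = "\r\n                A Simple PDF File \r\n"
let stripped = droppingLeadingNonAlphanumerics(raw)
// stripped now begins with "A Simple PDF File"; trailing whitespace is left untouched.
```

Because the result is not optional, it can be passed to detectedLangaugeFormat(for:) without force unwrapping.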
Manikandan