I'm using an image processing API that reads text in an image. From the string I get back, I need to extract English dictionary words and common first and last names. In other words, the string contains the text I need but also some garbage (non-words) that I need to filter out. What's the best approach here? I've looked into NSLinguisticTagger, but it doesn't quite fit what I'm doing. Any other suggestions?

Would a regex help me here? I can't figure out how to write a pattern that matches only words.

Below are 2 examples of sample strings I'm trying to pull words/names from:

(1) "PUMPER im CasSICI 1111 Cassu with Andrew Webster PUMPE im CasSICI 1111 Cassu with Andrew Webster"
// I need to extract: "Pumper With Andrew Webster"

(2) "SHARON M DRAPER000k in the powerful Hazelwood High trilogyFORGEDBY FIRESWINNER SHARON M DRAPER 000k in the powerful Hazelwood High trilogy FORGED BY FIRE S WINNER"
// I need to extract: "Sharon Hazelwood High Draper in the powerful trilogy forged by fire winner"

Midhun MP
GarySabo
  • I don't know exactly how complex your problem is, but you might need to look into NLP (Natural Language Processing), especially NER (Named Entity Recognition). – Randy Dec 22 '16 at 23:14
  • @GarySabo, I don't understand this line: `I need to extract words in English and names out of a string`. What does it mean? – aircraft Dec 23 '16 at 02:14
  • Do you have a list of words/names you'd like to extract, or are you looking for a pattern? If you're looking for a pattern, then you could use something like `Regex`. http://stackoverflow.com/a/27880748/4475605 I threw a quick demo together https://github.com/AdrianBinDC/RegexDemo. If you're searching for specific strings, that's pretty easy. – Adrian Dec 23 '16 at 03:26
  • Thanks guys, what I mean is I need to identify whether letters in a string form a word in the English dictionary (or slang etc.) or a common first or last name. – GarySabo Dec 23 '16 at 04:45
  • Re: common names, here's a list that might be helpful for checking your string against. https://github.com/fivethirtyeight/data/tree/master/most-common-name. Backing up a step, what OCR framework are you using? Will the font being OCR'd be the same every time? If so, you might be able to train your OCR w/ the font so you've got less of a hot OCR mess to deal with, which reduces the work you've got to do w/ pulling out data from the strings. – Adrian Dec 23 '16 at 17:52
  • @Adrian thanks for the name list. I'm using Google Cloud Vision. I thought maybe Google would have a method in the API to only give back words in English, but I haven't found anything in the docs yet. – GarySabo Dec 24 '16 at 14:47
  • I've got an idea I think will get you most of the way there. Let me tinker around and I'll chuck something on Github when I'm done – Adrian Dec 24 '16 at 16:27

1 Answer

I've cobbled this class together, which is a mixture of real and pseudocode. I would create a singleton class for first names and last names. See comments in code for details. This isn't the whole thing, but it should solve most of your problem.

Update: Tweaked the cleanUpString method to use a switch statement.

Update 2: Added this to take care of whatever UITextChecker doesn't catch:

return UIReferenceLibraryViewController.dictionaryHasDefinition(forTerm: self)

Wherever you're getting your OCR text from, you'd use it like this:

let stringParser = StringParser()
let cleanedUpText = stringParser.cleanUpString(yourOCRText)

Here's the class:

import UIKit // need this so UITextChecker will work
import Foundation

class StringParser: NSObject {

    // TODO: You'll need to create a singleton class for your first and last names
    // https://krakendev.io/blog/the-right-way-to-write-a-singleton

    func cleanUpString(_ inputString: String) -> String {

        // split on spaces so each space-separated chunk becomes its own string in an array
        // (String is itself a collection in Swift 4+, so the old `.characters` view is not needed)
        let inputStringArray = inputString.split(separator: " ").map(String.init)

        var outputArray = [String]()

        for word in inputStringArray {
            // Switch to check if word satisfies any of the desired conditions...if so, chuck in outputArray

            switch word {
            case _ where word.isRealWord():
                outputArray.append(word)
            case _ where word.isFirstName():
                outputArray.append(word.capitalized)
            case _ where word.isLastName():
                outputArray.append(word.capitalized)
            default:
                // not a dictionary word or a known name, so drop it
                break
            }
        }

        // reassemble the cleaned up words into a single string and return it
        return outputArray.joined(separator: " ")
    }
}

extension String {

    func isFirstName() -> Bool {
        let firstNameArray = ["Andrew", "Sharon"] // FIXME: this should be your singleton

        return firstNameArray.contains(self.capitalized)
    }

    func isLastName() -> Bool {
        let lastNameArray = ["Webster", "Hazelwood"] // FIXME: this should be your singleton

        return lastNameArray.contains(self.capitalized)
    }

    func isRealWord() -> Bool {
        // adapted from https://www.hackingwithswift.com/example-code/uikit/how-to-check-a-string-is-spelled-correctly-using-uitextchecker
        let checker = UITextChecker()
        let range = NSRange(location: 0, length: self.utf16.count)
        let misspelledRange = checker.rangeOfMisspelledWord(in: self, range: range, startingAt: 0, wrap: false, language: "en")

        if misspelledRange.location == NSNotFound {
            // catches what UITextChecker misses
            return UIReferenceLibraryViewController.dictionaryHasDefinition(forTerm: self) // returns true if there's a definition for it
        }
        return false
    }
}
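The two FIXME comments above mention a singleton for the name lists. A minimal sketch of what that could look like follows. `NameStore`, `shared`, and the sample names are hypothetical placeholders rather than anything from a real library, and a real app would presumably load a much larger list (for example, from a bundled file) instead of hard-coding it.

```swift
import Foundation

// Hypothetical singleton holding the name lists. The type name, its members,
// and the sample data are illustrative assumptions, not an existing API.
final class NameStore {
    static let shared = NameStore()

    // Names are stored lowercased so lookups ignore the casing of OCR output.
    private let firstNames: Set<String>
    private let lastNames: Set<String>

    // Private initializer so `shared` is the only instance.
    private init() {
        // Placeholder data; swap in a full name list here.
        firstNames = ["andrew", "sharon"]
        lastNames = ["webster", "draper"]
    }

    func isFirstName(_ word: String) -> Bool {
        return firstNames.contains(word.lowercased())
    }

    func isLastName(_ word: String) -> Bool {
        return lastNames.contains(word.lowercased())
    }
}
```

With something like this in place, `isFirstName()` in the `String` extension could return `NameStore.shared.isFirstName(self)`, and likewise for last names. A `Set` is used here instead of an array so each lookup does not scan the whole list.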
Adrian
  • This will work on a physical device, but if you run it in the simulator it will report that there's no dictionary. – Adrian Dec 24 '16 at 20:37
  • Thanks @Adrian! I really appreciate you taking the time. I'm testing in a playground; should it work there? The dictionary doesn't seem to be working, as it's not returning true for English words. – GarySabo Dec 25 '16 at 04:57
  • @GarySabo I could only get 'UIReferenceLibraryViewController' to work on a physical device – Adrian Dec 25 '16 at 09:21
  • Thanks @Adrian, finally had a chance to try on device and it works pretty well! I'd like to get something eventually I could test on the simulator but this is a great solution. Thanks so much! – GarySabo Dec 26 '16 at 22:02
  • Cool! This was one of those rare occasions where the OP was trying to do something that's close to something I was working on. If you figure out how to get `UIReferenceLibraryViewController`'s dictionary in simulator, feel free to update the answer and I'll do the same. – Adrian Dec 27 '16 at 00:05