
I need to comb through a large set of data and look for some specific names.
They might appear with syntax errors in the texts.
What solution should I adopt?

Common syntax errors:

  1. ernst hmngi
  2. Hurnest Huminguee
  3. Ersnet Henimgway
– Itay Moav -Malimovka

3 Answers


Those are spelling errors. Regular expressions are not for this kind of task; you should look into Soundex. There is a CPAN module for it:

http://metacpan.org/pod/Text::Soundex

It finds matching words broadly based on phonetics (how the words sound when spoken) in American English.
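
For illustration, here is a minimal pure-Python sketch of the standard Soundex algorithm (Text::Soundex implements the same scheme; the test names are just the examples from the question). Note that it maps Huminguee and Henimgway to the same code as hemingway, while the heavily truncated hmngi gets a different code:

_CODES = {}
for digit, letters in [("1", "bfpv"), ("2", "cgjkqsxz"), ("3", "dt"),
                       ("4", "l"), ("5", "mn"), ("6", "r")]:
    for letter in letters:
        _CODES[letter] = digit

def soundex(word):
    word = word.lower()
    digits = []
    prev = _CODES.get(word[0], "")
    for ch in word[1:]:
        code = _CODES.get(ch)
        if code:
            if code != prev:            # collapse runs of the same digit
                digits.append(code)
            prev = code
        elif ch not in "hw":            # vowels separate duplicate codes
            prev = ""
    return (word[0].upper() + "".join(digits) + "000")[:4]

for name in ["hemingway", "Huminguee", "Henimgway", "hmngi"]:
    print(name, soundex(name))          # H552, H552, H552, H520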

– szabgab

You could look into approximate regexp matching as implemented in, e.g., the TRE library. Using the TRE tool tre-agrep with different error-tolerance values, I can match all the variants:

$ cat > test.txt
ernst hmngi
Hurnest Huminguee
Ersnet Henimgway
$ tre-agrep -4 -i "ernest hemingway" test.txt 
Ersnet Henimgway
$ tre-agrep -5 -i "ernest hemingway" test.txt 
Hurnest Huminguee
Ersnet Henimgway
$ tre-agrep -6 -i "ernest hemingway" test.txt 
ernst hmngi
Hurnest Huminguee
Ersnet Henimgway
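
If you would rather do this from a script, the third-party Python regex module (not the standard re module) supports a similar fuzzy-matching syntax. A small sketch using the same tolerance as the last run above:

import regex  # third-party: pip install regex

lines = ["ernst hmngi", "Hurnest Huminguee", "Ersnet Henimgway"]

# {e<=6} allows up to 6 edits (insertions, deletions or substitutions).
pattern = regex.compile(r"(?:ernest hemingway){e<=6}", regex.IGNORECASE)

for line in lines:
    if pattern.search(line):
        print(line)                     # all three variants match
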
– Fred Foo

Given that

  • you have a dictionary (i.e. a list) of specific names you are looking for
  • you are doing this for English, which can be tokenised in a relatively straightforward way (e.g. by using white space and punctuation as token boundaries)

the following approach should work well:

  1. Prepare a dictionary of the names in your list
  2. Tokenise the text
  3. Consider a) each token, b) each pair of consecutive tokens, c) each triple of consecutive tokens as candidate names and look them up in the dictionary using approximate string matching techniques (a sketch of steps 2 and 3 follows below).
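
A sketch of steps 2 and 3 in Python, with the actual dictionary lookup left abstract (the matching itself is whichever technique you pick from the list that follows):

import re

def candidate_names(text, max_tokens=3):
    # Step 2: tokenise on white space and punctuation.
    tokens = re.findall(r"\w+", text)
    # Step 3: every run of 1, 2 or 3 consecutive tokens is a candidate.
    for i in range(len(tokens)):
        for n in range(1, max_tokens + 1):
            if i + n <= len(tokens):
                yield " ".join(tokens[i:i + n])

for cand in candidate_names("He met ernst hmngi in Paris"):
    print(cand)    # "He", "He met", "He met ernst", "met", ..., "ernst hmngi", ...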

There are several possible strategies for implementing approximate string matching (I'd recommend trying (E) below first):

A) Methods that reduce the string and all dictionary entries to a canonical form before lookup. Soundex is one such method. The main general problem with these methods is that they do not provide ranking by string similarity, so you might get many different candidates but have no idea which one matches best. Furthermore, the canonical form is based on pronunciation rules for specific languages (e.g. Soundex for English), which is not good for names, especially non-English ones. It is also problematic because the errors you are dealing with are probably caused by mistyping, not mispronouncing a name. E.g. using a 'q' instead of a 'w' might be a frequent problem for you, because 'q' and 'w' are located next to each other on the keyboard, while their pronunciation is totally different.

B) Methods that use a search trie to implement the dictionary. During look-up in the trie, you can allow for one or two mismatches and thus find slightly misspelled candidates. The main problem here is that the lookup typically becomes unacceptably slow as soon as you allow for more than 2 character mismatches, in particular when mismatches are allowed at the beginning of strings. There are certain ways to optimise the performance, though. See here for a few ideas.
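
A minimal sketch of such a trie lookup. To keep it short, it only spends errors on character substitutions; real implementations also allow insertions and deletions, which is exactly what makes them slow for large error counts:

def build_trie(words):
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = word                    # end marker, remember the entry
    return root

def fuzzy_lookup(node, word, max_errors):
    # Walk the trie, spending one error per substituted character.
    if not word:
        if "$" in node:
            yield node["$"]
        return
    for ch, child in node.items():
        if ch == "$":
            continue
        cost = 0 if ch == word[0] else 1
        if cost <= max_errors:
            yield from fuzzy_lookup(child, word[1:], max_errors - cost)

trie = build_trie(["ernest hemingway", "mark twain"])
print(list(fuzzy_lookup(trie, "ersnet henimgway", 5)))  # ['ernest hemingway']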

C) Methods based on n-gram look-up. Here you can use a hash table for the dictionary implementation, but rather than putting the names into the hash directly, you split each name into its character n-grams (for predefined n, typically 2 or 3) and put the n-grams into the dictionary. For example, for

hemingway

you will put

hem
emi
min
ing
ngw
gwa
way

into the hash. During look-up, you do the same with the candidate string: look up all its n-grams and accept the name that has the highest number of n-grams in common with the input. For example, if the input is hemmgway, you'll find that it has three n-grams (hem, gwa, way) in common with the dictionary entry hemingway.

This method works relatively well if your strings are fairly long and have only a few errors here and there. It may not be optimal in your case, but you might want to give it a try.
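
A sketch of this n-gram index in Python, using trigrams and a simple shared-count score (it reproduces the hemmgway example above):

from collections import defaultdict

def trigrams(word):
    return {word[i:i + 3] for i in range(len(word) - 2)}

def build_index(names):
    index = defaultdict(set)
    for name in names:
        for gram in trigrams(name):
            index[gram].add(name)
    return index

def best_match(index, candidate):
    # Count how many trigrams the candidate shares with each indexed name.
    counts = defaultdict(int)
    for gram in trigrams(candidate):
        for name in index.get(gram, ()):
            counts[name] += 1
    return max(counts, key=counts.get) if counts else None

index = build_index(["hemingway", "faulkner"])
print(best_match(index, "hemmgway"))    # hemingway (3 trigrams in common)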

D) Methods that use Levenshtein automata to implement the dictionary. This is a relatively complicated method, and also has problems when you want to allow for a very large number of errors. A detailed description is found in this paper by Schulz and Mihov. I am unsure whether there is a ready-to-use, open-source implementation available.

E) Methods that combine an implementation of the Levenshtein edit distance function with a metric tree. Given your description I believe this would work best for you, and I have used this method myself in a similar situation. You'll find further references in the answers to this SO question, and a link to an implementation (which I haven't tried, though) in this SO question.
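
To show the mechanics, here is a compact sketch of a BK-tree (one common kind of metric tree) over plain Levenshtein distance. It is only illustrative; the implementations linked above are the ones to try for real workloads:

def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

class BKTree:
    def __init__(self, words):
        words = iter(words)
        self.root = [next(words), {}]
        for word in words:
            self.add(word)

    def add(self, word):
        node = self.root
        while True:
            d = levenshtein(word, node[0])
            if d == 0:
                return                      # already present
            if d not in node[1]:
                node[1][d] = [word, {}]
                return
            node = node[1][d]

    def query(self, word, max_dist):
        # The triangle inequality lets us prune whole subtrees.
        results, stack = [], [self.root]
        while stack:
            node_word, children = stack.pop()
            d = levenshtein(word, node_word)
            if d <= max_dist:
                results.append((d, node_word))
            for edge, child in children.items():
                if d - max_dist <= edge <= d + max_dist:
                    stack.append(child)
        return sorted(results)

tree = BKTree(["ernest hemingway", "mark twain", "jane austen"])
print(tree.query("ersnet henimgway", 4))    # [(4, 'ernest hemingway')]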

– jogojapan
  • I find that `agrep -z` is pretty handy for this, if you’re sure what you’re looking for is in the text. – tchrist Mar 06 '12 at 04:04