I am looking for a way to do a fuzzy match using regular expressions. I'd like to use Perl, but a recommendation for any way to do this would be helpful.
As an example, I want to match a string on the words "New York" preceded by a 2-digit number. The difficulty comes because the text is from OCR of a PDF, so I want to do a fuzzy match. I'd like to match:
12 New York
24 Hew York
33 New Yobk
and other "close" matches (in the sense of the Levenshtein distance), but not:
aa New York
11 Detroit
Obviously, I will need to specify the allowable distance ("fuzziness") for the match.
As I understand it, I cannot use the String::Approx Perl module to do this, because I need to include a regular expression in my match (to match the preceding digits).
Also, I should note that this is a very simplified example of what I'm really trying to match, so I'm not looking for a brute-force approach.
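To pin down the behaviour I mean for this simple case, here is a minimal sketch. It is in Python only because that keeps it compact; the same idea carries over to Perl. An ordinary regex checks the two leading digits, and a hand-written Levenshtein function checks the rest of the line. The names `levenshtein` and `fuzzy_match` and the threshold `max_dist=2` are illustrative assumptions, and this is exactly the kind of hand-rolled approach that will not scale to the real data.

```python
import re

def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # drop a character of a
                            curr[j - 1] + 1,      # add a character of b
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]

def fuzzy_match(line: str, target: str = "New York", max_dist: int = 2) -> bool:
    """True if line is two digits, a space, then text within max_dist of target."""
    m = re.fullmatch(r"(\d{2}) (.+)", line)  # the exact (non-fuzzy) part
    if m is None:
        return False
    return levenshtein(m.group(2), target) <= max_dist

for line in ["12 New York", "24 Hew York", "33 New Yobk",
             "aa New York", "11 Detroit"]:
    print(line, fuzzy_match(line))
```

The first three lines are accepted and the last two are rejected. The sketch only works because the fuzzy part happens to be the whole remainder of the line, which is not true of my real data.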
Edited to add:
Okay, my first example was too simple. I didn't mean for people to get hung up on the preceding digits -- sorry about the bad example. Here's a better example. Consider this string:
ASSIGNOR, BY MESHS ASSIGN1IBNTS, TO ALUSCHALME&S MANOTAC/rURINGCOMPANY, A COBPOBATlOH OF DELAY/ABE.
What this actually says is:
ASSIGNOR, BY MESNE ASSIGNMENTS, TO ALLIS-CHALMERS MANUFACTURING COMPANY, A CORPORATION OF DELAWARE
What I need to do is extract the phrases "ALUSCHALME&S MANOTAC/rURINGCOMPANY" and "DELAY/ABE". (I realize this might seem like madness. But I'm an optimist.) In general, the pattern will look something like this:
/Assignor(, by mesne assignments,)? to (company name), a corporation of (state)/i
where the matching is fuzzy.
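To illustrate what "fuzzy" would have to mean here, the following is a rough sketch, again in Python and under my own assumptions. It treats the literal parts of the pattern as anchors, finds each one by approximate substring search (a Sellers-style dynamic program that also tracks where the match begins), and takes whatever lies between and after the anchors as the captures. The helper names `fuzzy_find` and `extract`, the anchor strings, and the distance limits 1 and 5 are all hypothetical, and the optional "by mesne assignments" clause is simply ignored.

```python
def fuzzy_find(pattern, text, max_dist, start=0):
    """Best approximate occurrence of pattern in text[start:].

    Returns (begin, end, distance) for the lowest-distance match with
    distance <= max_dist (earliest end on ties), or None if there is none.
    """
    m = len(pattern)
    # col[i] = (distance, begin) for pattern[:i] against a substring of
    # text that ends at the current position and starts at `begin`.
    col = [(i, start) for i in range(m + 1)]
    best = None
    for j in range(start, len(text)):
        new = [(0, j + 1)]  # an empty prefix can start anywhere for free
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == text[j] else 1
            new.append(min(
                (col[i - 1][0] + cost, col[i - 1][1]),  # match / substitute
                (new[i - 1][0] + 1, new[i - 1][1]),     # pattern char missing
                (col[i][0] + 1, col[i][1]),             # extra char in text
            ))
        col = new
        dist, begin = col[m]
        if dist <= max_dist and (best is None or dist < best[2]):
            best = (begin, j + 1, dist)
    return best

def extract(text):
    """Return (company, state) or None if an anchor cannot be found."""
    upper = text.upper()  # case-insensitive; assumes ASCII so indices line up
    to = fuzzy_find(", TO ", upper, max_dist=1)
    if to is None:
        return None
    corp = fuzzy_find(", A CORPORATION OF ", upper, max_dist=5, start=to[1])
    if corp is None:
        return None
    company = text[to[1]:corp[0]]
    state = text[corp[1]:].strip(" .")
    return company, state

ocr = ("ASSIGNOR, BY MESHS ASSIGN1IBNTS, TO ALUSCHALME&S "
       "MANOTAC/rURINGCOMPANY, A COBPOBATlOH OF DELAY/ABE.")
print(extract(ocr))
```

On the sample string this yields the two phrases I am after. What I would really like is a way to write the whole thing as a single pattern with a tolerance, rather than chaining anchors by hand like this.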