29

I am looking for a way to do a fuzzy match using regular expressions. I'd like to use Perl, but if someone can recommend any way to do this that would be helpful.

As an example, I want to match a string on the words "New York" preceded by a 2-digit number. The difficulty comes because the text is from OCR of a PDF, so I want to do a fuzzy match. I'd like to match:

12 New York
24 Hew York
33 New Yobk

and other "close" matches (in the sense of the Levenshtein distance), but not:

aa New York
11 Detroit

Obviously, I will need to specify the allowable distance ("fuzziness") for the match.

As I understand it, I cannot use the String::Approx Perl module to do this, because I need to include a regular expression in my match (to match the preceding digits).

Also, I should note that this is a very simplified example of what I'm really trying to match, so I'm not looking for a brute-force approach.

Edited to add:

Okay, my first example was too simple. I didn't mean for people to get hung up on the preceding digits -- sorry about the bad example. Here's a better example. Consider this string:

ASSIGNOR, BY MESHS ASSIGN1IBNTS, TO ALUSCHALME&S MANOTAC/rURINGCOMPANY, A COBPOBATlOH OF DELAY/ABE.

What this actually says is:

ASSIGNOR, BY MESNE ASSIGNMENTS, TO ALLIS-CHALMERS MANUFACTURING COMPANY, A CORPORATION OF DELAWARE

What I need to do is extract the phrase "ALUSCHALME&S MANOTAC/rURINGCOMPANY" and "DELAY/ABE". (I realize this might seem like madness. But I'm an optimist.) In general, the pattern will look something like this:

/Assignor(, by mesne assignments,)? to (company name), a corporation of (state)/i

where the matching is fuzzy.

pnuts
  • 58,317
  • 11
  • 87
  • 139
itzy
  • 11,275
  • 15
  • 63
  • 96
  • Do you have more examples? This case *can* use String::Approx — just split the string into the number part and "New York" part. – kennytm Nov 11 '10 at 15:16
  • The problem might be that to identify the New York part needs the fuzzy matching... – The Archetypal Paul Nov 11 '10 at 15:18
  • does the non-simplified actual case involve one or more fixed strings like "New York"? how long is this string likely to be? what range of allowable distance do you anticipate using? your example only shows changed characters; do you need to handle extra/missing characters too? – ysth Nov 11 '10 at 15:28
  • If you are doing the OCR yourself you should look into whether you can add a user dictionary. It sounds like you already have a list of important but mis-recognized words. – Ben Jackson Nov 20 '10 at 18:44
  • @Ben: Thanks for the tip. Alas, the OCR was done before I got the data and I only have the text files. – itzy Nov 20 '10 at 20:21
  • Duplicate of http://stackoverflow.com/questions/2351630/fuzzy-regular-expressions ? – Thomas Ahle May 27 '11 at 17:20
  • Hey, thats easy: Use `agrep`: http://www.tgries.de/agrep/ Check its license if you could hijack and use the source. – towi Nov 18 '10 at 13:14
  • I **love** Udi Manber’s `agrep` program! Use it all the time. Udi and his wife Rachel were my professors at university. But I don’t think it’s been ported to Perl. Jarkko had some sort of approximate grep module at one point back in the Middle Ages, but it seems to me it wasn’t at all as good as Udi’s approach. Check CPAN. – tchrist Nov 18 '10 at 13:26
  • @tchrist: What's the problem with calling it as a system binary? Solid, old-school UNIX utility... – Charles Stewart Nov 19 '10 at 08:44
  • @Charles, **I** have no problem with that idea, but many people do. – tchrist Nov 19 '10 at 11:24

10 Answers10

17

If you have one pattern you want to find the best match against a text collection you can try q-gram distance. It is quite easy to implement and adopt to special needs.

Your second description actually was helpful here, because the pattern and texts should be fairly long. q-gram distance does not work well with words like "York", but if your typical pattern is a whole address, that should be fine.

Try it like this:

  • transform your texts and patterns into a reduced character set, like uppercase-only, stripping, wordifying (one space between words) all symbols replaced by "#" or something.
  • choose a q-gram length, to work with. Try 3 or 2. We call this q=3.
  • than, build a qgram-profile of each text:
  • split each text into q-words, ie. NEW_YORK becomes [NEW, EW_, W_Y, _YO, ORK], store this away with each text.
  • if you search for your pattern then, you do the same with your pattern,
  • loop through your text-qgram-database and
    • count for each pattern/text-pair how many qgrams are the same.
    • each hit will raise the score by 1.
  • the texts with the highest score(s) are your best hits.

If you did that you can tweak this algorithm by:

  • prepend all you texts (and also the pattern before search), with q-1 special chars, so even your short words will get a decent profile. For example New York becomes ^^NEW YORK$$.
  • You can even play around with replacing all consonants with "x" and vowels with "o" and so on. Play around with a couple of character classes this way, or even create super symbols by replacing groups of character by one other, i.e. CK becomes K, or SCH becomes $.
  • when raising the score by a qgram-hit you can adjust the value of 1 by other things, like length-difference of text vs pattern.
  • store 2-grams and 3-grams both, and when counting, weigh then differently.

Note that this algorithm in the here described basic form does not have a good running time during search, i.e. O(|T|*|P|) (with |T| and |P| the total lengths of your text and pattern). This is because I described that you loop over all your texts, and then over your pattern. Therefore this is only practical for a medium-sized texts-base. If you spend some thought, you can create an advanced index structure over the q-grams (maybe using hashtables), so this might be practical for huge texts-bases as well.

towi
  • 21,587
  • 28
  • 106
  • 187
3

Regexes have specific rules, they aren't built for doing what you want. It's going to be much easier to make two passes at it. Use a regex to strip off the numbers and then use a module to get your match close.

Something like this (assuming your input is lines from a file)

while( my $line = <$fh> ) {
    chomp $line;

    # do we have digits?
    if( $line =~ /^\d+/ ) {
         # removes spaces and digits from the beginning of the line
         $line =~ s/^[\d\s]*//g;

         # use your module to determine if you have a match in the remaining text.
         if( module_match ) {
             # do something
         }
         else {
             #no match
         }
    }
    else {
        # no match
    }
}
Cfreak
  • 19,191
  • 6
  • 49
  • 60
3

You could try using something like Web 1T 5-gram Version 1 and a conditional likelihood maximization approach.

If I recall correctly, Chapter 14 of Beautiful Data is devoted to this data set and how to use it to spot spelling errors etc.

Sinan Ünür
  • 116,958
  • 15
  • 196
  • 339
3

Did you look into using Jarkko’s String::Approx module on CPAN? It has the agrep algorithm in it, but is much slower than Udi’s.

tchrist
  • 78,834
  • 30
  • 123
  • 180
2

Have you considered a two-stage test, using regex to enforce the requirement of [0-9]{2,2} (.*), then capturing the remaining text and doing a fuzzy match on it? Try thinking of the problem as an intersection of a regular expression and a fuzzy string.

Seth Johnson
  • 14,762
  • 6
  • 59
  • 85
2

Separate the problem into two parts:

  1. Match the double-digit number.
  2. Fuzzily match the residue against 'New York'.

In the example, you know that 'New York' consists of 2 words; you might be able to leverage that to eliminate alternatives like 'Detroit' (but not necessarily 'San Francisco') more easily.

You might even be able to use 'String::Approx' after all, though it mentions:

... the Text::Levenshtein and Text::LevenshteinXS modules in CPAN. See also Text::WagnerFischer and Text::PhraseDistance.

(My Perl was unable to find Text::PhraseDistance via CPAN - the others are available and install OK.)

Jonathan Leffler
  • 730,956
  • 141
  • 904
  • 1,278
  • those are really useful references, thanks! BTW, how does StackOverflow parse `@somebody` references for notification? Does it just do `/\@(\w+)/` or `/\@(\S+)/` or `/\@([\w\s]+)\s*\pP/` or what? In other words, how the devil does it know where to stop scanning when looking for a username? Is the code available for scanning? – tchrist Nov 11 '10 at 20:14
  • @tchrist: I'm not sure where the parsing is done - probably not in client-side code (so not visible to us). It uses the first three letters of the name and matches that against prior comments, etc. I don't abbreviate to minimize the risk of going to the wrong person. I think if you looked on the SO blog (for a year or more ago), you'd find a discussion about how it does work. – Jonathan Leffler Nov 11 '10 at 20:28
  • Thanks, but I don't think these allow regular expressions. I suppose I could try to use the distance in combination with regular expressions somehow, but frankly I don't see a way that could handle my 2nd example above. – itzy Nov 11 '10 at 21:31
  • @itzy: yes, your requirement is quite a bit more complex than the first edition of your question suggested. I'm assuming that the scanned string is a genuine example (or close enough that I can't tell the difference). You only have 'ASSIGNOR', 'BY', 'TO', 'A', 'OF' words to hang the analysis off - and I'd not be confident that those could not be mis-recognized, either. A lot will hinge on the list of possible values for words. The OF is followed by a state name - in the example. But if a company is "Allen's of Idaho" (a corporation of Delaware, of course), then you get into problems again. – Jonathan Leffler Nov 11 '10 at 22:31
2

Well you can narrow down your candidates with Text::Levenshtein to get the edit distance and grepping by a comparison to the limit.

But another idea is that you can take the correct form and create a hash keyed from near-misses pointing to the proper form so that those might become candidates as well.

For regexes, you possibly would have to use the experimental code sections, perhaps something like this:

m/ (?i: [new] | \p{Alpha} (?{ $misses++ }) ){2,4}
   \s+
  (?i: [york] | \p{Alpha} (?{ $misses++ }) ){3,5}
 /x

Although in this case, you'd probably have to have a regex per proper value. You probably want some flag indicating when you missed your target.

Axeman
  • 29,660
  • 2
  • 47
  • 102
  • In Perl, 'grep' has a specific meaning (loosely related to the Unix 'grep') that seems unlikely to be helpful in this context. You probably need to rephrase your answer. – Jonathan Leffler Nov 11 '10 at 15:25
2

Rule of thumb: When you have to go to Stack Overflow and ask "How can I do X in a single regex?" you should consider doing X with more than just a single regex.

Based on your edits, I would do something like this:

while(<>) {
  chomp;
  if(/assignor, by (\w+) (\w+), to (\w+), a (\w+) of (\w+)/i) {
    # now use String::Approx to check that $1, $2, $3, $4, and $5 match
  } else {
    warn "Errors!\n";
  }
}

I'm not giving you everything here. I didn't make the ", by (\w+) (\w+)" bit optional to simplify the regex so you could get the gist of it. To do that you'll probably need to resort to named captures and the (?:) non-capturing group. I didn't feel like delving into all that, just wanted to help you understand how I would approach this.

Remember: If you have to ask "How do I do it all in a single regex?" you should stop trying to do it all in a single regex.

Chris Lutz
  • 73,191
  • 16
  • 130
  • 183
  • Also, you might want to check the spelling of "assignor", but that should be trivially easy to add if you get the rest of it. Also, you might want to change all the spaces in the regex to `\s+` (or maybe ` +`) in case someone accidentally typed two spaces. – Chris Lutz Nov 15 '10 at 01:43
  • Thanks. I actually am not looking for something that can be done in one line. I wrote the pattern like that to give a sense of what it will generally look like, but I will take any approach. I'd like to follow your suggestion, but the basic issue is that there could be a spelling mistake anywhere -- there's not guarantee that "buy" or "to" would be right. I want to have a match that allows for some errors (fuzziness) anywhere, but not too much. I think that's what makes this so difficult. I was hoping someone would know of some package that would be well-suited to this type of problem. – itzy Nov 16 '10 at 01:37
1

Although you specified perl, there is a useful algorithm built into R that implements Levenshtein edit distances.

agrep()

This command also allows the use of any regular expression or pattern to match. I would recommend you look at it. http://stat.ethz.ch/R-manual/R-devel/library/base/html/agrep.html

Brandon Bertelsen
  • 43,807
  • 34
  • 160
  • 255
1

Python regex module provide a way to do fuzzy matching within regexes:

https://pypi.org/project/regex/ (look for Approximate “fuzzy” matching)

The fuzziness of a regex item is specified between “{” and “}” after the item.

Examples:

foo match “foo” exactly
(?:foo){i} match “foo”, permitting insertions
(?:foo){d} match “foo”, permitting deletions
(?:foo){s} match “foo”, permitting substitutions
(?:foo){i,s} match “foo”, permitting insertions and substitutions
(?:foo){e} match “foo”, permitting errors
If a certain type of error is specified, then any type not specified will not be permitted.

In the following examples I’ll omit the item and write only the fuzziness:

{d<=3} permit at most 3 deletions, but no other types
{i<=1,s<=2} permit at most 1 insertion and at most 2 substitutions, but no deletions
{1<=e<=3} permit at least 1 and at most 3 errors
{i<=2,d<=2,e<=3} permit at most 2 insertions, at most 2 deletions, at most 3 errors in total, but no substitutions

So you could write, eg:

import regex, pprint

m = regex.compile( r'(?:Assignor(, by mesne assignments,)? to (company name), a corporation of (state)){e}', regex.IGNORECASE ).match('ASSIGNOR, BY MESHS ASSIGN1IBNTS, TO ALUSCHALME&S MANOTAC/rURINGCOMPANY, A COBPOBATlOH OF DELAY/ABE.')

pprint.pprint(m)
pprint.pprint(m.groups())

This does not work right away, the result would be:

<regex.Match object; span=(0, 71), match='ASSIGNOR, BY MESHS ASSIGN1IBNTS, TO ALUSCHALME&S MANOTAC/rURINGCOMPANY,', fuzzy_counts=(45, 0, 0)>
(', BY MESHS ASSIGN1IBNTS', ' ALUSCHALME&', 'PANY,')

But giving it some more tweaking (eg you could specify a maximum number of errors for each capture group) you should be able to reach you goal.

Loubregand
  • 11
  • 3