0

How to find if a string is present with one or two mismatch in another string?

my $find = "MATCH";
my $search = "stringisMATTHhere";

# $search has one mismatch: MATTH
# for exact match, this one seems working
if   ($search =~ /$find/){
       print "String found";
     }
else {
       print "String not found";
     }

How can I solve this issue with one mismatch: MSTCH, AATCH, MACCH, etc. and two mismatches: ATTCH, MGGCH, etc

SSh
  • 179
  • 2
  • 13
  • 3
    you need edit distance and not regex for this task – rock321987 Apr 27 '16 at 19:24
  • this question was closed as a duplicate of https://stackoverflow.com/questions/4155840/fuzzy-regular-expressions, but that question doesn't deal with searching in a larger string. – ysth Apr 27 '16 at 19:47
  • are mismatches just changed characters, not added/deleted/swapped ones? – ysth Apr 27 '16 at 19:49
  • how long are the strings you are actually looking for? – ysth Apr 27 '16 at 19:50
  • searching for a string (around 20 characters) if present in another string (1000 characters), check for partial string if present- just one or two positions anywhere in 20 character string's position (no deletion, addition, exact 20 characters) – SSh Apr 27 '16 at 20:18

4 Answers4

2

As far as I know there is only one convenient solution using a special REGEX engine: https://metacpan.org/pod/re::engine::TRE.

Here the solution for your example:

#!/usr/bin/perl

use strict;
use warnings;

use re::engine::TRE max_cost => 2;

my $find = "MATCH";
my $search = "stringisMATTHhere";

if ($search =~ /\($find\)/) {
    print $1,"\n";
}

This outputs:

$ perl fuzzy_re.pl 
MATTH
  • Not quite. `MTCH` would match, but the OP mentioned in a comment that only substitutions are allowed, not insertions or deletions. I think your solution can be fixed by using `use re::engine::TRE max_cost => 2, max_ins => 0, max_del => 0;` instead or `use re::engine::TRE max_cost => 2;`. – ikegami May 20 '16 at 21:04
  • @ikegami I tried `use re::engine::TRE max_cost => 2, max_ins => 0, max_del => 0;` but `MTCH` matched too – SSh May 25 '16 at 02:42
  • @SSh Seems like a bug. This works: `use re::engine::TRE max_cost => 2, cost_ins => -1, cost_del => -1;` – Helmut Wollmersdorfer Dec 25 '16 at 07:45
2

So you want to do

/
   ..TCH | .A.CH | .AT.H | .ATC. |
   M..CH | M.T.H | M.TC. | 
   MA..H | MA.C. |
   MAT..
/x

or

/
   \w\wTCH | \wA\wCH | \wAT\wH | \wATC\w |
   M\w\wCH | M\wT\wH | M\wTC\w | 
   MA\w\wH | MA\wC\w |
   MAT\w\w
/x

Easy enough:

my @subpats;
for my $i (0..length($find)-1) {
   for my $j ($i+1..length($find)-1) {
      my $subpat = join('',
         substr($find, 0, $i),
         '.',  # or '\\w'
         substr($find, $i+1, $j-$i-1),
         '.',  # or '\\w'
         substr($find, $j+1),
      );
      push @subpats, $subpat;
   }
}

my $pat = join('|', @subpats);

$search =~ /$pat/

Perl 5.10+ trie-based alternations should optimize the common leading prefixes into something efficient. Saves us the trouble of generating (?:.…|M…).

ikegami
  • 367,544
  • 15
  • 269
  • 518
2

If the searched string should have the same length (i.e. only mismatches allowed) as stated in a later comment, you can use Hamming distance, which is very fast:

#!/usr/bin/perl

use strict;
use warnings;

my $find = "MATCH";
my $search = "stringisMATTHhere";

my $max_distance = 2;

for my $offset (0..length($search)-length($find)) {
  my $hd = hd($find,substr($search,$offset,length($find)));
  if ($hd <= $max_distance) {
    print substr($search,$offset,length($find)),"\n";
  }
}

# assumes byte mode
sub hd {
  return ($_[0] ^ $_[1]) =~ tr/\001-\255//;
}
0

I got re-interested in this so I thought I'd try something with a little more
variability in a controlled way.

Features:
- Can set a min/max mismatch range for individual find's.
- Can set a flag to exclude/include space 0x20 or less in the mismatch count.
- Automatically escape meta-chars in find's.

That's it.
Good Luck!!


Regex:

 (?s)
 (?{ $cnt = 0; $lcnt = 0 })
 (?:
      (?>
           (??{ $aryinput[$lcnt++] })
        |  (?&getexpr) 
      )
 ){$len}
 (??{ $cnt >= $mincnt && $cnt <= $maxcnt ? '' : '(?!)' })

 (?(DEFINE)
      (?<getexpr>
           (??{ ++$cnt <= $maxcnt ? 
                 ( $visible_only ? 
                      ( $aryinput[$lcnt-1] le ' '  ?
                           '(?!)'
                        :  '[^\x{0}-\x{20}]'
                      )
                   :  '.'
                 )
              :  '(?!)'
           })
      )
 )

Perl code:

use strict;
use warnings;

my $target = 
   "
   one mismatch: MSTCH, AATCH, MACCH, etc. and two mismatches: ATTCH, MGGCH,
   MA1CH T23S  
   M.1CH T23S  
   MAT1 H2T3IS
   0M[T2CH  THaS
   0M[T2CH THaS
   MA1CH    THIS
   MATCH    THIS
   MATCHT1IS
   MA1CH THIS
   MAT1H THIb
   MATCH THIS
   MArCH THIS
   AATCH THIS
   [()+?.*{}|]
   [()X?.*{}|]
   [()+?.SS}|]
   ";

my @aryinput = ();
my ($rx, $find, $visible_only, $len, $cnt, $mincnt, $maxcnt, $lcnt) = ('', '',0,0,0,0,0,0);

my @TestRuns = (
    { find => 'MATCH THIS', visible => 1, min => 0, max => 3 },
    { find => 'MATCH', visible => 1, min => 0, max => 3 },
    { find => 'MATCH THIS', visible => 0, min => 0, max => 3 },
    { find => 'MATCH', visible => 0, min => 2, max => 3 },
    { find => 'MATCH', visible => 0, min => 1, max => 1 },
    { find => '[()+?.*{}|]', visible => 1, min => 1, max => 3 },
  );

for ( @TestRuns )
{
   GetParms( $_ );
   SetFindArray( $find );

   print "\nFind($len), ", ($visible_only ? "not counting control char" : "counting any char"), ", minmax($mincnt,$maxcnt):\n'$find'\n";
   while( $target =~ /$rx/g )
   {
       print "    cnt($cnt) : '$&'\n";
   }
}

# ==================================  
# ================================== 

sub GetParms
{
   my ($href) = @_;
   ( $find, $visible_only, $mincnt, $maxcnt ) = 
   ( $$href{find}, $$href{visible}, $$href{min}, $$href{max} );
}
sub SetFindArray
{
    my ($inp) = @_;
    @aryinput = ();
    @aryinput = map {  s/([\\().?*+{}|\[\]])/\\$1/; $_  } split '', $inp;
    $len = @aryinput;
    $rx = qr/(?s)(?{ $cnt = 0; $lcnt = 0 })(?s)(?:(?>(??{ $aryinput[$lcnt++] })|(?&getexpr))){$len}(??{ $cnt >= $mincnt && $cnt <= $maxcnt ? '' : '(?!)' })(?(DEFINE)(?<getexpr>(??{ ++$cnt <= $maxcnt ? 
                     ( $visible_only ? 
                          ( $aryinput[$lcnt-1] le ' '  ?
                               '(?!)'
                            :  '[^\x{0}-\x{20}]'
                          )
                       :  '.'
                     )
                  :  '(?!)'
               })))/;
}

Output:

Find(10), not counting control char, minmax(0,3):
'MATCH THIS'
    cnt(3) : 'MA1CH T23S'
    cnt(1) : 'MA1CH THIS'
    cnt(2) : 'MAT1H THIb'
    cnt(0) : 'MATCH THIS'
    cnt(1) : 'MArCH THIS'
    cnt(1) : 'AATCH THIS'

Find(5), not counting control char, minmax(0,3):
'MATCH'
    cnt(1) : 'MSTCH'
    cnt(1) : 'AATCH'
    cnt(1) : 'MACCH'
    cnt(2) : 'ATTCH'
    cnt(2) : 'MGGCH'
    cnt(1) : 'MA1CH'
    cnt(2) : 'M.1CH'
    cnt(3) : 'M[T2C'
    cnt(3) : 'M[T2C'
    cnt(1) : 'MA1CH'
    cnt(0) : 'MATCH'
    cnt(0) : 'MATCH'
    cnt(1) : 'MA1CH'
    cnt(1) : 'MAT1H'
    cnt(0) : 'MATCH'
    cnt(1) : 'MArCH'
    cnt(1) : 'AATCH'

Find(10), counting any char, minmax(0,3):
'MATCH THIS'
    cnt(3) : 'MA1CH T23S'
    cnt(2) : 'MA1CH     THIS'
    cnt(1) : 'MATCH     THIS'
    cnt(1) : 'MA1CH THIS'
    cnt(2) : 'MAT1H THIb'
    cnt(0) : 'MATCH THIS'
    cnt(1) : 'MArCH THIS'
    cnt(1) : 'AATCH THIS'

Find(5), counting any char, minmax(2,3):
'MATCH'
    cnt(3) : ' ATTC'
    cnt(2) : 'MGGCH'
    cnt(2) : 'M.1CH'
    cnt(2) : 'MAT1 '
    cnt(3) : 'M[T2C'
    cnt(3) : 'M[T2C'

Find(5), counting any char, minmax(1,1):
'MATCH'
    cnt(1) : 'MSTCH'
    cnt(1) : 'AATCH'
    cnt(1) : 'MACCH'
    cnt(1) : 'MA1CH'
    cnt(1) : 'MA1CH'
    cnt(1) : 'MA1CH'
    cnt(1) : 'MAT1H'
    cnt(1) : 'MArCH'
    cnt(1) : 'AATCH'

Find(11), not counting control char, minmax(1,3):
'[()+?.*{}|]'
    cnt(1) : '[()X?.*{}|]'
    cnt(2) : '[()+?.SS}|]'