3

Let's say I have a set of 100,000 strings, each one around 20-50 bytes on average.

Then let's say I have another set of 100,000,000 strings, each one also around 20-50 bytes on average.

I'd like to go through all 100,000,000 strings from the second set and check, for each one, whether any of the strings from the first set occurs in it as a substring.

Example: string from first set: "abc", string from second set: "123abc123" = match!

I've tried using Perl's index(), but it's not fast enough. Is there a better way to do this type of matching?
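
Roughly, what I've been doing with index() is the obvious nested loop, something like this (simplified; the file names are just placeholders):

#!/usr/bin/env perl
use strict;
use warnings;

# Read the 100,000 search strings (the "needles").
open my $fh1, '<', 'set1.txt' or die $!;
chomp(my @needles = <$fh1>);
close $fh1;

# Scan the 100,000,000 strings and report any that contain a needle.
open my $fh2, '<', 'set2.txt' or die $!;
while (my $line = <$fh2>) {
    chomp $line;
    for my $needle (@needles) {
        if (index($line, $needle) >= 0) {
            print "$line\n";
            last;               # one match per line is enough
        }
    }
}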

Friedrich
  • 41
  • 3
  • See also [Fastest way to find lines of a file from another larger file in Bash](https://stackoverflow.com/q/42239179/2173773) – Håkon Hægland Apr 14 '20 at 20:34
  • You could sort the 100,000 strings to cluster them into groups with identical 4-char prefix. Then, for each prefix, scan the 100,000,000 strings for those containing the prefix. Do a full index() only for strings which survived the first test. Experiment with the prefix length. – Axel Kemper Apr 14 '20 at 20:54
  • 1
    Sample data would sure help. Is this natural-language text? Random data? AGACT? – ysth Apr 14 '20 at 20:57
  • @ysth is asking because an algorithm that accepts any character will have to give up speed in order to achieve the objective without using too much memory. – ikegami Apr 14 '20 at 21:17
  • that and wanting to come up with my own data to run benchmarks – ysth Apr 14 '20 at 21:44
  • I haven't used it, but the [Algorithm::AhoCorasick::XS](https://metacpan.org/pod/Algorithm::AhoCorasick::XS) module looks like it has potential. – Shawn Apr 14 '20 at 22:40
  • 3
    Or skip perl and go straight to `grep -Ff set1.txt set2.txt` – Shawn Apr 14 '20 at 22:43
  • "it's not fast enough." Just how fast do you expect this to be? – Jim Mischel Apr 15 '20 at 14:58

5 Answers

4

I found Algorithm::AhoCorasick::XS on CPAN, which implements the classic, very efficient multiple-string-search Aho-Corasick algorithm (the same one used by grep -F), and it seems to be reasonably fast (about half the speed of an equivalent grep invocation):

Example script:

#!/usr/bin/env perl
use warnings;
use strict;
use autodie;
use feature qw/say/;
use Algorithm::AhoCorasick::XS;

open my $set1, "<", "set1.txt";
my @needles = <$set1>;
chomp @needles;
my $search = Algorithm::AhoCorasick::XS->new(\@needles);

open my $set2, "<", "set2.txt";
while (<$set2>) {
    chomp;
    say if defined $search->first_match($_);
}

and using it (with randomly-generated test files):

$ wc -l set1.txt set2.txt
  10000 set1.txt
500000 set2.txt
510000 total
$ time perl test.pl | wc -l
458414

real    0m0.403s
user    0m0.359s
sys     0m0.031s
$ time grep -Ff set1.txt set2.txt | wc -l
458414

real    0m0.199s
user    0m0.188s
sys     0m0.031s
Shawn
  • 47,241
  • 3
  • 26
  • 60
3

You should use a regex alternation, like:

my @string = qw/abc def ghi/;
my $matcher = qr/@{[join '|', map quotemeta, sort @string]}/;

This should be faster than using index. But it can be made faster yet:

Up to a certain limit, depending on both the length and the number of the strings, perl will build a trie for efficient matching; see e.g. https://perlmonks.org/?node_id=670558. You will want to experiment with how many strings you can include in a single regex before that limit is hit, then split your search strings into an array of regexes of that size, and finally combine those separate regexes into a single one (untested):

my @search_strings = ...;
my @matchers;
my $string_limit = 3000; # a guess on my part
my @strings = sort @search_strings;
while (my @subset = splice @strings, 0, $string_limit) {
    push @matchers, qr/^.*?(?:@{[join '|', map quotemeta, sort @subset]})/s;
}
my $matcher = '(?:' . join('|', map "(??{\$matchers[$_]})", 0..$#matchers) . ')';
$matcher = do { use re 'eval'; qr/$matcher/ };
/$matcher/ and print "line $. matched: $_" while <>;

The (??{...}) construct is needed to join the separate regexes; without it, the subregexes are all just interpolated and the joined regex compiled all together, removing the trie optimization. Each subregex starts with ^.*? so it searches the entire string; without that, the joined regex would have to invoke each subregex separately for each position in the string.

Using contrived data, I'm seeing about 3000 strings searched per second with this approach in a not very fast vm. The naive single regex manages less than 50 strings per second. Using grep, as suggested in a comment by Shawn, is faster (about 4200 strings per second for me) but gives you less control if you want to do things like identify which strings matched or at what positions.

ysth
  • 96,171
  • 6
  • 121
  • 214
  • I'd also merge a bunch of target lines to reduce the number of times the engine is started -- they've got 100M lines to search! (I recall that I once may have found a limit of 32kB for regex strings?) – zdim Apr 15 '20 at 05:33
0

You may want to have a look at https://github.com/leendo/hello-world . Its parallel processing makes it really fast. Either type in all search terms individually or as || conjunctions, or (better) adapt it to run the second set programmatically in one go.
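
I have not compared this against the project above, but the same parallel idea can be sketched in plain Perl with Parallel::ForkManager; the worker count, file names, and line-striping scheme below are only assumptions for illustration:

#!/usr/bin/env perl
# Sketch of the parallel idea only -- not code from the linked repository.
use strict;
use warnings;
use Parallel::ForkManager;

my $workers = 4;    # assumed number of CPU cores

# Build one alternation regex from the search terms.
open my $fh1, '<', 'set1.txt' or die $!;
chomp(my @needles = <$fh1>);
close $fh1;
my $re = join '|', map quotemeta, @needles;
$re = qr/$re/;

my $pm = Parallel::ForkManager->new($workers);
for my $wid (0 .. $workers - 1) {
    $pm->start and next;            # parent: launch the next worker
    open my $fh2, '<', 'set2.txt' or die $!;
    while (my $line = <$fh2>) {
        # Stripe the lines across workers so each handles 1/$workers of them.
        next unless ($. - 1) % $workers == $wid;
        print $line if $line =~ $re;
    }
    $pm->finish;                    # child exits here
}
$pm->wait_all_children;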

0

Here is an idea: you could partition the dictionary into lists of words that have the same 2- or 3-letter prefix. You would then iterate over the large set and, for each position in each string, extract the prefix and try to match the strings that have this prefix.

You would use a hashtable to store the lists, giving O(1) lookup time.

If some words in the dictionary are shorter than the prefix length, you would have to special case short words.

Making prefixes longer will make the hashtable larger but the lists shorter, improving the test time.

I have no clue if this can be implemented efficiently in Perl.
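
Something along these lines, perhaps -- a rough, untested Perl sketch of the idea, assuming a fixed 3-character prefix and handling words shorter than the prefix with a plain scan:

use strict;
use warnings;

my $PREFIX_LEN = 3;

# Partition the dictionary by prefix; words shorter than the prefix
# go into a separate list that is always checked.
my (%by_prefix, @short);
open my $dict, '<', 'set1.txt' or die $!;
while (my $w = <$dict>) {
    chomp $w;
    if (length($w) >= $PREFIX_LEN) {
        push @{ $by_prefix{ substr($w, 0, $PREFIX_LEN) } }, $w;
    }
    else {
        push @short, $w;
    }
}
close $dict;

open my $data, '<', 'set2.txt' or die $!;
LINE: while (my $line = <$data>) {
    chomp $line;

    # Words shorter than the prefix: brute-force scan.
    for my $w (@short) {
        if (index($line, $w) >= 0) { print "$line\n"; next LINE; }
    }

    # For every position, look up the candidate list for that prefix and
    # verify each candidate against the text starting at that position.
    for my $pos (0 .. length($line) - $PREFIX_LEN) {
        my $candidates = $by_prefix{ substr($line, $pos, $PREFIX_LEN) } or next;
        for my $w (@$candidates) {
            if (substr($line, $pos, length($w)) eq $w) {
                print "$line\n";
                next LINE;
            }
        }
    }
}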

chqrlie
  • 131,814
  • 10
  • 121
  • 189
-1

The task is quite simple and is probably performed on an everyday basis around the globe.

Please see the following code snippet for one of many possible implementations:

use strict;
use warnings;
use feature 'say';

use Getopt::Long qw(GetOptions);
use Pod::Usage;

my %opt;
my $re;

GetOptions(
            'sample|s=s'    => \$opt{sample},
            'debug|d'       => \$opt{debug},
            'help|h'        => \$opt{help},
            'man|m'         => \$opt{man}
    ) or pod2usage(2);

pod2usage(1) if $opt{help};
pod2usage(-verbose => 2) if $opt{man};

pod2usage("$0: No files given.")  if ((@ARGV == 0) && (-t STDIN));

$re = read_re($opt{sample});

say "DEBUG: search pattern is ($re)" if $opt{debug};

find_in_file($re);

sub find_in_file {
    my $search = shift;

    while( <> ) {
        chomp;
        next unless /$search/;
        say;
    }
}

sub read_re {
    my $filename = shift;

    open my $fh, '<', $filename
        or die "Couldn't open $filename";

    my @data = <$fh>;

    close $fh;

    chomp @data;

    my $re = join('|', @data);

    return $re;
}

__END__

=head1 NAME

file_in_file.pl - search strings of one file in another

=head1 SYNOPSIS

file_in_file.pl [options] -s sample.txt dbfile.txt

 Options:
    -s,--sample search pattern file
    -d,--debug  debug flag
    -h,--help   brief help message
    -m,--man    full documentation

=head1 OPTIONS

=over 4

=item B<-s,--sample>

Search pattern file

=item B<-d,--debug>

Print debug information.

=item B<-h,--help>

Print a brief help message and exits.

=item B<-m,--man>

Prints the manual page and exits.

=back

B<This program> searches for patterns from the B<sample> file in B<dbfile>

=cut
Polar Bear
  • 6,762
  • 1
  • 5
  • 12
  • 2
    the task is simple; the complex part is making it fast – ysth Apr 14 '20 at 22:44
  • @ysth -- well, perl somehow has to go through all the search patterns, either in a loop or via `pat1|pat2|pat3|....|patn`. I do not know how the latter form of match is implemented -- but I would expect that some effort was put into optimizing it, and I hope that it stops at the first match found. There is a chance that some optimization can be achieved if the patterns are sorted before `join('|',@patterns)` -- but that's not guaranteed. – Polar Bear Apr 14 '20 at 22:48
  • 1
    Doesn't work if the input contains any of `\.+*(){}[]` and maybe others. – ikegami Apr 15 '20 at 00:27