
I have a Perl script with two arrays, one with keys and one with strings, and I need to check whether each string has a match among the keys. The number of records is huge (in the millions), so I use Inline::C to speed up the search, but it is still taking hours to process the records.

--Perl part

# %h contains {"AAAAA1" => 1, "BBBBBB" => 1, "BB1234" => 1, "C12345" => 1, ... }
my @k = sort keys %h;
# @k contains ["AAAAA1", "BBBBBB", "BB1234", "C12345", ... ]
my @nn;
# @n contains [ "AAAAA1999", "AAAAABBB134", "D123edae", "C12345CCSAER" ]
# "AAAAA1" (from @k) can be found in "AAAAA1999" (in @n) = OK
foreach (@n) {
        my $res = array_search(\@k, $_);
        if ($res >= 0) {        # array_search returns the matching index, or -1 if no key matches
                $y++;
        } else {
                $z++;
                push @nn, $_;
        }
}

--C part

int fastcmp ( char *p1, char *p2 ) {
  while( *p1 ){
    char *a = p1, *b = p2;    
    if (*b != *a) return 0;
    ++p1; ++b;
  }
  return 1;
}

int array_search(AV *a1, SV *s1){
        STRLEN bytes1;
        char *p1,*p2,*n;
        long a1_size,i,c;
        a1_size = av_len(a1);
        p1 = SvPV(s1,bytes1);        
        for(i=0;i<=a1_size;++i){
            SV** elem = av_fetch(a1, i, 0);
            SV** elem_next = (i<a1_size-1)?av_fetch(a1, i+1, 0):elem;
            p2 = SvPV_nolen (*elem);
            n = SvPV_nolen (*elem_next);
            if (p1[0] == p2[0]) {
                if (fastcmp(p1,p2)>0) {
                    return i; 
                }
            }
            if ((p1[0] == p2[0]) && (p2[0] != n[0])) { return -1; }
        }
        return -1;
}

If somebody could help me optimize the search, that would be great. Thanks.

Note: I added comments to show what is inside each variable.

user2360915
  • use something like a table-lookup(hash-table), hashing will be much faster than searching for an element in a large array – tesseract Feb 21 '14 at 00:31
  • If you mean a hash of strings in Perl, I can't, as I need to look for substrings. In the case of a perfect match, I agree it is much, much faster. – user2360915 Feb 21 '14 at 00:37
  • If you're up to millions of records, maybe it's time to use a database for this, MySQL or SQLite. Then you can also assign indices and e.g. try optimizing things to make them faster; it's probably hard to say without knowing more. – Brandin Feb 21 '14 at 01:01
  • Brandin, thanks for trying to understand the problem. The data I am treating comes from files that are gzipped. Not much option for a database, as the script will run on a live system; I can't add anything new there except a script. Looks like you are almost saying it is normal that it is slow with millions of records :) – user2360915 Feb 21 '14 at 01:15
  • Your C code is rather obscure, but as it stands I believe you can replace `count += !*b` with just `return 1` as there is no point in finding *all* of the matches for a pair of strings. – Borodin Feb 21 '14 at 01:16
  • So you're checking whether any of the values in `@k` appear as substrings of any of the values in `@n`? What sizes are `@k` and `@n`? – Borodin Feb 21 '14 at 01:23
  • Your comment seems to be wrong. `AAAAA1999` appears in `@n` (not in `@k`) and `AAAAA1` appears in `@k` (not in `@n`). Is it the comment that is wrong or the data? – Borodin Feb 21 '14 at 01:29
  • Sorry, you are right. The comment is wrong, not the data. @k = 1284038 & @n = 361246 – user2360915 Feb 21 '14 at 01:35
  • I changed the count into a `return 1`. It is faster, but not enough: 15 min to treat 5% of the data. – user2360915 Feb 21 '14 at 01:50
  • @user2360915: Do the values in the array always *begin* with the hash keys, or can the keys appear *anywhere* in the strings? For instance, would `B12` match `BB1234`? – Borodin Feb 21 '14 at 02:40
  • nope, always begin with the key. – user2360915 Feb 21 '14 at 04:43

1 Answer


The implementation you have fails in many ways:

  • Fails for @a=chr(0xE9); utf8::upgrade($x=$a[0]); array_search(\@a, $x);
  • Fails for "abc"=~/(.*)/; array_search(["abc"], $1);
  • Fails for array_search(["a\0b"], "a\0c");

It also incorrectly assumes the strings are NUL-terminated, which can lead to a SEGFAULT when they aren't.


Your approach scans @k for each element of @n, but if you build a trie (as the following code effectively does), each element of @n can be checked in a single pass instead of being compared against every key.

my $alt = join '|', map quotemeta, keys %h;   # escape each key and join into one alternation
my $re = qr/^(?:$alt)/;                       # anchor at the start; 5.10+ compiles the alternation into a trie

my @nn = sort grep !/$re/, @n;                # elements of @n matched by no key
my $z = @nn;
my $y = @n - @nn;
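
For instance, with the sample data from the question, a small self-contained run looks like this ($y ends up as 2, $z as 2):

my %h = ("AAAAA1" => 1, "BBBBBB" => 1, "BB1234" => 1, "C12345" => 1);
my @n = ("AAAAA1999", "AAAAABBB134", "D123edae", "C12345CCSAER");

my $alt = join '|', map quotemeta, keys %h;
my $re = qr/^(?:$alt)/;

my @nn = sort grep !/$re/, @n;   # ("AAAAABBB134", "D123edae")
my $z = @nn;                     # 2
my $y = @n - @nn;                # 2  ("AAAAA1999" and "C12345CCSAER" start with a key)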

For example, if there are 1,000 elements in @n and 1,000 keys in %h, your solution does up to 1,000,000 comparisons and mine does 1,000.

Note that 5.10+ is needed for the regex optimisation of alternations into a trie. Regexp::List can be used on older versions.
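
For instance, on an older perl the alternation could be pre-optimised along these lines; this is only a rough, untested sketch and assumes Regexp::List's list2re method, so check the module's documentation before relying on it:

use Regexp::List;

# Build an optimised (trie-like) alternation out of the keys, then anchor it.
my $optimised = Regexp::List->new->list2re(keys %h);
my $re = qr/^(?:$optimised)/;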

A proper C implementation will be a little faster because you can do a trie search using a function that does just that rather than using the regex engine.
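
To illustrate what such a dedicated search does, here is a rough pure-Perl sketch of a prefix trie built from the keys (a C version would follow the same walk); it is meant to show the idea, not to be a tuned implementation:

# Build a trie out of the keys as nested hashes; '' marks "a key ends here".
my %trie;
for my $key (keys %h) {
    my $node = \%trie;
    $node = ($node->{$_} ||= {}) for split //, $key;
    $node->{''} = 1;
}

# Walk one string through the trie: succeed as soon as some key ends,
# fail as soon as no key continues with the current character.
sub has_key_prefix {
    my ($trie, $str) = @_;
    my $node = $trie;
    for my $ch (split //, $str) {
        return 1 if $node->{''};
        $node = $node->{$ch} or return 0;
    }
    return exists $node->{''};
}

my @nn = grep { !has_key_prefix(\%trie, $_) } @n;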

ikegami
  • Ikegami san, thanks. Do you really think using a regex with millions of records is efficient? I mean, joining the keys into a regex and then using grep looks to me like it is going to take a loooong time to give back a result. – user2360915 Feb 21 '14 at 06:36
  • Also, the above expression will return all the matches; I just want to stop at the first match. A trie search in C is probably my best option. – user2360915 Feb 21 '14 at 07:01
  • @user2360915: ikegami's method *does* do a trie search in C: that is what the regex engine does. But it checks only whether the keys match at the *start* of the values in `@n`, which is much faster than finding a match *anywhere*. That is why I asked my last question. – Borodin Feb 21 '14 at 09:19
  • If N is 1,000 and K is 1,000, your solution does an average of 500,000 comparisons and mine does 1,000. Yup, I think it's quite efficient. – ikegami Feb 21 '14 at 12:46
  • Your solution finds all matching Ns, so why do you say you just want the first? – ikegami Feb 21 '14 at 12:46
  • @Hynek -Pichi- Vychodil, the OP said the strings always begin with the key. Reverting edit. – ikegami Feb 21 '14 at 14:48
  • I see, it's the last comment. I had not noticed it. – Hynek -Pichi- Vychodil Feb 21 '14 at 21:14
  • Yes, the fastcmp function is not correct; it should look only from the beginning and not take all matches. I changed it but it is still taking a long time, i.e. one call to array_search takes 0.013 sec. So, multiplied by millions, it is obviously slow. – user2360915 Feb 24 '14 at 02:51
  • Re "the fastcmp function is not correct and should look only from the beginning" uh, that's what it does. `fastcmp` only matches from the beginning. (Well, only tries to match from the beginning, but it does `++p1` instead of `++a`, along with incorrectly assuming the strings are always going to be NUL-terminated.) – ikegami Feb 24 '14 at 03:04
  • @user2360915, So, is your question answered? If so, accept an answer. If not, what problems are you still facing? – ikegami Feb 24 '14 at 03:11
  • Still struggling with performance. If I use your regex, it returns all matching records (e.g. both "AAAAA" and "AAAAB" for a match with "AAA"), but I want only the first match, "AAAAA". How would you change your regex? – user2360915 Feb 24 '14 at 04:06
  • You can remove entries from the trie once you find a match for them. Or you can change the value for that key in the trie. – ikegami Feb 24 '14 at 04:15
  • Ok, good. That way I get an elapsed time of 7e-06 sec. I don't need C to get good performance. – user2360915 Feb 24 '14 at 04:44
  • What do you mean? You rebuild the regex every time you find a match? That's very expensive, but looks like it doesn't matter! – ikegami Feb 24 '14 at 05:23
  • No no, I built it just once. I meant that after following your recommendations, I got excellent performance without the need for C programming. – user2360915 Mar 07 '14 at 00:59
  • How did you remove from the trie without rebuilding regex? – ikegami Mar 07 '14 at 03:53
  • To build the regex with your idea: my $alt = join '|', map quotemeta, sort keys %h; my $re = qr/^($alt)/; Then, once a match is found, I set the key/value to something else; it's working perfectly (see the sketch below). – user2360915 Mar 07 '14 at 23:41
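
To make that last step concrete, here is a small sketch of the scheme described in the comments above (capture which key matched, then change its value in %h so it is treated as already used); whether a later record matching an already-used key should be counted as a miss is an assumption here:

my $alt = join '|', map quotemeta, sort keys %h;
my $re = qr/^($alt)/;                # capturing group: $1 is the key that matched

for my $str (@n) {
    if ($str =~ $re and $h{$1}) {    # first record matching this key
        $h{$1} = 0;                  # mark the key as already used
        $y++;
    } else {                         # no key matched, or the key was used already
        $z++;
        push @nn, $str;
    }
}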