4

I have array like this:

array('1224*', '543*', '321*' ...) which contains about 17,00 "masks" or prefixes.

I have a second array:

array('123456789', '123456788', '987654321' ....) which contain about 250,000 numbers.

Now, how can I efficiently match every number from the second array using the array of masks/prefixes?

[EDIT]

The first array contains only prefixes and every entry has only one * at the end.

mickmackusa
  • 43,625
  • 12
  • 83
  • 136
canni
  • 5,737
  • 9
  • 46
  • 68
  • Perhaps using a trie could work well in this situation. – mfonda Sep 20 '11 at 14:23
  • @mfonda my english isn't so good, I can't understand what can/should I use? – canni Sep 20 '11 at 14:38
  • see http://en.wikipedia.org/wiki/Trie – mfonda Sep 20 '11 at 14:40
  • 1
    17k masks, 250k numbers. In arrays? Wouldn't you be better of handling this in a different language than PHP, or even in a database? I can't imagine PHP will do this quickly, however you do it. – Berry Langerak Sep 20 '11 at 14:44
  • How large is the largest mask (in chars). How large is the largest number (in chars)?. Do you need all results at once? And: What have you tried so far and into which concrete problems did you run? – hakre Sep 20 '11 at 15:51
  • the largest mask have 6 chars (without the `*`) and all numbers are 9 chars long – canni Sep 20 '11 at 15:58
  • @BerryLangerak doing this just in a "hardcore" two `for` loops takes 3 seconds, on my laptop, and this will be used on a high-end server – canni Sep 20 '11 at 16:00

5 Answers5

4

Well, here's a solution:

Prelimary steps:

  1. Sort array 1, cutting off the *'s.

Searching:

  1. For each number in array 2 do
    1. Find the first and last entry in array 1 of which the first character matches that of number (binary search).
    2. Do the same for the second character, this time searching not the whole array but between first and last (binary search).
    3. Repeat 2 for the nth character until a string is found.

This should be O(k*n*log(n)) where n is the average number length (in digits) and k the number of numbers.


Basically this is a 1 dimensional Radix tree, for optimal performance you should implement it, but it can be quite hard.

orlp
  • 112,504
  • 36
  • 218
  • 315
  • Binary search might be a very good idea indead :) but will it handle situations that masks can be from 2, to 6 chars long? (test numbers are always 9 char long) – canni Sep 20 '11 at 14:24
  • @canni: Yes, assuming that the shortest match is always the best and that the wildcard always comes last (so `123*` but not `12*34`). – orlp Sep 20 '11 at 14:27
  • masks are always end with `*` maybe I wrote this wrong, array 1 contains only prefixes, and those prefixes are always unique ie. there is no overlapping (when `123*` => no prefix `1234*`) – canni Sep 20 '11 at 14:31
2

My two cents....

$s = array('1234*', '543*', '321*');
$f = array('123456789', '123456788', '987654321');

foreach ($f as $haystack) {
    echo $haystack."<br>";
    foreach ($s as $needle) {
        $needle = str_replace("*","",$needle);
        echo $haystack "- ".$needle.": ".startsWith($haystack, $needle)."<br>";
    }
}

function startsWith($haystack, $needle) {
    $length = strlen($needle);
    return (substr($haystack, 0, $length) === $needle);
}

To improve performance it might be a good idea to sort both arrays first and to add an exit clause in the inner foreach loop.

By the way, the startWith-function is from this great solution in SO: startsWith() and endsWith() functions in PHP

Community
  • 1
  • 1
Bjoern
  • 15,934
  • 4
  • 43
  • 48
  • This is a nice initial implementation, but `startsWith` will become way too slow, turning this into a `O(k*n*n)` algorithm. – orlp Sep 20 '11 at 14:29
  • I totally agree with you. There are obviously alot of ways to optimize this, and several different approaches for this kind of problem. I just wanted to show a basic approach for the issue at hand. – Bjoern Sep 20 '11 at 15:24
0

Although regex is not famous for being fast, I'd like to know how well preg_grep() can perform if the pattern is boiled down to its leanest form and only called once (not in a loop).

By removing longer masks which are covered by shorter masks, the pattern will be greatly reduced. How much will the reduction be? of course, I cannot say for sure, but with 17,000 masks, there are sure to be a fair amount of redundancy.

Code: (Demo)

$masks = ['1224*', '543*', '321*', '12245*', '5*', '122488*'];
sort($masks);
$needle = rtrim(array_shift($masks), '*');
$keep[] = $needle;
foreach ($masks as $mask) {
    if (strpos($mask, $needle) !== 0) {
        $needle = rtrim($mask, '*');
        $keep[] = $needle;
    }
}
// now $keep only contains: ['1224', '321', '5']

$numbers = ['122456789', '123456788', '321876543234567', '55555555555555555', '987654321'];

var_export(
    preg_grep('~^(?:' . implode('|', $keep) . ')~', $numbers)
);

Output:

array (
  0 => '122456789',
  2 => '321876543234567',
  3 => '55555555555555555',
)
mickmackusa
  • 43,625
  • 12
  • 83
  • 136
0

Another option would to be use preg_grep in a loop:

$masks = array('1224*', '543*', '321*' ...);
$data = array('123456789', '123456788', '987654321' ....);

$matches = array();
foreach($masks as $mask) {
    $mask = substr($mask, 0, strlen($masks) - 2); // strip off trailing *
    $matches[$mask] = preg_grep("/^$mask/", $data);
}

No idea how efficient this would be, just offering it up as an alternative.

Marc B
  • 356,200
  • 43
  • 426
  • 500
-1

Check out the PHP function array_intersect_key.

Tito
  • 71
  • 1
  • 4