18

I have 2 very large arrays (of size ~2,500,000). I need to find difference between these arrays. By difference I mean I need a resultant array with values that are in array 1 but not in array 2. I have used array_diff() but it takes more than half an hour!

The first array is coming from one DB and second array from another db. They aren't on the same Database server. The arrays aren't of same size. I am dealing with huge number of mobile numbers. I need to find out those mobile numbers which are in one list, but aren't in another list

arrays are normal arrays with numeric keys. Diff code is as follows:

$numbers_list = array_diff($numbers_list, $some_other_list);

Is there a better way of doing this? Please help.

Shiplu Mokaddim
  • 56,364
  • 17
  • 141
  • 187
Amar
  • 11,930
  • 5
  • 50
  • 73
  • 5
    Although I'm not familiar with the internals of PHP, it seems likely that `array_diff()` has gone through several rounds of optimization and is probably going to be faster than any general array-difference function that you roll on your own. If you gave us a better idea as to the structure of your data, there might be a quicker way to calculate differences for the particular arrays that you're looking at. Or, as previous commenters have stated, you might want to take the diff from the database first. –  Jan 11 '12 at 21:31
  • If its numbers (real numbers, i.e, interpretable as integers) you should be able to `sort` them both, and loop through array 1, advancing the pointer with `next` in array 2 as long as value-in-2 < value-in-1, and so on. For this particular case, could be a lot faster. – Wrikken Jan 11 '12 at 21:50
  • 1
    In that case, it might be easier to dump the data from each db to a text file, and compare using a command-line tool like diff rather than trying it in PHP; or to do a dump from one db, load to a temporary table on the second, and use SQL to do the comparison. – Mark Baker Jan 11 '12 at 21:50
  • If it's something you only need to do once, then copy one of the tables over to a temp table in the other database, do the query that gives you the diff, and be done with it. –  Jan 11 '12 at 21:51
  • can you use federated tables? that way you could join both tables in one db and call the result with a distinct. – Gordon Jan 11 '12 at 22:17
  • @Gordon: federated tables are sloooow in a join, at least in MySQL in my experience. It has troubles using indexes too. – Wrikken Jan 11 '12 at 22:19
  • @Wrikken how about UNION then? the requirement right now is < 25 minutes :) – Gordon Jan 11 '12 at 22:22
  • @Gordon: ugh, I'll whip out a test-db for that... – Wrikken Jan 11 '12 at 22:30
  • Thanks guys. It has come down to 2 and half minutes after I copied the second array into a temp table and got the difference through a SQL query. – Amar Jan 11 '12 at 22:50
  • The algorithm took **11 seconds**. Check the update in answer. – Shiplu Mokaddim Jan 11 '12 at 22:51
  • oops, it took 4 minutes 12 seconds to be exact. But nevertheless, it is better than before. Since it is kind of report generation task, one can wait for 5 minutes for it :) – Amar Jan 11 '12 at 22:55

2 Answers2

30

This is the simple algorithm.

  1. Flip 1st array. Values will become keys. So repeated values will be discarded.
  2. Flip 2nd array (optional)
  3. Check for each element in 2nd array if it exists in 1st array.

As you are working with very large arrays, it'll consume a lot of memory.

Here is my implementation,

$a = file("l.a"); // l.a is a file contains 2,500,000 lines
$b = file("l.b");

function large_array_diff($b, $a){
    // Flipping 
    $at = array_flip($a);
    $bt = array_flip($b); 
    // checking
    $d = array_diff_key($bt, $at);

    return array_keys($d);   
}

I ran it using 4G memory limit. 3G also works. Just tested.

$ time php -d memory_limit=4G diff_la.php

It took about 11 seconds!.

real    0m10.612s
user    0m8.940s
sys     0m1.460s

UPDATE

Following code runs 2x faster than large_array_diff function stated above.

function flip_isset_diff($b, $a) {
    $at = array_flip($a);
    $d = array();
    foreach ($b as $i)
        if (!isset($at[$i])) 
            $d[] = $i;

    return $d;
}

Because it does not call array_flip (1 time), array_diff_key and array_keys. Lots of CPU cycles are saved due to this.

Shiplu Mokaddim
  • 56,364
  • 17
  • 141
  • 187
  • There are no duplicates in these arrays. – Amar Jan 11 '12 at 21:49
  • 1
    Even there is no duplicates, transposing will give ability to search using a *hash*. – Shiplu Mokaddim Jan 11 '12 at 21:59
  • Hmm, flipping it & array_diff_key is about 5 times faster then my method in the other answer, and about 30 times faster then a plain diff in testing here... – Wrikken Jan 11 '12 at 22:25
  • I did try the above mentioned steps but the time taken was above 20 minutes, so I closed the webpage in between to try for other alternatives. – Amar Jan 11 '12 at 22:52
  • 2
    Dont run it in the webpage, Run in console. – Shiplu Mokaddim Jan 11 '12 at 22:54
  • Thanks @Shiplu It came to be 40 seconds over-all and that is Just Awesome. Thanks again :) I might be doing some blunder earlier, so I re-wrote the complete code this time. – Amar Jan 11 '12 at 23:19
  • Your previous code runs 4 seconds faster than the current one. By previous I mean the following: public static function large_array_diff($b, $a){ // transposing $at = array(); foreach ($a as $i) { $at[$i] = 1; } // checking $d = array(); foreach ($b as $i) { if (!isset($at[$i])) { $d[] = $i; } } return $d; } – Amar Jan 11 '12 at 23:24
  • @bluekant this updated code runs faster. It takes 9 seconds in my machine. – Shiplu Mokaddim Jan 12 '12 at 06:03
6

OK, updated, casting to string for good measure (it makes a LOT of difference if we could use integers instead of strings...):

<?php
$start = microtime(true);
echo 'creating' . PHP_EOL;
$array1 = array();
$array2 = array();
for ($i = 0;$i < 2000000;$i++) {
    $array1[] = (string)rand(100000000, 999999999);
    $array2[] = (string)rand(100000000, 999999999);
}
echo (microtime(true) - $start) . PHP_EOL;
$start = microtime(true);
echo 'key diff flip' . PHP_EOL;
$array1f = array_flip($array1);
$array2f = array_flip($array2);
echo (microtime(true) - $start) . PHP_EOL;
$start = microtime(true);
echo 'key diff' . PHP_EOL;
$diff = array_diff_key($array1f, $array2f);
echo (microtime(true) - $start) . PHP_EOL;
$start = microtime(true);
echo 'sorting' . PHP_EOL;
$start = microtime(true);
sort($array1);
sort($array2);
$notin2 = array();
reset($array2);
echo (microtime(true) - $start) . PHP_EOL;
echo 'comparing' . PHP_EOL;
$start = microtime(true);
foreach ($array1 as $value) {
    while (current($array2) < $value) {
        if (next($array2) == false) break;
    }
    if (current($array2) != $value) $notin2[] = $value;
}
echo (microtime(true) - $start) . PHP_EOL;
echo 'plain diff' . PHP_EOL;
$start = microtime(true);
$diff = array_diff($array1, $array2);
echo (microtime(true) - $start) . PHP_EOL;
?>

Strings

creating
8.66658186913
key diff flip
3.07359004021
key diff
1.48775887489
sorting
48.4047858715
comparing
9.41756916046
plain diff
19.21976614

real    1m37.574s
user    1m34.970s
sys     0m1.676s

Integers (removed (string)):

creating
4.69975805283
key diff flip
2.5843539238
key diff
1.4868490696
sorting
15.2628200054
comparing
5.62516498566
plain diff
101.688895941

real    2m18.090s
user    2m15.112s
sys     0m1.356s

Much to my surprise..... array_diff hates integers it seems...

Shiplu Mokaddim
  • 56,364
  • 17
  • 141
  • 187
Wrikken
  • 69,272
  • 8
  • 97
  • 136