2

I am writing a script that will repeatedly search a large group of arrays (40,000) and merge all of the arrays that have at least one common element. I have tried array_intersect(), but I found that it was too slow for this application. Is there another function that is faster and simply returns true if at least one element is shared between two arrays?

It is my assumption that array_intersect() is slowed down by the fact that both arrays are reviewed completely and the common values are grouped together and returned. It would be faster to exit as soon as a single match is found.

To clarify: all of the arrays are held within a single master array (a 2D array). If it is discovered that the arrays stored at $master[231] and $master[353] both contain the element 124352354, they should be merged into a new array and the result stored in a different 2D array designed to hold merged results.
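
For example, the merge step might look roughly like this (a sketch only; the overlap check here uses array_intersect just for illustration, and $merged is an illustrative name for the results array):

// Sketch: merge two source arrays that share at least one value.
if (array_intersect($master[231], $master[353])) {
    $merged[] = array_unique(array_merge($master[231], $master[353]));
}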

Current code:

$test = array_intersect($entry, $entry2);
if($test){

...

}

A better method is:

$test = false;
foreach($entry as $item){
     if(in_array($item, $entry2)){
         $test = true;
         break;
     }
}
if($test){

...

}

and another improvement is to use isset() with array_flip() instead of in_array():

$test = false;
$flipped = array_flip($entry2);
foreach($entry as $item){
     if(isset($flipped[$item])){
         $test = true;
         break;
     }
}
if($test){

...

}
Hoytman
  • I would go back a step or three: why do you have such large arrays? –  Jul 16 '14 at 21:07
  • It's going to be O(m+n) if you're using arrays. What are you storing in this array? Can you change data structures? – Sean Bright Jul 16 '14 at 21:08
  • The arrays store company contact phone numbers (inverted so that the key is a phone number and the value is an id number). This is an attempt at finding duplicate company entries by assuming that if two company entries share a common phone number, they are probably the same company. – Hoytman Jul 16 '14 at 21:12
  • 1
    Sounds like you're trying to reimplement SQL operations in PHP. [XY Problem](http://meta.stackexchange.com/a/66378) – Sammitch Jul 16 '14 at 21:23
  • You mean you want to get a merged array of arrays with **common elements** or a merged **array of arrays with common elements**? – Chibueze Opata Jul 16 '14 at 21:29
  • @Sammitch is 100% right –  Jul 16 '14 at 21:36

5 Answers

3

Assuming you want to just discover whether two arrays have a common element, you could create your own getIntersect function, which would be faster than using array_intersect since it returns immediately on the first match.

function getIntersect($arr1, $arr2)
{
    foreach($arr1 as $val1)
    {
        foreach($arr2 as $val2)
        {
            if($val1 == $val2)
            { return true; }
        }
    }
    return false;
}

Assuming that what you really want to find is arrays in which at least one element occurs more than once.

Then you could easily have

function hasCommonElements($arr)
{
    // Assumes $arr uses sequential numeric keys 0..n-1.
    $count = count($arr);
    for($i = 0; $i < $count; $i++)
    {
        $val = $arr[$i];
        unset($arr[$i]);
        if(in_array($val, $arr))
        {
            return true;
        }
    }
    return false;
}

And you could easily get an array of all arrays containing common elements using array_filter:

array_filter($my40k, "hasCommonElements");

Assuming that what you really want to do is find all arrays which have at least one value in common with some other array in the set, you have to do a higher-level array filter.

$mybigarray;//your big array

function hasIntersects($arr)
{
    global $mybigarray; // the big array is not otherwise in scope inside the function
    foreach($mybigarray as $other)
    {
        // Note: this will also match $arr against its own entry in $mybigarray
        // unless that entry is skipped.
        if(getIntersect($arr, $other))
        {
            return true;
        }
    }
    return false;
}

Then call our filter monster

array_filter($mybigarray, "hasIntersects");

Disclaimer: None of this stuff is tested. Check for typos

Chibueze Opata
  • getIntersect() can be used to accomplish the task, but I would like to use something faster. I have found that foreach(in_array()) is slightly faster, but foreach(foreach()) is not. – Hoytman Jul 17 '14 at 13:54
2

If your arrays contain only values that are also valid keys (integers, ...), it can be better to flip the arrays (swap keys and values), which technically means building an index, and then search on the keys. Examples:

function haveCommonValues1($a1, $a2) {
    $a1_flipped = array_flip($a1);
    foreach ($a2 as $val) {
        if (isset($a1_flipped[$val])) {
            return true;
        }
    }
    return false;
}

or if you need the intersection:

function haveCommonValues2($a1, $a2) {
    $a1_flipped = array_flip($a1);
    $a2_flipped = array_flip($a2);
    return array_intersect_key($a1_flipped, $a2_flipped);
}

On some test arrays I got the following results. This, however, depends highly on the array structures, so you need to test it and compare the times.

array_intersect   : 0m1.175s
haveCommonValues1 : 0m0.454s
haveCommonValues2 : 0m0.492s
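
For reference, timings like the ones above can be reproduced with a small harness roughly like this sketch (the sample data and iteration count are assumptions, not the original test setup):

// Rough micro-benchmark sketch; adjust the test data and loop count to your case.
$a = range(1, 5000);
$b = range(4000, 9000);

$start = microtime(true);
for ($i = 0; $i < 1000; $i++) { array_intersect($a, $b); }
printf("array_intersect   : %.4fs\n", microtime(true) - $start);

$start = microtime(true);
for ($i = 0; $i < 1000; $i++) { haveCommonValues1($a, $b); }
printf("haveCommonValues1 : %.4fs\n", microtime(true) - $start);
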
steffen
  • 1
    isset() is much faster than in_array(). It more than offsets the added time cost of flipping the array. Thanks! – Hoytman Jul 17 '14 at 14:12
  • 1
    @Hoytman That's correct. The common way to search is `array_flip()` and `isset()`. But this only works with arrays that have valid array-keys as values as said... – steffen Jul 17 '14 at 14:14
1

Busting-up pre-conceived notions all around

I both proved my point, and undermined it, in one answer. TL;DR = Do this in SQL.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

PHP built-ins are code-level cleaner, but surprisingly inefficient.

If you are going by key, your best bet would be the tried and true for-loop:

$array1 = array( ... );
$array2 = array( ... );

$match = FALSE;

foreach($array1 as $key => $value) {
    if(isset($array2[$key]) && $array2[$key] == $value) {
        $match = TRUE;
        break;
    }
}

It may seem counter-intuitive, but at the execution level, no matter what, you have to iterate over every item in at least one array. You can keep this shorter by only iterating over the shorter array.
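
For instance (a sketch, reusing the $array1/$array2 names from the snippet above), you could swap the two up front so the loop always runs over the smaller array:

// Sketch: make $array1 the shorter array so the early-exit loop does less work.
if (count($array1) > count($array2)) {
    list($array1, $array2) = array($array2, $array1);
}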

array_intersect() keeps going for every key in both arrays, so, although it looks crazy, you just have to do it the "dirty" way.

If the data comes from a database, it would actually be faster to have the SQL engine do the lifting. A simple join with a limit 1 will give you a flag to know if there are duplicates, and then you can execute another query to get the merged data (dynamically generate the query against multiple tables or source queries if you need to do this on more than one pair).

SQL will be orders of magnitude faster than any higher-level language, like PHP, for doing this. I don't care if you already have the arrays in memory; executing the query and loading the new array from the DB will be faster than trying to do the compare and then merge in the resident memory of the app...

Again, with counter-intuitive things...

So, this is interesting:

I made a test script for this at http://pastebin.com/rzeQnyu2

With the matcher (phone number) at 5 digits, the foreach loop consistently executed in 1/100 the time of the other option. HOWEVER, up that to 10 digits (removing all possibility of collision) and the foreach jumps to 36x slower than the other option.

# 5 digit key (phone number)
Match  found
For Each completed in 0.0001
Intersect Key completed in 0.0113

# 10 digit key
Match not found
For Each completed in 0.2342
Intersect Key completed in 0.0064

I wonder why the second option (which likely had larger arrays) was faster for Intersect than the smaller one... WEIRD...

This is because, while the intersect always iterates over all items, the foreach loop wins when it can exit early, but looks to be VERY slow if it doesn't get that opportunity. I would be interested in the deeper technical reasons for this.

Either way - in the end, just do it in SQL.

Mike
  • Your code example would demonstrate that the arrays have a shared key, not a shared value. – kojiro Jul 16 '14 at 21:20
  • The user is checking against the phone number, which they stated is the key as well. I will modify the code to do a secondary value check to be super super sure. – Mike Jul 16 '14 at 21:30
  • Better use `array_key_exists()` - `isset()` will return `false`, if the key exists, but the value is set to `NULL`. – dognose Jul 16 '14 at 21:46
  • This sounds like a good idea, but when I tested it, I was surprised by the times. Your method only beats the array_intersect method when the arrays are small (only 2 or 3 elements). I think that the function calls add a lot of overhead. – Hoytman Jul 16 '14 at 21:47
  • @Hoytman - Yeah, I am running some tests on 3v4l.org, and bugger if I can't nail this down. I really hate PHP's array built-ins, because right when I think I have discovered the trick of "it works here but not here", it does the opposite. I really just wish they were more of a clear winner... 3v4l keeps blocking me, so I will have to do this on my own server... No fancy OpCodes for me I guess.. :-D – Mike Jul 16 '14 at 21:57
0

array_intersect has a runtime of O(n * log(n)) because it uses a sorting algorithm before the comparison itself. Depending on your input you can improve on that in many different ways (e.g. if you have integers from a small range you may implement the algorithm using counting sort).
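
A rough sketch of that counting-style idea (the assumption here, purely for illustration, is that all values are non-negative integers below some known bound):

// Sketch: linear-time overlap test for small non-negative integer values.
function shareElementCounting(array $a, array $b, $maxValue = 1000000)
{
    $seen = array_fill(0, $maxValue + 1, false);
    foreach ($a as $v) {
        $seen[$v] = true;
    }
    foreach ($b as $v) {
        if ($seen[$v]) {
            return true;
        }
    }
    return false;
}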

You can find an example of those optimizations right here or here. A possible solution where you won't need sorting is posted in this thread. It also runs in linear time, so I guess this is what you are looking for.

Daniel K.
  • 1
    The only problem with `array_intersect` is that it will compute the complete intersection, which does not sound like what the asker desires. They simply desire a flag as to whether or not there were any matches. – Mike Jul 16 '14 at 21:28
0
SELECT *
FROM contacts t1 INNER JOIN contacts t2
  ON t1.phone = t2.phone
  AND t1.AccountID < t2.AccountID

Also, if your system may ever grow to include international phone numbers, you should store them as a string type. There are countries, I believe in Europe, that use leading zeroes in their phone numbers, and you cannot properly store them with a numeric type.

edit

The below query will return all instances of phone numbers used multiple times with no duplicate rows no matter how many accounts are sharing a phone number:

SELECT DISTINCT t1.AccountID, t1.phone
FROM contacts t1 INNER JOIN contacts t2
  ON t1.phone = t2.phone
  AND t1.AccountID != t2.AccountID
ORDER by t1.phone

I'd include a SQLfiddle, but it seems to be broken atm. This is the schema/data I used as a test:

CREATE TABLE IF NOT EXISTS `contacts` (
  `AccountID` int(11) NOT NULL,
  `phone` varchar(32) NOT NULL,
  KEY `aid` (`AccountID`),
  KEY `phn` (`phone`)
);

INSERT INTO `contacts` (`AccountID`, `phone`) VALUES
    (6, 'b'),
    (1, 'b'),
    (1, 'c'),
    (2, 'd'),
    (2, 'e'),
    (3, 'f'),
    (3, 'a'),
    (4, 'a'),
    (5, 'a'),
    (1, 'a');
Sammitch
  • Why are there 3 lists? Also, the arrays are built from a table with the following format: `AccountID` int(11), `phone` bigint(20). Each AccountID can have multiple phone numbers, and each phone number can have (problematically) multiple AccountIDs. – Hoytman Jul 17 '14 at 14:19
  • @Hoytman weird, I could have sworn the question said 3 lists. One sec, I'll rewrite this. – Sammitch Jul 17 '14 at 17:20
  • Actually, there is only one table. There is only one list, and each list element needs to be compared against all the other elements in the same list. Is it possible to INNER JOIN a table to itself? How would that react if there are three or more entries with the same phone number? Thanks for the revisions so far! – Hoytman Jul 17 '14 at 18:07
  • Joining a table to itself is exactly what is happening in this query, the only requirement is to assign each instance of the table a distinct alias, eg: `t1` and `t2`. I went back to the drawing board and tweaked the query and found a method that will not give duplicate results for phone numbers used three or more times. – Sammitch Jul 17 '14 at 23:37