389

After using PHP for a while now, I've noticed that not all built-in PHP functions are as fast as expected. Consider these two possible implementations of a function that determines whether a number is prime, using a cached array of primes.

//very slow for large $prime_array
$prime_array = array( 2, 3, 5, 7, 11, 13, /* .... */ 104729 /* , ... */ );
$numbers_to_check = array( /* the numbers to test for primality */ );
$result_array = array();
foreach( $numbers_to_check as $number ) {
    $result_array[$number] = in_array( $number, $prime_array );
}

//speed is much less dependent on size of $prime_array, and runs much faster.
$prime_array = array( 2 => NULL, 3 => NULL, 5 => NULL, 7 => NULL,
                      11 => NULL, 13 => NULL, /* .... */ 104729 => NULL /* , ... */ );
$result_array = array();
foreach( $numbers_to_check as $number ) {
    $result_array[$number] = array_key_exists( $number, $prime_array );
}

This is because in_array is implemented with a linear search, O(n), which slows down linearly as $prime_array grows, whereas array_key_exists is implemented with a hash lookup, O(1), which will not slow down unless the hash table gets extremely populated (in which case it degrades to O(n)).
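
To make the gap visible, here's a minimal benchmark sketch (the array size and loop count are arbitrary illustrative choices):

<?php
$n      = 100000;
$values = range( 0, $n - 1 );               // lookup by value
$keys   = array_fill_keys( $values, TRUE ); // same data, values moved into keys

$start = microtime( TRUE );
for( $i = 0; $i < 1000; $i++ ) {
    in_array( $n - 1, $values );            // O(n): linear scan to the end
}
printf( "in_array:         %.4fs\n", microtime( TRUE ) - $start );

$start = microtime( TRUE );
for( $i = 0; $i < 1000; $i++ ) {
    array_key_exists( $n - 1, $keys );      // O(1)-ish: single hash lookup
}
printf( "array_key_exists: %.4fs\n", microtime( TRUE ) - $start );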

So far I've had to discover the big-Os via trial and error, and occasionally by looking at the source code. Now for the question...

Is there a list of the theoretical (or practical) big O times for all* the built-in PHP functions?

*or at least the interesting ones

For example, I find it very hard to predict the big O of the following functions because their possible implementations depend on unknown core data structures of PHP: array_merge, array_merge_recursive, array_reverse, array_intersect, array_combine, str_replace (with array inputs), etc.

pbarney
Kendall Hopkins
  • Totally off topic but, 1 is not prime. – Restore the Data Dumps Mar 18 '10 at 23:13
  • Arrays in PHP are hashtables. That should tell you everything you need to know. Searching for a key in a hashtable is O(1). Searching for a value is O(n) -- which you can't beat on an unsorted set. Most of the functions you're curious about are probably O(n). Of course, if you really want to know, you can read the source: http://cvs.php.net/viewvc.cgi/php-src/ext/standard/array.c?revision=1.308.2.21.2.37.2.57&view=markup – Frank Farmer Mar 18 '10 at 23:19
  • For the record, the fastest implementation of what you're trying to do would be to (instead of using NULL for your values) use `true` and then test for presence using `isset($prime_array[$number])`. If I remember correctly, it's on the order of hundreds of times faster than the `in_array` function. – mattbasta Mar 19 '10 at 01:50
  • mattbasta: if what you say is true then I just learned the coolest thing I've learned all week. – echo Mar 19 '10 at 04:46
  • @mattbasta wouldn't it still have the same big-O, but just with a smaller coefficient? – Kendall Hopkins Mar 19 '10 at 05:47
  • @Kendall Not really sure, but I've used it and it's CRAZY fast. – mattbasta Mar 19 '10 at 19:59
  • @mattbasta According to my tests, the only reason that isset is faster is because it's a language construct and there is a bit of overhead for array_key_exists. This overhead *according to my test http://pastebin.com/E9WtuzPb* is only ~50% (the difference between 0.0000006 and 0.0000004). I believe the reason isset can be orders of magnitude faster is because it can cache lookups when the key is hardcoded, because it's a language construct. I think I'll still stick with array_key_exists because it doesn't have the NULL gotcha. – Kendall Hopkins Mar 20 '10 at 16:44
  • 3
    The Big O notation is not about speed. It’s about limiting behavior. – Gumbo Mar 20 '10 at 17:22
  • 3
    @Kendall I'm not comparing to `array_key_exists`, I'm comparing to `in_array`. `in_array` iterates each item in the array and compares the value to the needle which you pass to it. If you flip the values to the key (and just replace each of the values with a dummy value like `true`, using `isset` is many many times faster. This is because the keys of an array are indexed by PHP (like a hashtable). Consequently, searching an array in this manner can have a significant improvement in speed. – mattbasta Mar 20 '10 at 17:53
  • Interesting! So a quick conversion to the 2nd, faster form is to use $prime_array = array_fill_keys( $prime_array, null ). This turns the values into keys, and now we can use isset() instead of in_array(). From http://stackoverflow.com/questions/10641865/php-array-turning-array-values-into-keys – Jorge Orpinel Pérez Oct 21 '14 at 21:14

4 Answers

746

Since it doesn't seem like anyone has done this before, I thought it'd be a good idea to have it for reference somewhere. I've gone through and, either via benchmark or code-skimming, characterized the array_* functions. I've tried to put the more interesting Big-Os near the top. This list is not complete.

Note: All the Big-Os were calculated assuming a hash lookup is O(1), even though it's really O(n). The coefficient of the n is so low that the RAM overhead of storing a large enough array would hurt you before the characteristics of the lookup Big-O would start taking effect. For example, the difference between a call to array_key_exists at N=1 and N=1,000,000 is a ~50% time increase.

Interesting Points:

  1. isset/array_key_exists is much faster than in_array and array_search
  2. +(union) is a bit faster than array_merge (and looks nicer). But it does work differently, so keep that in mind (see the sketch after this list).
  3. shuffle is on the same Big-O tier as array_rand
  4. array_pop/array_push is faster than array_shift/array_unshift due to the re-index penalty (see the sketch under "Queue functions" below)
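
To illustrate point 2, a quick sketch of how + and array_merge treat colliding keys differently (the sample arrays are arbitrary):

$a = array( 'x' => 1, 0 => 'a' );
$b = array( 'x' => 2, 0 => 'b', 1 => 'c' );

print_r( array_merge( $a, $b ) );
// array( 'x' => 2, 0 => 'a', 1 => 'b', 2 => 'c' )
// right side wins on 'x'; numeric keys get renumbered

print_r( $a + $b );
// array( 'x' => 1, 0 => 'a', 1 => 'c' )
// left side wins on colliding keys; numeric keys are preserved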

Lookups:

array_key_exists O(n) but really close to O(1) - this is because of linear scanning through collision chains, but because the chance of collisions is very small, the coefficient is also very small. I find it helpful to treat hash lookups as O(1) to give a more realistic big-O. For example, the difference between N=1000 and N=100000 is only about a 50% slowdown.

isset( $array[$index] ) O(n) but really close to O(1) - it uses the same lookup as array_key_exists. Since it's a language construct, it will cache the lookup if the key is hardcoded, resulting in a speed-up in cases where the same key is used repeatedly.

in_array O(n) - this is because it does a linear search through the array until it finds the value.

array_search O(n) - it uses the same core function as in_array but returns the key of the matched value instead of a boolean.
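
A quick sketch of how these four lookups differ in what they return (note the NULL gotcha with isset, also covered in the answers below):

$primes = array( 2 => NULL, 3 => NULL, 5 => NULL );

var_dump( array_key_exists( 3, $primes ) );      // bool(true)  - key present
var_dump( isset( $primes[3] ) );                 // bool(false) - key present but value is NULL
var_dump( in_array( 3, $primes ) );              // bool(false) - 3 is a key here, not a value
var_dump( array_search( 3, array( 2, 3, 5 ) ) ); // int(1)      - the key of the matched value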

Queue functions:

array_push O(∑ var_i, for all i)

array_pop O(1)

array_shift O(n) - it has to reindex all the keys

array_unshift O(n + ∑ var_i, for all i) - it has to reindex all the keys
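
A minimal sketch of the re-index penalty (the size is an arbitrary choice, kept modest since the array_shift loop is O(n^2) overall):

$n = 10000;

$a = range( 0, $n - 1 );
$start = microtime( TRUE );
while( $a ) {
    array_pop( $a );      // O(1): drops the last element, no re-indexing
}
printf( "array_pop:   %.3fs\n", microtime( TRUE ) - $start );

$a = range( 0, $n - 1 );
$start = microtime( TRUE );
while( $a ) {
    array_shift( $a );    // O(n): every remaining key gets re-indexed
}
printf( "array_shift: %.3fs\n", microtime( TRUE ) - $start );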

Array Intersection, Union, Subtraction:

array_intersect_key if intersection is 100%, O(Max(param_i_size)*∑param_i_count, for all i); if intersection is 0%, O(∑param_i_size, for all i)

array_intersect if intersection is 100%, O(n^2*∑param_i_count, for all i); if intersection is 0%, O(n^2)

array_intersect_assoc if intersection is 100%, O(Max(param_i_size)*∑param_i_count, for all i); if intersection is 0%, O(∑param_i_size, for all i)

array_diff O(∏ param_i_size, for all i) - that's the product of all the param_sizes

array_diff_key O(∑ param_i_size, for i != 1) - this is because we don't need to iterate over the first array.

array_merge O( ∑ array_i, i != 1 ) - doesn't need to iterate over the first array

+ (union) O(n), where n is the size of the 2nd array (i.e. array_first + array_second) - less overhead than array_merge since it doesn't have to renumber

array_replace O( ∑ array_i, for all i )
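
One practical consequence of the above: when the values are scalars, a value-based operation can often be rewritten as a much cheaper key-based one via array_flip (itself O(n)). A sketch:

$a = range( 0, 9999 );
$b = range( 5000, 14999 );

// value-based diff (see Big-O above)
$slow = array_diff( $a, $b );

// key-based diff: two O(n) flips plus hash lookups
$fast = array_keys( array_diff_key( array_flip( $a ), array_flip( $b ) ) );

// $slow and $fast hold the same values here (0 .. 4999)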

Random:

shuffle O(n)

array_rand O(n) - Requires a linear poll.

Obvious Big-O:

array_fill O(n)

array_fill_keys O(n)

range O(n)

array_splice O(offset + length)

array_slice O(offset + length) or O(n) if length = NULL

array_keys O(n)

array_values O(n)

array_reverse O(n)

array_pad O(pad_size)

array_flip O(n)

array_sum O(n)

array_product O(n)

array_reduce O(n)

array_filter O(n)

array_map O(n)

array_chunk O(n)

array_combine O(n)

I'd like to thank Eureqa for making it easy to find the Big-O of the functions. It's an amazing free program that can find the best fitting function for arbitrary data.

EDIT:

For those who doubt that PHP array lookups are O(N), I've written a benchmark to test that (they are still effectively O(1) for most realistic values).

[graph: PHP array lookup time vs. array size]

$tests = 1000000;
$max = 5000001;


for( $i = 1; $i <= $max; $i += 10000 ) {
    //create lookup array
    $array = array_fill( 0, $i, NULL );

    //build test indexes
    $test_indexes = array();
    for( $j = 0; $j < $tests; $j++ ) {
        $test_indexes[] = rand( 0, $i-1 );
    }

    //benchmark array lookups
    $start = microtime( TRUE );
    foreach( $test_indexes as $test_index ) {
        $value = $array[ $test_index ];
        unset( $value );
    }
    $stop = microtime( TRUE );
    unset( $array, $test_indexes, $test_index );

    printf( "%d,%1.15f\n", $i, $stop - $start ); //time per 1mil lookups
    unset( $stop, $start );
}
Derek Pollard
Kendall Hopkins
  • Are you sure hash lookups are O(n)? As I understand, PHP uses strings and integers only for hash keys, so could collisions not be handled using some log n data structure (for example a binary search tree)? – Cam Jun 10 '11 at 22:23
  • 1
    @Cam Before I address your first question, I'll talk about how indexes are stored. First off, indexes are stored as *strings only* and magically typecast out to integers if they fit integer form. For example all of the follow keys in the array will look like ints if you check their type: `array( 1 => NULL, "1" => NULL, "1000" => NULL, "-100" => NULL )`, but these are strings: `array( "01" => NULL, "--1" => NULL, "1 " => NULL, " 1" => NULL )`. – Kendall Hopkins Jun 11 '11 at 03:12
  • 1
    @Cam (continue) Because of this, there is only 1 type to hash, so your right it could have a O(Log(n)). In practices though, hashing can be faster for small (or even fairly large) values of N, even if it comes at the price of linear polling in the case of collisions. As I said in the post the difference between an array of N=1 and N=1,000,000 is only 2x slower, so the hashing function is obviously large enough to eat of most of the time in most cases. **But to answer your first question**, PHP hash lookups are technically O(n), you'll even see it though, because you'll run out of ram first. – Kendall Hopkins Jun 11 '11 at 03:15
  • @Cam I've updated the post w/ a lookup benchmark graph that shows it's `O(N)`. – Kendall Hopkins Jun 11 '11 at 05:18
  • 6
    @Kendall: Thanks! I did a bit of reading and it turns out PHP uses 'nested' hashtables for collisions. That is, instead of a logn structure for collisions it simply uses another hashtable. And I do understand that practically speaking PHP hashtables give O(1) performance, or at least O(1) on average - that's what hashtables are for. I was just curious as to why you said they are "really O(n)" and not "really O(logn)". Great post by the way! – Cam Jun 11 '11 at 08:32
  • 2
    @Cam Big-O is the upper bounds of function when N->infitity. While the function has `O(1)` and `O(log(N))` during the "fillup up stages" it eventually stabilizes out to `O(N)`. – Kendall Hopkins Jun 11 '11 at 16:09
  • 2
    @Kendall: In the worst case it's true that it will have O(n) performance. But I think the average-case performance is still O(1). To see this: When a collision occurs, a new hashtable is created for all the values at that key, and the keys are rehashed there. It is very unlikely that collisions will occur there again. As we add hashtables at deeper and deeper depths to deal with collisions, we're actually talking about a smaller and smaller subset of the input, since these nested collisions will happen very rarely. So the average-case runtime is O(1), and not O(n) with a small constant. – Cam Jun 11 '11 at 20:17
  • This is how I understand it at least - if I'm incorrect I would really like to understand why it is O(n) even in the average case. – Cam Jun 11 '11 at 20:21
  • 1
    @Cam I think your incorrection on how PHP's `array` is implemented. PHP's array (or hash) is implemented using [chaining](http://en.wikipedia.org/wiki/Hash_table#Separate_chaining). Which means while your filling up the first layer, it's basically O(1) due to the expensive hashing overhead, but once the first layer is filled, you have to use linear polling (walking a list) to reach the collision. This too is also insignificant compared to the hashing overhead for any realistic value `N<10000000` (why it's O(1) in practice). If your still unclear read more of the wiki page. – Kendall Hopkins Jun 11 '11 at 23:06
  • 12
    Time complexities should be included with the documentation! Choosing the right function can save you so much time, or tell you to avoid doing what you planned to :p Thanks for this list already! – Samuel Jun 09 '12 at 23:38
  • 1
    Quick note: isset is an operator, so it is expected (many blog posts and documented too afair). key_exists only does a hash lookup, which is much faster than searching all the array elems. – Pierre Jan 07 '13 at 20:10
  • 50
    I know this is old ... but what? That curve doesn't show O(n) at all, it shows O(log n), http://en.wikipedia.org/wiki/Logarithm. Which is also accurate with what you would expect for nested hash-maps. – Andreas May 05 '13 at 14:58
  • 8
    What is the Big-O of unset on an element of an array? – Chandrew Apr 30 '14 at 16:51
  • 16
    While hashtables indeed have worst-case O(n) lookup complexity, the average case is O(1) and the particular case your benchmark is testing is even *guaranteed* O(1), as it's a zero-based, continuous, numerically-indexed array, which will never have hash collisions. The reason why you're still seeing a dependence on the array size has nothing to do with algorithmic complexity, it is caused by CPU cache effects. The larger the array is, the more likely it is that random-access lookups will result in cache misses (and cache misses higher in the hierarchy). – NikiC Jan 28 '16 at 16:54
  • 2
    FYI, PHP 7 uses a [greatly improved hashtable implementation](https://nikic.github.io/2014/12/22/PHPs-new-hashtable-implementation.html) for arrays. – ToolmakerSteve Apr 07 '19 at 21:35
6

The explanation for the case you specifically describe is that associative arrays are implemented as hash tables - so lookup by key (and correspondingly, array_key_exists) is O(1). However, arrays aren't indexed by value, so the only way in the general case to discover whether a value exists in the array is a linear search. There's no surprise there.
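
In other words, if you need to test many values, it can pay to build a key index once; a minimal sketch (the sample data is arbitrary):

$values = array( 'apple', 'banana', 'cherry' );
$index  = array_flip( $values );        // value => original position, built in O(n)

var_dump( isset( $index['banana'] ) );  // bool(true) - hash lookup, no linear scan
var_dump( $index['banana'] );           // int(1)     - the value's original index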

I don't think there's specific comprehensive documentation of the algorithmic complexity of PHP methods. However, if it's a big enough concern to warrant the effort, you can always look through the source code.

Dathan
  • This isn't really an answer. As I've stated in the question, I've already tried looking into the PHP source code. Since PHP is written in C making use of complex macros, it can be hard at times to "see" the underlying big O for functions. – Kendall Hopkins Mar 18 '10 at 23:30
  • 1
    @Kendall I overlooked your reference to diving into the source code. However, there is an answer in my reply: "I don't think there's specific comprehensive documentation of the algorithmic complexity of PHP methods." "No" is a perfectly valid answer. (c: – Dathan Mar 19 '10 at 00:08
6

You almost always want to use isset instead of array_key_exists. I'm not looking at the internals, but I'm pretty sure that array_key_exists is O(N) because it iterates over each and every key of the array, while isset tries to access the element using the same hash algorithm that is used when you access an array index. That should be O(1).

One "gotcha" to watch out for is this:

$search_array = array('first' => null, 'second' => 4);

// returns false
isset($search_array['first']);

// returns true
array_key_exists('first', $search_array);

I was curious, so I benchmarked the difference:

<?php

$bigArray = range(1,100000);

$iterations = 1000000;
$start = microtime(true);
while ($iterations--)
{
    isset($bigArray[50000]);
}

echo 'is_set:', microtime(true) - $start, ' seconds', '<br>';

$iterations = 1000000;
$start = microtime(true);
while ($iterations--)
{
    array_key_exists(50000, $bigArray);
}

echo 'array_key_exists:', microtime(true) - $start, ' seconds';
?>

is_set: 0.132308959961 seconds
array_key_exists: 2.33202195168 seconds

Of course, this doesn't show time complexity, but it does show how the two functions compare to each other.

To test for time complexity, compare the amount of time it takes to run one of these functions on the first key and the last key.
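
A sketch of that test (if the two timings come out roughly equal, the lookup isn't scanning the array linearly):

<?php

$bigArray = range(1,100000);

$iterations = 1000000;
$start = microtime(true);
while ($iterations--)
{
    array_key_exists(0, $bigArray);     // first key
}

echo 'first key:', microtime(true) - $start, ' seconds', '<br>';

$iterations = 1000000;
$start = microtime(true);
while ($iterations--)
{
    array_key_exists(99999, $bigArray); // last key
}

echo 'last key:', microtime(true) - $start, ' seconds';
?>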

ryeguy
  • 11
    This is wrong. I'm 100% sure array_key_exists doesn't have to iterate over each key. If you don't believe be take a look at the link below. The reason isset is so much faster is that it's a language construct. Which means it doesn't have the overhead of doing a function call. Also, I think it might be caching the lookup, because of this. Also, this isn't an answer to THE QUESTION! I would like a list of Big(O) for PHP functions (as the question states). Not a single benchmark of my examples. http://svn.php.net/repository/php/php-src/branches/PHP_5_3/ext/standard/array.c – Kendall Hopkins Mar 20 '10 at 15:27
  • If you still don't believe me, I've created a small benchmark to demonstrate the point. http://pastebin.com/BdKpNvkE – Kendall Hopkins Mar 20 '10 at 16:00
  • What is wrong with your benchmark is that you have to disable xdebug. =) – Guilherme Blanco Jan 07 '13 at 19:38
  • 4
    There are two critical reasons why you want to use isset over array_key_exists. First, isset is a language construct mitigating the cost of a function call. This is akin to the `$arrray[] = $append` vs `array_push($array, $append)` argument. Second, array_key_exists also differentiates between non-set and null values. For `$a = array('fred' => null);` `array_key_exists('fred', $a)` will return true while `isset($['fred'])` will return false. This extra step is non-trivial and will greatly increase execution time. – orca Jan 29 '13 at 19:16
0

If people were running into trouble in practice with key collisions, they would implement containers with a secondary hash lookup or a balanced tree. The balanced tree would give O(log n) worst-case behavior and O(1) average case (the hash itself). The overhead is not worth it in most practical in-memory applications, but perhaps there are databases that implement this form of mixed strategy as their default case.