94

in a very tight loop I need to access tens of thousands of values in an array containing millions of elements. The key can be undefined: In that case it shall be legal to return NULL without any error message:

Array key exists: return value of element. Array key does not exist: return null.

I do know multiple solutions:

    if (isset($lookup_table[$key])) {
        return $lookup_table[$key];
    } else {
        return;
    }

or

@return $lookup_table[$key];

or

error_reporting(0);
$return = $lookup_table[$key];
error_reporting(E_ALL);
return $return;

All solutions are far from optimal:

  • The first one requires 2 lookup in the B-TREE: One to check existence, another to retrieve value. That effectively doubles the runtime.
  • The second one uses the error suppression operator, and thus creates a massive overhead on that line.
  • The third one calls the error handler (that will check error_reporting setting and then display nothing) and thereby creates an overhead.

My question is if I miss a way to avoid error handling and yet work with a single Btree lookup?

To answer some questions:

The array caches the results of a complex calculation - to complex to be done in real time. Out of billions of possible values, only millions yield a valid result. The array looks like 1234567 => 23457, 1234999 => 74361, .... That is saved to a PHP file of several megabyte, and include_once-d at the beginning of the execution. Initial load time does not matter. If the key is not found, it simply means that this specific value will not return a valid result. The trouble is to get this done 50k+ per second.

Conclusion

Edit: outdated, check the accepted answer.

As there is no way found to get the value with a single lookup and without error handling, I have trouble accepting a single answer. Instead I upvoted all the great contributions.

The most valuable inputs where:

  • use array_key_exists, as it is faster than alternatives
  • Check out PHP's QuickHash

There was a lot of confusion on how PHP handles arrays. If you check the source code, you will see that all arrays are balanced trees. Building own lookup methods is common in C and C++, but is not performant in higher script-languages like PHP.

EDIT: As to "duplicate question": This question deals with performance benchmarks, not with an explanation of the meaning of "undefined index".

Zsolt Szilagyi
  • 4,741
  • 4
  • 28
  • 44
  • `isset($lookup_table[$key])` will return `false` if `$lookup_table[$key]` is `null` ... isn't this the opposite of what you want? not sure what its performance is like, but `array_key_exists` will return `true` if the key exists but its value is `null`. – user428517 May 21 '13 at 17:16
  • @sgroves: The example did reflect my intention. I updated the question to clarify: If key exists, return value, otherwise return null. (Value of existent keys will never be null.) array_key_exist is, frankly, slower that the isset version, so I skipped it in my examples. – Zsolt Szilagyi May 21 '13 at 17:22
  • 2
    [Quickhash](http://us3.php.net/manual/en/book.quickhash.php) might be good for your case. – Alex Howansky May 21 '13 at 17:24
  • "an array containing millions of elements". You may need to rethink your strategy and involve some type of pagination or way to segment the data. – Mike Purcell May 21 '13 at 17:27
  • @MikePurcell Under other circumstances you'd be more than right. Unfortunately here is performance of the essence, and analysing a value to get the right array to use would cost far more than a a further level in the btree. – Zsolt Szilagyi May 21 '13 at 17:33
  • Can you explain how the array is created? Is it read from a file, static PHP or from SQL? – Reactgular May 21 '13 at 17:38
  • "The key can be undefined". How is this possible? – Mike Purcell May 21 '13 at 17:39
  • I believe the fastest way to handle this would be to use a different language. – Sean Bright May 21 '13 at 17:47
  • 1
    Mike, Mathew, I updated my question to answer yours. Sean, that´s not a matter of believe. Of course native C is faster than PHP. I am simply unwilling to spend a month writing code I can write over night in PHP. :) – Zsolt Szilagyi May 21 '13 at 17:53
  • You clearly **can't** write it over night in PHP. That's why you're here. – Sean Bright May 21 '13 at 17:56
  • @sean: It´s 8pm in germany :) (In other words - the only constant when asking for help building a house is that you can´t use advise on how to build a boat) – Zsolt Szilagyi May 21 '13 at 18:01
  • @AlexHowansky checking out Quickhash now. Looks promising! – Zsolt Szilagyi May 21 '13 at 18:06
  • It seems you are comparing some master hash list with a hash list created after executing an algorithm. If order is arbitrary, why not write the master hash list to one file, then the results of the algoritm to another file, then use the linux `diff` command to compare them. – Mike Purcell May 21 '13 at 18:19
  • Have you tried this in HHVM? Plus, a few 10k accesses in a million-odd items that you don't need in 'real' or 'interactive' time can't possibly be that slow. What's the limiting factor here that drives this kind of micro-optimization? – pvg Mar 15 '17 at 01:57
  • @pvg I run that in a daemon running 24/7 repeating the job, and the business value depended on how fast that was possible. – Zsolt Szilagyi Mar 15 '17 at 15:17
  • That doesn't really explain very much. It's hard to imagine a computationally bound algo that is actually significantly limited by PHP array access time. Have you measured any of this stuff? – pvg Mar 15 '17 at 15:28
  • @ZsoltSzilagyi: Since you list your summary of useful responses in the question itself, you may wish to edit it to either remove that, or reflect the now-accepted answer of the null coalesce operator, `??`. – lindes Dec 16 '22 at 02:46

8 Answers8

149

Update

Since PHP 7 you can accomplish this with the null coalesce operator:

return $table[$key] ?? null;

Old answer

First of all, arrays are not implemented as a B-tree, it's a hash table; an array of buckets (indexed via a hash function), each with a linked list of actual values (in case of hash collisions). This means that lookup times depend on how well the hash function has "spread" the values across the buckets, i.e. the number of hash collisions is an important factor.

Technically, this statement is the most correct:

return array_key_exists($key, $table) ? $table[$key] : null;

This introduces a function call and is therefore much slower than the optimized isset(). How much? ~2e3 times slower.

Next up is using a reference to avoid the second lookup:

$tmp = &$lookup_table[$key];

return isset($tmp) ? $tmp : null;

Unfortunately, this modifies the original $lookup_table array if the item does not exist, because references are always made valid by PHP.

That leaves the following method, which is much like your own:

return isset($lookup_table[$key]) ? $lookup_table[$key] : null;

Besides not having the side effect of references, it's also faster in runtime, even when performing the lookup twice.

You could look into dividing your arrays into smaller pieces as one way to mitigate long lookup times.

Community
  • 1
  • 1
Ja͢ck
  • 170,779
  • 38
  • 263
  • 309
  • 2
    I was skeptical about this approach, so I generated a random array of about 5000 keys in the range of 0-100,000, and ran through the array from 0-100,000. I tried your methoed with reference, a method without reference, but still an auxiliary variable, and a simple approach without either auxiliary variable or reference. Your reference approach took more than twice as long as the other ones. The naive approach was marginally faster than the auxiliary variable approach. – kba May 21 '13 at 17:40
  • 1
    @kba The answer mentions that it **modifies the array** (if the item doesn't exist), so obviously that takes time. It also mentions that without it, the first solution given by OP is likely the fastest. – Ja͢ck May 21 '13 at 17:43
  • In what way does your approach modify the array? I don't see you making any changes to `$tmp` or `$lookup_table[$key]`. – kba May 21 '13 at 17:49
  • @kba Just try this `$x = array(); $tmp = &$x[123]; print_r($x);` :) – Ja͢ck May 21 '13 at 17:50
  • Yes, that will modify the array, but that is not what you're doing. You're returning `$tmp` or `null`, you're not modifying `$tmp` in your code example. – kba May 21 '13 at 17:50
  • @kba Look again, that's almost exactly what I'm doing. Do you see me modifying `$tmp` in that short code snippet? – Ja͢ck May 21 '13 at 17:52
  • I tested it and you seem to be right, but I don't quite understand how `$tmp` is being modified. Does `isset` have side effects? I don't see you assigning anything to `$tmp` in the second line. – kba May 21 '13 at 17:55
  • @Jack, could you describe what exactly happens? That would be highly interesting and could immediately end both discussion and guessing game. Thanks! – Zsolt Szilagyi May 21 '13 at 17:56
  • 3
    @ZsoltSzilagy When you create a reference on a non-existent variable, PHP creates the necessary structures to make it a valid reference; in this case, it creates an array entry with value `null`. It works well if the item will always be there of course :) – Ja͢ck May 21 '13 at 17:58
  • The more you know! I can't remove my downvote now, unless you edit your entry. If you could add that explanation to your answer, I can remove my downvote and upvote you instead. – kba May 21 '13 at 18:01
  • @Jack That is great! It could blow my RAM after some hours of runtime, but I will find a way to array_filter while it is idle and waiting for input. – Zsolt Szilagyi May 21 '13 at 18:05
  • 2
    Still, however, my tests imply that this doesn't improve the performance after all. Even after having run through every item in the array once, the lookup time is still worse. [My tests could be bad](http://paste.debian.net/5585/). – kba May 21 '13 at 18:09
  • @kba Then I guess reference are both potentially modifying the array *and* slow :) I'll add it to my answer, thanks for the harder evidence! – Ja͢ck May 21 '13 at 18:13
  • 2e3 times slower is 2000 times slower? It's not true, 3 times is much more closer to function call overhead against simple instruction. – Alexander Shostak Jan 25 '19 at 15:04
  • @AlexanderShostak not sure whether i meant to write 2-3 or 2k, actually. but it is slower :) – Ja͢ck Jan 25 '19 at 22:43
  • @Jack, I agree. About 2 times in modern PHP versions in my tests and pull request for PHP 7.3 (https://github.com/php/php-src/pull/3360). It seems, that from PHP 7.3+ array_key_exists will be implemented via separate opcode and be even faster than isset. Thus we can use it everywhere even now (the performance penalty is not huge in PHP 7.3-). – Alexander Shostak Jan 26 '19 at 12:19
7

I did some bench marking with the following code:

set_time_limit(100);

$count = 2500000;
$search_index_end = $count * 1.5;
$search_index_start = $count * .5;

$array = array();
for ($i = 0; $i < $count; $i++)
    $array[md5($i)] = $i;

$start = microtime(true);
for ($i = $search_index_start; $i < $search_index_end; $i++) {
    $key = md5($i);
    $test = isset($array[$key]) ? $array[$key] : null;
}
$end = microtime(true);
echo ($end - $start) . " seconds<br/>";

$start = microtime(true);
for ($i = $search_index_start; $i < $search_index_end; $i++) {
    $key = md5($i);
    $test = array_key_exists($key, $array) ? $array[$key] : null;
}
$end = microtime(true);
echo ($end - $start) . " seconds<br/>";


$start = microtime(true);
for ($i = $search_index_start; $i < $search_index_end; $i++) {
    $key = md5($i);
    $test = @$array[$key];
}
$end = microtime(true);
echo ($end - $start) . " seconds<br/>";

$error_reporting = error_reporting();
error_reporting(0);
$start = microtime(true);
for ($i = $search_index_start; $i < $search_index_end; $i++) {
    $key = md5($i);
    $test = $array[$key];
}
$end = microtime(true);
echo ($end - $start) . " seconds<br/>";
error_reporting($error_reporting);

$start = microtime(true);
for ($i = $search_index_start; $i < $search_index_end; $i++) {
    $key = md5($i);
    $tmp = &$array[$key];
    $test = isset($tmp) ? $tmp : null;
}
$end = microtime(true);
echo ($end - $start) . " seconds<br/>";

and I found that the fastest running test was the one that uses isset($array[$key]) ? $array[$key] : null followed closely by the solution that just disables error reporting.

Sean Bright
  • 118,630
  • 17
  • 138
  • 146
DiverseAndRemote.com
  • 19,314
  • 10
  • 61
  • 70
4

This Work for me

{{ isset($array['key']) ? $array['key']: 'Default' }} 

but this is fast

{{ $array['key'] or 'Default' }}
1

There are two typical approaches to this.

  1. Define defaults for an undefined key.
  2. Check for undefined key.

Here is how to perform the first and as little code as possible.

$data = array_merge(array($key=>false),$data);
return $data[$key];

Here is how to perform the second.

return isset($data[$key]) ? $data[$key] : false;
Reactgular
  • 52,335
  • 19
  • 158
  • 208
  • Thank you. The second version with the ternary operator is a double-lookup like in my first snippet. The first version is exciting, but I have little hope: It temporaly creates a copy of a *huge* array, and while there is only one btree lookup, there is a complete (temporary) array to be balanced. Runtime raises from 2log(n) to 2n. – Zsolt Szilagyi May 21 '13 at 17:27
  • 2
    Do you have the option to re-organize this data for performance? An octree lookup would be more efficient than a linear search. You could group by the first letter of each key, and then by the second letter and so on. http://en.wikipedia.org/wiki/Octree – Reactgular May 21 '13 at 17:32
  • Great approach! PHP internally handles arrays as fully balanced trees. I assume that won´t be outperformed by an octree, but I am not that deep into algorithm analysis. Do you have experience on that comparison? Thanks! – Zsolt Szilagyi May 21 '13 at 17:39
  • The only way PHP could be faster is if the keys are sorted internally, and an optimized lookup is used to find a key. I've never read anything that says PHP uses optimized indexes for array keys. – Reactgular May 21 '13 at 17:42
1

Just a sudden idea that would have to be tested, but did you try using array_intersect_key() to get the existing values and a array_merge to fill() the rest ? It would remove the need of a loop to access the data. Something like that :

$searched_keys = array ('key1' => null, 'key2' => null); // the list of the keys to find

$exiting_values = array_intersect_key($lookup_table, $searched_keys);
$all_values = array_merge($searched_keys, $exiting_keys);

Please note that I did not tried it performance-wise.

Aralicia
  • 875
  • 5
  • 15
1

The @ operator and the error_reporting methods will both be slower than using isset. With both of these methods, it modifies the error reporting setting for PHP, but PHP's error handler will still be called. The error handler will check against the error_reporting setting and exit without reporting anything, however this has still taken time.

thomasrutter
  • 114,488
  • 30
  • 148
  • 167
1

I prefer using the isset function instead of escaping the error. I made a function to check the key exists and if not returns a default value, in the case of nested arrays you just need to add the other keys in order:

Nested array lookup:

/**
 * Lookup array value.
 *
 * @param array $array
 * @param array $keys
 * @param $defaultValue
 */
public static function array_key_lookup($array, $keys, $defaultValue)
{
    $value = $array;
    foreach ($keys as $key) {
        if (isset($value[$key])) {
            $value = $value[$key];
        } else {
            $value = $defaultValue;
            break;
        }
    }

    return $value;
}

Usage example:

$array = [
    'key1' => 'value1',
    'key2' => 'value2',
    'key3' => [
        'key3a' => 'value3a',
        'key3b' => 'value3b'
    ]
];

array_key_lookup($array, ['key3', 'key3a'], 'default')
'value3a'

array_key_lookup($array, ['key2', 'key2a'], 'default')
'default'

array_key_lookup($array, ['key2'], 'default')
'value2'

array_key_lookup($array, ['key5'], 'default')
'default'

Escaping the error:

$value = @$array[$key1][$key2] ?: $defaultValue;
Ivan Babic
  • 21
  • 2
  • 1
    Hey, that's a nice approach form the usability perspective, but performancewise it's actually worse than the alternatives. The function call implicitely creates an array, the foreach makes a copy of it, the first line duplicates a variable etc. It would be much faster to call $value = isset($array[$key1][$key2]) ? $array[$key1][$key2] : $defaultValue; – Zsolt Szilagyi Oct 22 '19 at 12:30
0

First, re-organize the data for performance by saving a new array where the data is sorted by the keys, but the new array contains a regular numeric index.

This part will be time consuming, but only done once.

 // first sort the array by it's keys
 ksort($data);

 // second create a new array with numeric index
 $tmp = new array();
 foreach($data as $key=>$value)
 {
    $tmp[] = array('key'=>$key,'value'=>$value);
 }
 // now save and use this data instead
 save_to_file($tmp);

Once that is done it should be quick to find the key using a Binary Search. Later you can use a function like this.

  function findKey($key, $data, $start, $end)
  { 
    if($end < $start) 
    { 
        return null; 
    } 

    $mid = (int)(($end - $start) / 2) + $start; 

    if($data[$mid]['key'] > $key) 
    { 
        return findKey($key, $data, $start, $mid - 1); 
    } 
    else if($data[$mid]['key'] < $key) 
    { 
        return findKey($key, $data, $mid + 1, $end); 
    } 

    return $data[$mid]['value'];
 }

To perform a search for a key you would do this.

 $result = findKey($key, $data, 0, count($data));
 if($result === null)
 {
      // key not found.
 }

If the count($data) is done all the time, then you could cache that in the file that you stored the array data.

I suspect this method will be a lot faster in performance then a regular linear search that is repeated against the $data. I can't promise it's faster. Only an octree would be quicker, but the time to build the octree might cancel out the search performance (I've experienced that before). It depends on how much searching in the data you have to do.

Reactgular
  • 52,335
  • 19
  • 158
  • 208