0

Is there an efficient way to combine the return arrays of array_count_values($arr1) and array_count_values($arr2) if $arr1 and $arr2 have elements with the same value?

I'm trying to work through the classic problem: "generate the top 100 search requests from a document that contains 1 billion lines of search requests."

My approach is to use Unix split to chop the document into smaller files, count the occurrences of each search term in each file with array_count_values, then reduce those per-file counts into a single file listing each search query in descending order of popularity.
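
The map step described above could look roughly like the following sketch. The helper name `countTermsInFile` and the temp file standing in for one chunk produced by split are assumptions made for illustration. It counts line by line instead of loading the chunk into an array first, which should give the same counts as calling array_count_values on the lines.

```php
<?php
// Hypothetical helper: count how often each line (search term) occurs in one chunk file.
function countTermsInFile(string $path): array
{
    $counts = [];
    $handle = fopen($path, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    while (($line = fgets($handle)) !== false) {
        $term = rtrim($line, "\r\n");
        if ($term === '') {
            continue; // skip blank lines
        }
        $counts[$term] = ($counts[$term] ?? 0) + 1;
    }
    fclose($handle);
    return $counts;
}

// Stand-in for one chunk produced by `split` (assumed: one search term per line).
$chunk = tempnam(sys_get_temp_dir(), 'chunk');
file_put_contents($chunk, "kurt\ncurt\nkurt\ndave\nkrist\n");

$counts = countTermsInFile($chunk);
unlink($chunk);
print_r($counts);
```

Only the count table is held in memory here, not the raw lines, so the chunk size is bounded by the number of distinct terms rather than the number of requests.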

EDIT: For example:

$arr1 = array('kurt', 'curt', 'kurt', 'dave', 'krist');
$arr2 = array('dave', 'dave', 'krist', 'krist');

array_count_values($arr1); // ('kurt' => 2, 'curt'=>1, 'dave'=>1, 'krist'=>1)
array_count_values($arr2); // ('dave' => 2, 'krist'=>2)

How can I combine the two to form the following array?

('kurt' => 2, 'dave'=>3, 'krist'=>3, 'curt'=>1)
user784637

3 Answers

1

Try this:

$arr1 = array('kurt', 'curt', 'kurt', 'dave', 'krist');
$arr2 = array('dave', 'dave', 'krist', 'krist');

$cnt_arr1  = array_count_values($arr1); // ('kurt' => 2, 'curt'=>1, 'dave'=>1, 'krist'=>1)
$cnt_arr2  = array_count_values($arr2); // ('dave' => 2, 'krist'=>2)

$res_arr   = array_merge_recursive($cnt_arr1,$cnt_arr2);

$res       = array();
foreach ($res_arr as $key => $val) {
    if (is_array($val)) {
        // The key occurred in both arrays, so $val holds both counts.
        $res[$key] = array_sum($val);
    } else {
        $res[$key] = $val;
    }
}

echo "<pre>";
print_r($res);
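
One caveat worth hedging against with array_merge_recursive: PHP converts numeric-string keys such as '2013' into integer keys, and array_merge_recursive appends integer-keyed values under fresh indexes instead of combining them. Counts for purely numeric search terms would therefore come out wrong. Below is a minimal sketch of a loop-based merge that avoids this; `mergeCounts` is a hypothetical helper name, not a PHP built-in.

```php
<?php
// Hypothetical helper: add the counts in $b onto the counts in $a, key by key.
function mergeCounts(array $a, array $b): array
{
    foreach ($b as $term => $count) {
        $a[$term] = ($a[$term] ?? 0) + $count;
    }
    return $a;
}

$cnt1 = array_count_values(['kurt', '2013', 'dave']);
$cnt2 = array_count_values(['dave', '2013', '2013']);

// Integer keys are renumbered and appended here, so the '2013' counts are not summed.
$viaRecursive = array_merge_recursive($cnt1, $cnt2);

// Keys are preserved and counts are summed.
$merged = mergeCounts($cnt1, $cnt2);
print_r($merged);
```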
Prasanth Bendra
0

Try this:

$arr = array_merge($arr1,$arr2);
$count = array_count_values($arr);
Prasanth Bendra
  • This example works for a small data set, but what if I had a document of 1 billion search requests? The requirement is to perform the map step on a worker node and the reduce step on the master node. Assume that `$arr1` and `$arr2` together are larger than the RAM of a single worker node, that each of them fits in that RAM on its own, and that the combined return values of `array_count_values($arr1)` and `array_count_values($arr2)` are smaller than the memory of a single worker node. – user784637 Feb 22 '13 at 06:27
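
Under the constraints in the comment above, the reduce step only needs the running totals plus one partial result in memory at a time. A rough sketch follows, assuming each worker hands back its array_count_values result as an array; how the partial counts travel between nodes is not specified in this thread. `reduceCounts` and `topN` are hypothetical names.

```php
<?php
// Hypothetical reducer: fold partial count arrays into one running total.
// $partials can be any iterable, e.g. a generator that loads one worker's result at a time.
function reduceCounts(iterable $partials): array
{
    $total = [];
    foreach ($partials as $partial) {
        foreach ($partial as $term => $count) {
            $total[$term] = ($total[$term] ?? 0) + $count;
        }
    }
    return $total;
}

// Hypothetical helper: the $n most frequent terms, most popular first.
function topN(array $counts, int $n): array
{
    arsort($counts);                          // sort by count, descending, keeping keys
    return array_slice($counts, 0, $n, true); // true = preserve keys (matters for numeric terms)
}

$partials = [
    array_count_values(['kurt', 'curt', 'kurt', 'dave', 'krist']),
    array_count_values(['dave', 'dave', 'krist', 'krist']),
];

$total = reduceCounts($partials);
print_r(topN($total, 2));
```

With the sample data, 'dave' and 'krist' tie at 3; the relative order of tied terms depends on the PHP version's sort behaviour, so it should not be relied on.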
0

Please try the code below:

$arr1 = array('kurt', 'curt', 'kurt', 'dave', 'krist');
$arr2 = array('dave', 'dave', 'krist', 'krist');

$array1Count = array_count_values($arr1); // ('kurt' => 2, 'curt'=>1, 'dave'=>1, 'krist'=>1)
$array2Count = array_count_values($arr2); // ('dave' => 2, 'krist'=>2)
$resultArray = array();
foreach($array1Count as $key => $value) {
    if(array_key_exists($key, $array2Count)) {
        $resultArray[$key] = $array1Count[$key] + $array2Count[$key];
    }
    else {
        $resultArray[$key] = $array1Count[$key];
    }
    unset($array2Count[$key]);
}
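// The key sets are disjoint at this point; the + operator keeps keys intact,
// whereas array_merge would renumber integer keys (numeric search terms).
$finalResultArray = $array2Count + $resultArray;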
print_r($finalResultArray);
Vinoth Babu