
I would like to convert this two-dimensional array of records:

[records] => Array
(
  [0] => Array
  (
    [0] => Pears
    [1] => Green
    [2] => Box
    [3] => 20
  )
  [1] => Array
  (
    [0] => Pears
    [1] => Yellow
    [2] => Packet
    [3] => 4
  )
  [2] => Array
  (
    [0] => Peaches
    [1] => Orange
    [2] => Packet
    [3] => 4
  )
  [3] => Array
  (
    [0] => Apples
    [1] => Red
    [2] => Box
    [3] => 20
  )
)

Into this three-dimensional array, where the records are grouped under array keys taken from a certain value in the original array:

[converted_records] => Array
(
  [Pears] => Array
  (
    [0] => Array
    (
      [0] => Green
      [1] => Box
      [2] => 20
    )
    [1] => Array
    (
      [0] => Yellow
      [1] => Packet
      [2] => 4
    )
  )
  [Peaches] => Array
  (
    [0] => Array
    (
      [0] => Orange
      [1] => Packet
      [2] => 4
    )
  )
  [Apples] => Array
  (
    [0] => Array
    (
      [0] => Red
      [1] => Box
      [2] => 20
    )
  )
)

I can do this like so:

$array = $records; // Sample data like the first array above
$storage = array();
$cnt = 0;
foreach ($array as $key=>$values) {
  $storage[$values[0]][$cnt] = array (
    0 => $values[1],
    1 => $values[2],
    2 => $values[3]
  );
  $cnt ++;
}

I would like to know if there is a more efficient way to do this. I am not aware of any functions within PHP that are capable of this, so I can only assume that this is basically how it would be done.

The problem is, though, that this is going to be repeated many, many times, and every millisecond is going to count, so I really want to know: what is the best way to accomplish this task?

EDIT

The records array is created by parsing a .CSV file as follows:

$records = array_map('str_getcsv', file('file.csv'));

EDIT #2

I did a simple benchmark test on a set of 10 results (5k records each) and got an average runtime of 0.645478 seconds. Granted, there are a few other things going on before this, so this is not a true indication of actual performance, but it is a good basis for comparison with other methods.
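
For reference, here is a minimal sketch of how such an average can be measured by wrapping the routine in microtime() calls (the repeat count and file name are placeholders; the real test also does other work before this step):

$times = array();
for ($i = 0; $i < 10; $i++) {
  $start = microtime(true);

  // the routine from above, unchanged
  $array = array_map('str_getcsv', file('file.csv'));
  $storage = array();
  $cnt = 0;
  foreach ($array as $key=>$values) {
    $storage[$values[0]][$cnt] = array (
      0 => $values[1],
      1 => $values[2],
      2 => $values[3]
    );
    $cnt ++;
  }

  $times[] = microtime(true) - $start;
}
echo 'average: ', array_sum($times) / count($times), " seconds\n";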

EDIT #3

I did a test with about 20x the records. The average for my routine was 14.91971 seconds.

At some point the answer below by @num8er had $records[$key][] = array_shift($data); before the answer was updated to its current form.

When I tried testing with the larger set of results, that version ran out of memory, as it was generating an error for each record.

That being said, once I changed it to $records[$key][] = $data; the routine completed with an average of 18.03699 seconds, with gc_collect_cycles() commented out.

I've reached the conclusion that although @num8er's method is faster for smaller files, for larger ones my method works out quicker.

Craig van Tonder
  • Your solution is my solution to this problem :) – Bryant Frankford Aug 24 '15 at 20:24
  • This is already as efficient as it gets. If you want to save some microseconds, you could use opcode compilers such as the built-in Zend compiler in PHP >5.4 or Facebook's HHVM https://code.facebook.com/projects/564433143613123/hhvm/ – BlackCetha Aug 24 '15 at 20:27
  • Thank you both, this leaves me quite sad though :/ Wondering what would be better than PHP for this, then, in terms of processing this millions of times. – Craig van Tonder Aug 24 '15 at 20:31
  • you don't need to add the keys by hand: `$storage[$values[0]][] = array ($values[1],$values[2],$values[3] );` – Dagon Aug 24 '15 at 20:31
  • Thanks @Dagon, I am aware of this but it was only to make it more easily understandable. – Craig van Tonder Aug 24 '15 at 20:32
  • What's the source of the original array? Looks like a db? If so, you have already looped to get into the shown array; you could stop that. – Dagon Aug 24 '15 at 20:34
  • @Dagon Source is simply parsed from csv like so: `$csv_file = array_map('str_getcsv', file('file.csv'));`. Seems to be one of the quickest ways to do this in terms of code and overhead/system performance. – Craig van Tonder Aug 24 '15 at 20:36
  • There may be some better 'big picture' approach - but we don't have the information for that. – Dagon Aug 24 '15 at 20:37
  • @BlackCetha HHVM is actually pretty interesting. In the long run I want to move most of the client-facing stuff out of PHP and into node, I still want to use PHP on the backend/server side though so this would actually be really useful... Thanks so much! – Craig van Tonder Aug 24 '15 at 20:37
  • You could build the output array as you parse the CSV so you wouldn't have to traverse the same data again to reprocess it. – Don't Panic Aug 24 '15 at 20:38
  • @Don'tPanic This was my initial approach, however it runs too slowly. Trying to speed the process up here, so trying to do this in bulk without having to iterate over each record. This is why I've opted to parse the CSV file into an array as above in my comments; the next logical step for me is to do this process in bulk, like the PHP function `magic_array_split($array)` :-) – Craig van Tonder Aug 24 '15 at 20:41

2 Answers


If you're only looking for some clean code:

$array   = array_map('str_getcsv', file('file.csv'));

$storage = array();
foreach ($array as $values) {
    $key             = array_shift($values);
    $storage[$key][] = $values;
}

Unless you have hundreds of thousands of array entries, speed shouldn't be a concern either.
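
With the sample data from the question this produces exactly the converted_records structure asked for (the values will be strings, since str_getcsv() returns strings):

print_r($storage);
// shorthand, not literal print_r() output:
// [Pears]   => [[Green, Box, 20], [Yellow, Packet, 4]]
// [Peaches] => [[Orange, Packet, 4]]
// [Apples]  => [[Red, Box, 20]]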

WhoIsJohnDoe

Reading a big file into memory using file() (1st pass, when the file is read), then iterating over the lines using array_map() (2nd pass, after every line of the file has been read into the array), and then doing a foreach on the array (3rd pass) is a bad idea when you're looking for performance.

You're iterating 3 times, so for 100K records that means 300K iterations. The most performant way is to do it while reading the file, so there is only 1 pass - reading lines (100K records == 100K iterations):

ini_set('memory_limit', '1024M');
set_time_limit(0);

$file = 'file.csv';
$file = fopen($file, 'r');

$records = array();
while($data = fgetcsv($file)) {
  $key = $data[0];
  if(!isset($records[$key])) {
    $records[$key] = array();
  }

  $records[$key][] = array(0 => $data[1],
                           1 => $data[2],
                           2 => $data[3]);
  gc_collect_cycles();
}

fclose($file);
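
Memory, not just wall time, turned out to be the limiting factor on the larger files (see EDIT #3 in the question), so when comparing approaches it can also be worth logging peak usage once the loop is done, for example:

printf("peak memory: %.2f MB\n", memory_get_peak_usage(true) / 1048576);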


And here is parent -> child processing for huge files:

<?php

ini_set('memory_limit', '1024M');
set_time_limit(0);

function child_main($file)
{
    $my_pid = getmypid();
    print "Starting child pid: $my_pid\n";

    /**
     * OUR ROUTINE
     */

    $fh = fopen($file, 'r'); // keep the path in $file so unlink() below gets the filename, not a handle
    $records = array();
    while($data = fgetcsv($fh)) {
        $key = $data[0];
        if(!isset($records[$key])) {
            $records[$key] = array();
        }

        $records[$key][] = array(0 => $data[1],
            1 => $data[2],
            2 => $data[3]);
        gc_collect_cycles();
    }
    fclose($fh);

    unlink($file);

    return 1;
}


$file = __DIR__."/file.csv";
$files = glob(__DIR__.'/part_*');
if(sizeof($files)==0) {
    exec('split -l 1000 '.$file.' part_'); 
    $files = glob(__DIR__.'/part_*');
}

$children = array();
foreach($files AS $file) {
    if(($pid = pcntl_fork()) == 0) {
        exit(child_main($file));
    }
    else {
        $children[] = $pid;
    }
}

foreach($children as $pid) {
    $pid = pcntl_wait($status);
    if(pcntl_wifexited($status)) {
        $code = pcntl_wexitstatus($status);
        print "pid $pid returned exit code: $code\n";
    }
    else {
        print "$pid was unnaturally terminated\n";
    }
}

?>
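
If the grouped data needs to survive each child process (the question mentions writing results back out in batches), one possible approach, purely a sketch with a made-up "grouped_" file prefix, is to have child_main() serialize its array next to each part file before returning, so a later step can unserialize() and merge the pieces:

// at the end of child_main(), just before `return 1;`
file_put_contents(__DIR__.'/grouped_'.basename($file), serialize($records));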
num8er
  • Like a boss, thank you so much for the extra and noteworthy option! FYI, it's reading lots of medium files into memory. But yes my problem here is like you're saying... 50k x 200 = 10m == 30m === a looonnng time. Going to test this now and see how it works out :) – Craig van Tonder Aug 24 '15 at 21:31
  • Also, try not to output debug information to the display (to check if it read properly) if it's not necessary, because outputting data to the console is also IO intensive and will cost performance. – num8er Aug 24 '15 at 21:40
  • I don't really have any debug code in here as it gets removed when things work :) I am struggling to make this thing work out for me... It's just hanging, still hasn't completed 1 round. Going to dig around and see if there is a reason why in the apache/php config regarding fopen. – Craig van Tonder Aug 24 '15 at 21:59
  • Calling the script through Apache+PHP is a bad idea, it has execution limits. Run this script in the console using: php scriptfile.php – num8er Aug 24 '15 at 22:02
  • I am using CLI mode for this script and the tests too, the execution limit is 0/unlimited as the entire process initially took 6 hours to run. See my Edit #2 in my question, the average runtime is 0.6 secs; see the other answer, it was roughly a second more. Yours hangs but I will work out why, I'm sure! – Craig van Tonder Aug 24 '15 at 22:05
  • Is it a one-time job? Or will you have to do it repeatedly? – num8er Aug 24 '15 at 22:06
  • It reads a lot of data in batches, repurposes it, then writes it out again in batches. Not sure exactly what you mean though, could you expand on the question? – Craig van Tonder Aug 24 '15 at 22:07
  • Updated my answer with time limit 0 and extended the memory limit to 4 GIGS, tried it? – num8er Aug 24 '15 at 22:11
  • Also, one more thing: what are you doing with this array afterwards? Writing it back to a file, or? – num8er Aug 24 '15 at 22:17
  • I am not testing writing back to the files, only reading one of the sample files into an array (but before this there are other things like scandir and db calls, so my benchmark is exaggerated but still accurate). – Craig van Tonder Aug 24 '15 at 22:19
  • I've replaced fgets() and str_getcsv() with one call, fgetcsv(), to gain a little performance. Thinking about how to make this operation fastest. (: – num8er Aug 24 '15 at 22:21
  • I think gc_collect_cycles() fixed the memory leak the first time around, but the second time around I think that you did something incredible. Hold on for my benchmark results :) – Craig van Tonder Aug 24 '15 at 22:26
  • Your result was: 0.439912 - a clear winner thus far... As much as I feel you deserve a big reward, I am tempted to bounty this question as I wonder if there is a faster way yet to accomplish this task? – Craig van Tonder Aug 24 '15 at 22:29
  • `!== FALSE` gives `array_shift() expects parameter 1 to be array`. – Craig van Tonder Aug 24 '15 at 22:35
  • `gc_collect_cycles();` increases runtime by 0.01ms :) – Craig van Tonder Aug 24 '15 at 22:36
  • I know about gc_collect_cycles, it forces the CPU to do some garbage collection work, so if there are no memory limits you can comment it out (disable it). I've made a little change: rather than checking the big array with isset(), I've created an array of keys that will keep the names of the products. – num8er Aug 24 '15 at 22:38
  • Hmm, can you fix the output array? It is not matching the requirement, and I'm thinking that's why you got a faster time! – Craig van Tonder Aug 24 '15 at 22:40
  • Can I have a part of the .csv file to test it? :D – num8er Aug 24 '15 at 22:43
  • Ohhh, sorry, I forgot that array_shift() takes the variable by reference. Fixed (: – num8er Aug 24 '15 at 22:46
  • That worked perfectly wow... Moves all values in at once! Join the discussion? – Craig van Tonder Aug 24 '15 at 22:50
  • nice! happy to help (: – num8er Aug 24 '15 at 23:30
  • Two downsides with this. It does not scale well at all, whereas my option deals with pretty large result sets quite efficiently. The other is that it is not entirely according to the required output: Pears is replacing Green within the converted_records array, and as we are pushing the whole array onto it there is no way to exclude this value (except by manually assigning the information within the array) ;) – Craig van Tonder Aug 25 '15 at 09:22
  • With pretty large result sets I think it's better to split the file into parts using the Linux split command, try calling: exec('split -l 10000 file.csv file_'); and then run a master PHP process that does the split and calls child PHP scripts, passing the split files as arguments. – num8er Aug 25 '15 at 09:28
  • Interesting approach, thanks, will try that and see how it works out compared to having the ability to process massive chunks at once - this might actually be bad because splitting the file would be part of the process, and that's a whole different question, so I would have to compare total runtimes :) – Craig van Tonder Aug 25 '15 at 09:29
  • I've updated my answer. The parent process splits the file into part_* files with 1000 lines in each and then forks child processes, passing one part_* file to each one. With this approach we get a kind of job parallelising. – num8er Aug 25 '15 at 09:47
  • I will most definitely give it a try for comparison but this will take some time. The thing is, these two routines run at similar speeds for smaller files; yes, yours is quicker, but it cannot deal with heaps of information. Ultimately, I think that the extra work involved here in processing the files to read will exceed that of processing this in a more singular way. Will be interesting to see what becomes of my assumption though! – Craig van Tonder Aug 25 '15 at 09:56
  • See my Edit #3 in the question. – Craig van Tonder Aug 25 '15 at 10:35