
I am importing a CSV that could potentially have around 100,000 rows. Each row has 5 columns: the first column of each row has a sentence, and the other 4 columns have numeric values. I need to loop through the CSV data, grab each word from each row, and add it as its own row in a new array, summing the numeric columns for every row the word appears in. So for example:

| big red truck   | 5 | 2 | 5 | 1 |
| small red truck | 4 | 2 | 0 | 0 |
| big fast truck  | 3 | 2 | 4 | 1 |

becomes

| truck | 12 | 6 | 9 | 2 |
| red   | 9  | 4 | 5 | 1 |
| big   | 8  | 4 | 9 | 2 |
| fast  | 3  | 2 | 4 | 1 |
| small | 4  | 2 | 0 | 0 |

This is what I'm currently doing. It works fine with smaller files, but at around 50,000 rows I start getting server errors back.

function get_csv_terms($csvdata){
    $terms = array();
    $csv_rows = count($csvdata);
    $x = 0;
    //get unique terms
    while($x < $csv_rows){ //was <=, which read one past the last row
        $groupTerm = explode(' ', $csvdata[$x][0]);
        foreach( $groupTerm as $term ){
            if($term !== ''){ //skip empty strings left by double spaces
                if(!in_array($term, $terms)){
                    $terms[] = $term;
                }
            }
        }
        $x++;
    }

    return $terms;
}

//filter csv and create data for table output
function filter_csv($csvdata){
    $sortedData = array();
    $csv_rows = count($csvdata);
    $terms = get_csv_terms($csvdata);
    $terms_count = count($terms);
    $x = 0;

    while($x < $terms_count){ //was <=, which read one past the last term
        $y = 0;
        while($y < $csv_rows){
            $termWords = explode(" ", $csvdata[$y][0]);
            $termWordCount = count($termWords);
            $z = 0;
            while($z < $termWordCount){
                if($termWords[$z] === $terms[$x]){
                    if(!isset($sortedData[$terms[$x]])){
                        //initialize the running totals before adding to them
                        $sortedData[$terms[$x]] = array(0, 0.0, 0.0, 0.0);
                    }
                    $sortedData[$terms[$x]][0] += intval($csvdata[$y][1]);
                    $sortedData[$terms[$x]][1] += floatval($csvdata[$y][2]);
                    $sortedData[$terms[$x]][2] += floatval($csvdata[$y][3]);
                    $sortedData[$terms[$x]][3] += floatval($csvdata[$y][4]);
                }
                $z++;
            }
            $y++;
        }
        $x++;
    }

    return $sortedData;
}
John Nixon
  • What errors? Memory allocation errors? – jhilgeman Dec 31 '19 at 17:54
  • Also, your sample data indicates you have regular int values for the last 4 columns but your code tries to cast them as floats. – jhilgeman Dec 31 '19 at 17:56
  • Yes, and I've tried upping both the max_execution_time and memory_limit in my php.ini file with no luck – John Nixon Dec 31 '19 at 17:57
  • The last three are floats in the real data, just not in the small example above – John Nixon Dec 31 '19 at 17:58
  • Did you restart the PHP service (for FPM) or web server (for modular PHP) after making the changes in your ini file? Did you validate that the changes took place by looking at the output of phpinfo() ? Did the numbers in the error message change after you made the ini changes? – jhilgeman Dec 31 '19 at 18:03
  • It also looks like you are keeping quite a bit of duplicate data in memory. You have an array of all your CSV data, then the array containing the terms, and then an array containing the terms and their count values. There is a lot of room for optimization. – jhilgeman Dec 31 '19 at 18:13
  • I did restart the service, errors stayed the same. You are correct, there is room for optimization and that is where my knowledge hits a wall hahaha. – John Nixon Dec 31 '19 at 18:40
  • So far I've successfully trimmed out the need for get_csv_terms() and the array it produced. Working to trim further. – John Nixon Dec 31 '19 at 19:38
  • If you are getting a memory error and you've increased the limit in your ini file and the error message is still EXACTLY the same (same numbers) then either you've changed the wrong ini file or there is something else overriding it and setting a new memory limit. Try using ini_get and ini_set to check the current value and then set a higher memory limit in your code and validate that the changes took place. – jhilgeman Dec 31 '19 at 20:02
  • As a general optimization, try reading the CSV file one record at a time and aggregating the data that way (instead of reading the entire thing into memory and looping through it). – jhilgeman Dec 31 '19 at 20:05
  • Thanks for talking me through this one. I was able to remove the loops above and condense everything into reading the CSV one record at a time. It made the process much faster, with no issues. – John Nixon Jan 02 '20 at 14:48

1 Answer


As mentioned in the comment trails, the resolution should be to either increase the memory limit far enough (if possible), or else reduce data duplication and switch to processing the CSV one record at a time to reduce memory usage.
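
For reference, here is a minimal sketch of that streaming approach, assuming a plain comma-delimited file with no header row (the data.csv filename and the 512M limit are placeholders to adapt):

//optional: confirm and raise the memory limit at runtime, as discussed in the comments
//echo ini_get('memory_limit');
//ini_set('memory_limit', '512M');

$aggregated = array();
$handle = fopen('data.csv', 'r');
if($handle !== false){
    //read one record at a time instead of loading the whole file into memory
    while(($row = fgetcsv($handle)) !== false){
        foreach(explode(' ', $row[0]) as $word){
            if($word === ''){
                continue; //skip empty strings left by double spaces
            }
            if(!isset($aggregated[$word])){
                $aggregated[$word] = array(0, 0.0, 0.0, 0.0);
            }
            $aggregated[$word][0] += intval($row[1]);
            $aggregated[$word][1] += floatval($row[2]);
            $aggregated[$word][2] += floatval($row[3]);
            $aggregated[$word][3] += floatval($row[4]);
        }
    }
    fclose($handle);
}

Peak memory then scales with the number of unique words rather than the number of CSV rows.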

There's also a possibility that using a defined object class could give you better memory efficiency, or at the very least better code readability. For example, you might try this:

class Term
{
  // Instance properties
  public $UsageCount = 0;
  public $Weight     = 0.00;
  public $Value      = 0.00;
  public $OtherFloat = 0.00;

  // -----------------------

  // Term dictionary
  private static $_terms = array();

  public static function Get($term)
  {
    if(!isset(self::$_terms[$term]))
    {
      self::$_terms[$term] = new Term();
    }
    return self::$_terms[$term];
  }

  public static function GetAll()
  {
    return self::$_terms;
  }
}

and in your CSV loop:

$termWords = explode(" ", $csvdata[$y][0]);
...
foreach($termWords as $termWord)
{
  $Term = Term::Get($termWord);
  $Term->UsageCount += intval($csvdata[$y][1]);
  $Term->Weight     += floatval($csvdata[$y][2]);
  $Term->Value      += floatval($csvdata[$y][3]);
  $Term->OtherFloat += floatval($csvdata[$y][4]);
}

Recent versions of PHP (7 and later) are great at memory efficiency for objects. Depending on what other properties and code you might have, the object-oriented approach can provide a lot of nice benefits (e.g. automatic pass by reference, easier-to-read-and-maintain code, memory efficiency, clear defaults, data management, etc...).
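
For completeness, a quick sketch of reading the totals back out of the term dictionary once the loop has finished (the pipe-delimited output format is just an example):

foreach(Term::GetAll() as $word => $term)
{
  //each Term object holds the running totals accumulated in the CSV loop
  echo $word . " | " . $term->UsageCount . " | " . $term->Weight
     . " | " . $term->Value . " | " . $term->OtherFloat . "\n";
}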

jhilgeman
  • This older post here might be useful: https://stackoverflow.com/questions/17520093/read-large-data-from-csv-file-in-php Fundamentally, you need an algorithm that processes "a piece at a time" so that it doesn't matter how much input data you have ... only how many keys you have within that data. *Don't* start by "slurping" all of it into a memory-array. – Mike Robinson Jan 02 '20 at 15:45