
Good morning. I'm learning some hard lessons while trying to handle huge CSV files of up to 4 GB.

The goal is to search for items in a CSV file (an Amazon datafeed) by a given browsenode and also by some given item IDs (ASINs), in order to get a mix of existing items (already in my database) plus some additional new items, since from time to time items disappear from the marketplace. I also filter on the items' titles, because many items share the same title.

I have read lots of tips here and finally decided to use PHP's fgetcsv(), thinking this function would not exhaust memory since it reads the file line by line. But no matter what I try, I always run out of memory. I cannot understand why my code uses so much memory.

I set the memory limit to 4096 MB and the time limit to 0. The server has 64 GB of RAM and two SSD hard disks.

Could someone please look over my piece of code and explain how it is possible that I'm running out of memory, and more importantly, how the memory is being used?

private function performSearchByASINs()
{
    $found = 0;
    $needed = 0;
    $minimum = 84;
    if(is_array($this->searchASINs) && !empty($this->searchASINs))
    {
        $needed = count($this->searchASINs);
    }
    if($this->searchFeed == NULL || $this->searchFeed == '')
    {
        return false;
    }
    $csv = fopen($this->searchFeed, 'r');
    if($csv)
    {
        $l = 0;
        $title_array = array();
        while(($line = fgetcsv($csv, 0, ',', '"')) !== false)
        {
            $header = array();
            if(trim($line[6]) != '')
            {
                if($l == 0)
                {
                    $header = $line;
                }
                else
                {
                    $asin = $line[0];
                    $title = $this->prepTitleDesc($line[6]);
                    if(is_array($this->searchASINs) 
                    && !empty($this->searchASINs) 
                    && in_array($asin, $this->searchASINs)) //search for existing items to get them updated
                    {
                        $add = true;
                        if(in_array($title, $title_array))
                        {
                            $add = false; 
                        }
                        if($add === true)
                        {
                            $this->itemsByASIN[$asin] = new stdClass();
                            foreach($header as $k => $key)
                            {
                                if(isset($line[$k]))
                                {
                                    $this->itemsByASIN[$asin]->$key = trim(strip_tags($line[$k], '<br><br/><ul><li>'));
                                }
                            }
                            $title_array[] = $title;
                            $found++;
                        }
                    }
                    if(($line[20] == $this->bnid || $line[21] == $this->bnid) 
                    && count($this->itemsByKey) < $minimum 
                    && !isset($this->itemsByASIN[$asin])) // searching for new items
                    {
                        $add = true;
                        if(in_array($title, $title_array))
                        {
                            $add = false;
                        }
                        if($add === true)
                        {
                            $this->itemsByKey[$asin] = new stdClass();
                            foreach($header as $k => $key)
                            {
                                if(isset($line[$k]))
                                {
                                    $this->itemsByKey[$asin]->$key = trim(strip_tags($line[$k], '<br><br/><ul><li>'));                                
                                }
                            }
                            $title_array[] = $title;
                            $found++;
                        }
                    }
                }
                $l++;
                if($l > 200000 || $found == $minimum)
                {
                    break;
                }
            }
        }
        fclose($csv);
    }
}
Thomas Tonius

3 Answers


I know my answer is a bit late, but I had a similar problem with fgets() and things based on fgets(), such as SplFileObject->current(). In my case it was on a Windows system, trying to read a file of over 800 MB. It seems fgets() doesn't free the memory of the previous line inside a loop, so every line that was read stayed in memory and led to a fatal out-of-memory error. I fixed it by using fread($lineLength) instead, but that is a bit trickier since you must supply the length.
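For illustration, a minimal sketch of this fread() approach; the chunk size here is arbitrary and the line buffering is done by hand:

<?php
$handle = fopen('data.csv', 'r');
$buffer = '';
while (!feof($handle)) {
    $buffer .= fread($handle, 8192);          // read a fixed-size chunk
    while (($pos = strpos($buffer, "\n")) !== false) {
        $line = substr($buffer, 0, $pos);     // take one complete line
        $buffer = substr($buffer, $pos + 1);  // and drop it from the buffer
        $row = str_getcsv($line);             // parse the CSV fields
        // ... process $row here ...
    }
}
if ($buffer !== '') {
    $row = str_getcsv($buffer); // handle a final line without a trailing newline
}
fclose($handle);
?>

Note that, unlike fgetcsv(), this simple version does not handle quoted fields that contain embedded newlines.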

lorenzobe

It is very hard to manage large data sets in arrays without running into timeout or memory issues. Instead, why not parse this datafeed into a database table and do the heavy lifting from there?
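As a rough sketch of that approach (assuming PDO with MySQL; the feed_items table and its columns are made up for the example and would need to match your feed):

<?php
$pdo = new PDO('mysql:host=localhost;dbname=feeds', 'user', 'pass');
$stmt = $pdo->prepare(
    'INSERT INTO feed_items (asin, title, bnid1, bnid2) VALUES (?, ?, ?, ?)'
);
$csv = fopen('datafeed.csv', 'r');
fgetcsv($csv); // skip the header row
while (($line = fgetcsv($csv, 0, ',', '"')) !== false) {
    // store only the columns you actually search on
    $stmt->execute([$line[0], $line[6], $line[20], $line[21]]);
}
fclose($csv);
?>

Searching by ASIN or browsenode then becomes a simple indexed SELECT instead of a full file scan on every request.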

MACMAN
  • Well, I simply thought the overall performance would be better using plain CSV files instead of a database, since Amazon requires the data to be updated on a regular basis (at least every 24 hours), which means comparing the base feed with update feeds that can arrive every 30 minutes – Thomas Tonius Sep 17 '15 at 04:46
  • If possible, try splitting the large file into multiple smaller files. Search for tools that can accomplish that – MACMAN Sep 17 '15 at 05:00
  • Hmm, I still hope that if someone can explain in detail how the memory is used, it can be handled within one file. I already have 82 files to handle. – Thomas Tonius Sep 17 '15 at 05:14
  • Maybe this thread will help you: http://stackoverflow.com/questions/5249279/file-get-contents-php-fatal-error-allowed-memory-exhausted/5249971#5249971 – MACMAN Sep 17 '15 at 10:23

Have you tried this? SplFileObject::fgetcsv

<?php
$file = new SplFileObject("data.csv");
while (!$file->eof()) {
    $line = $file->fgetcsv(); // read and parse one row per iteration
    // your code here
}
?>
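Alternatively, you can let SplFileObject do the CSV parsing itself via its flags, so each iteration already yields an array of fields (a sketch; adjust the flags and CSV control characters to your data):

<?php
$file = new SplFileObject("data.csv");
$file->setFlags(SplFileObject::READ_CSV
    | SplFileObject::READ_AHEAD
    | SplFileObject::SKIP_EMPTY
    | SplFileObject::DROP_NEW_LINE);
$file->setCsvControl(',', '"'); // delimiter and enclosure
foreach ($file as $row) {
    // $row is an array of CSV fields
    // ... your code here ...
}
?>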

You are running out of memory because you keep data in variables and never call unset(), and you use too many nested foreach loops. You could also break that code up into smaller functions. A better solution would be to use a real database instead.
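For example, a minimal sketch of using unset() inside the read loop (though note that PHP's reference counting normally frees an overwritten variable anyway, so the bigger win is usually capping the result arrays that grow with every match):

<?php
while (($line = fgetcsv($csv, 0, ',', '"')) !== false) {
    // ... process $line ...
    unset($line); // explicitly release the parsed row before the next read
}
?>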

Elias Nicolas
  • I do not want to use a database, since I have to update these feeds on a regular basis. I use three arrays that collect data and can contain at most 200 values each; the fourth array is $header, which is reset for each line. I don't unset variables since they are overwritten on every line read in the while loop. But thanks for posting SplFileObject - I'm going to read the manual – Thomas Tonius Sep 17 '15 at 04:53
  • I have tried this, but it doesn't help - thanks anyway! – Thomas Tonius Sep 17 '15 at 08:18