
I have large Excel worksheets that I want to be able to read into MySQL using PHPExcel.

I am using the recent patch which allows you to read in Worksheets without opening the whole file. This way I can read one worksheet at a time.

However, one Excel file is 27 MB. I can successfully read in the first worksheet since it is small, but the second worksheet is so large that the cron job that started the process at 22:00 had still not finished at 8:00 AM; the worksheet is simply too big.

Is there any way to read in a worksheet line by line, e.g. something like this:

$inputFileType = 'Excel2007';
$inputFileName = 'big_file.xlsx';
$objReader = PHPExcel_IOFactory::createReader($inputFileType);
$worksheetNames = $objReader->listWorksheetNames($inputFileName);

foreach ($worksheetNames as $sheetName) {
    //BELOW IS "WISH CODE":
    for ($row = 1; $row <= $max_rows; $row += 100) {
        $dataset = $objReader->getWorksheetWithRows($row, $row+100);
        save_dataset_to_database($dataset);
    }
}

Addendum

@Mark, I used the code you posted to create the following example:

function readRowsFromWorksheet() {

    $file_name = htmlentities($_POST['file_name']);
    $file_type = htmlentities($_POST['file_type']);

    echo 'Read rows from worksheet:<br />';
    debug_log('----------start');
    $objReader = PHPExcel_IOFactory::createReader($file_type);
    $chunkSize = 20;
    $chunkFilter = new ChunkReadFilter();
    $objReader->setReadFilter($chunkFilter);

    for ($startRow = 2; $startRow <= 240; $startRow += $chunkSize) {
        $chunkFilter->setRows($startRow, $chunkSize);
        $objPHPExcel = $objReader->load('data/' . $file_name);
        debug_log('reading chunk starting at row '.$startRow);
        $sheetData = $objPHPExcel->getActiveSheet()->toArray(null, true, true, true);
        var_dump($sheetData);
        echo '<hr />';
    }
    debug_log('end');
}

As the following log file shows, it runs fine on a small 8 KB Excel file, but when I run it on a 3 MB Excel file, it never gets past the first chunk. Is there any way I can optimize this code for performance? As it stands, it doesn't look performant enough to get chunks out of a large Excel file:

2011-01-12 11:07:15: ----------start
2011-01-12 11:07:15: reading chunk starting at row 2
2011-01-12 11:07:15: reading chunk starting at row 22
2011-01-12 11:07:15: reading chunk starting at row 42
2011-01-12 11:07:15: reading chunk starting at row 62
2011-01-12 11:07:15: reading chunk starting at row 82
2011-01-12 11:07:15: reading chunk starting at row 102
2011-01-12 11:07:15: reading chunk starting at row 122
2011-01-12 11:07:15: reading chunk starting at row 142
2011-01-12 11:07:15: reading chunk starting at row 162
2011-01-12 11:07:15: reading chunk starting at row 182
2011-01-12 11:07:15: reading chunk starting at row 202
2011-01-12 11:07:15: reading chunk starting at row 222
2011-01-12 11:07:15: end
2011-01-12 11:07:52: ----------start
2011-01-12 11:08:01: reading chunk starting at row 2
(...at 11:18, CPU usage at 93% still running...)

Addendum 2

When I comment out:

//$sheetData = $objPHPExcel->getActiveSheet()->toArray(null, true, true, true);
//var_dump($sheetData);

Then it parses at an acceptable speed (about 2 rows per second). Is there any way to increase the performance of toArray()?

2011-01-12 11:40:51: ----------start
2011-01-12 11:40:59: reading chunk starting at row 2
2011-01-12 11:41:07: reading chunk starting at row 22
2011-01-12 11:41:14: reading chunk starting at row 42
2011-01-12 11:41:22: reading chunk starting at row 62
2011-01-12 11:41:29: reading chunk starting at row 82
2011-01-12 11:41:37: reading chunk starting at row 102
2011-01-12 11:41:45: reading chunk starting at row 122
2011-01-12 11:41:52: reading chunk starting at row 142
2011-01-12 11:42:00: reading chunk starting at row 162
2011-01-12 11:42:07: reading chunk starting at row 182
2011-01-12 11:42:15: reading chunk starting at row 202
2011-01-12 11:42:22: reading chunk starting at row 222
2011-01-12 11:42:22: end

Addendum 3

This seems to work adequately, at least on the 3 MB file:

for ($startRow = 2; $startRow <= 240; $startRow += $chunkSize) {
    echo 'Loading WorkSheet using configurable filter for headings row 1 and for rows ', $startRow, ' to ', ($startRow + $chunkSize - 1), '<br />';
    $chunkFilter->setRows($startRow, $chunkSize);
    $objPHPExcel = $objReader->load('data/' . $file_name);
    debug_log('reading chunk starting at row ' . $startRow);
    foreach ($objPHPExcel->getActiveSheet()->getRowIterator() as $row) {
        $cellIterator = $row->getCellIterator();
        $cellIterator->setIterateOnlyExistingCells(false);
        echo '<tr>';
        foreach ($cellIterator as $cell) {
            if (!is_null($cell)) {
                //$value = $cell->getCalculatedValue();
                $rawValue = $cell->getValue();
                debug_log($rawValue);
            }
        }
    }
}
Edward Tanguay
  • The var_dump of $sheetData was only in my code snippet to demonstrate how the chunking works, probably not something you'd need in a "real world" use. The rangeToArray() method that I'm currently adding to the Worksheet class would also be more efficient than the toArray() method if you did need to do a worksheet data dump. – Mark Baker Jan 12 '11 at 11:53
  • @Edward Tanguay hi, did you find any solution/alternative for this ? I am having the same problem – Deepanshu Goyal Oct 11 '13 at 09:07
  • 2
    An alternative to PHPExcel is the open source library [Spout](https://github.com/box/spout). It supports reading and writing of huge files, and does not require more than 10MB of memory. And it's super fast! – Adrien Feb 09 '15 at 01:09
  • How do you determine what the value for "240" should be if the number of rows in the spreadsheet is unknown? – juuga Aug 20 '15 at 14:56
  • @Edward Tanguay I know it's now a while you posted this, but would it be possible for you to post the entire code? – Yuri Mar 03 '16 at 07:17
  • @Adrien your suggestion looks amazing, better than any answer down here – Decebal Oct 13 '16 at 15:07

3 Answers


It is possible to read a worksheet in "chunks" using Read Filters, although I can make no guarantees about efficiency.

$inputFileType = 'Excel5';
$inputFileName = './sampleData/example2.xls';


/**  Define a Read Filter class implementing PHPExcel_Reader_IReadFilter  */
class chunkReadFilter implements PHPExcel_Reader_IReadFilter
{
    private $_startRow = 0;

    private $_endRow = 0;

    /**  Set the list of rows that we want to read  */
    public function setRows($startRow, $chunkSize) {
        $this->_startRow = $startRow;
        $this->_endRow   = $startRow + $chunkSize;
    }

    public function readCell($column, $row, $worksheetName = '') {
        //  Only read the heading row, and the rows that are configured in $this->_startRow and $this->_endRow
        if (($row == 1) || ($row >= $this->_startRow && $row < $this->_endRow)) {
            return true;
        }
        return false;
    }
}


echo 'Loading file ',pathinfo($inputFileName,PATHINFO_BASENAME),' using IOFactory with a defined reader type of ',$inputFileType,'<br />';
/**  Create a new Reader of the type defined in $inputFileType  **/

$objReader = PHPExcel_IOFactory::createReader($inputFileType);



echo '<hr />';


/**  Define how many rows we want to read for each "chunk"  **/
$chunkSize = 20;
/**  Create a new Instance of our Read Filter  **/
$chunkFilter = new chunkReadFilter();

/**  Tell the Reader that we want to use the Read Filter that we've Instantiated  **/
$objReader->setReadFilter($chunkFilter);

/**  Loop to read our worksheet in "chunk size" blocks  **/
/**  $startRow is set to 2 initially because we always read the headings in row #1  **/

for ($startRow = 2; $startRow <= 240; $startRow += $chunkSize) {
    echo 'Loading WorkSheet using configurable filter for headings row 1 and for rows ',$startRow,' to ',($startRow+$chunkSize-1),'<br />';
    /**  Tell the Read Filter, the limits on which rows we want to read this iteration  **/
    $chunkFilter->setRows($startRow,$chunkSize);
    /**  Load only the rows that match our filter from $inputFileName to a PHPExcel Object  **/
    $objPHPExcel = $objReader->load($inputFileName);

    //    Do some processing here

    $sheetData = $objPHPExcel->getActiveSheet()->toArray(null,true,true,true);
    var_dump($sheetData);
    echo '<br /><br />';
}

Note that this Read Filter will always read the first row of the worksheet, as well as the rows defined by the chunk rule.

When using a read filter, PHPExcel still parses the entire file, but only loads those cells that match the defined read filter, so it only uses the memory required by that number of cells. However, it will parse the file multiple times, once for each chunk, so it will be slower. This example reads 20 rows at a time: to read line by line, simply set $chunkSize to 1.

This can also cause problems if you have formulae that reference cells in different "chunks", because the data simply isn't available for cells outside of the current "chunk".
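If the total row count isn't known in advance (the loop above hard-codes 240), the reader's listWorksheetInfo() method can report it without loading any cell data. Below is a sketch of that approach, reusing $inputFileName and the chunkReadFilter class from the code above; the chunkStartRows() helper is a hypothetical name introduced here for illustration:

```php
<?php
// Sketch: derive the chunk loop bounds from the file itself rather than a
// hard-coded row count.

/** Pure helper: the start row of each chunk, given the total row count. */
function chunkStartRows($totalRows, $chunkSize, $firstDataRow = 2) {
    $starts = array();
    for ($row = $firstDataRow; $row <= $totalRows; $row += $chunkSize) {
        $starts[] = $row;
    }
    return $starts;
}

if (class_exists('PHPExcel_IOFactory')) {
    $objReader = PHPExcel_IOFactory::createReader('Excel2007');
    // listWorksheetInfo() scans the file and reports 'totalRows' per sheet
    // without building a PHPExcel object.
    $info      = $objReader->listWorksheetInfo($inputFileName);
    $totalRows = $info[0]['totalRows'];   // first worksheet

    $chunkFilter = new chunkReadFilter();
    $objReader->setReadFilter($chunkFilter);
    foreach (chunkStartRows($totalRows, 20) as $startRow) {
        $chunkFilter->setRows($startRow, 20);
        $objPHPExcel = $objReader->load($inputFileName);
        //  ... process the chunk here ...
        $objPHPExcel->disconnectWorksheets();   // release memory between chunks
        unset($objPHPExcel);
    }
}
```

Freeing the PHPExcel object between iterations keeps each chunk's memory footprint from accumulating across loads.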

Mark Baker
  • I'm trying your code in a test file, but it tells me `Class 'ChunkReadFilter' not found`. If I take off the `implements PHPExcel_Reader_IReadFilter` then it finds the class and tells me I need to `must implement interface PHPExcel_Reader_IReadFilter`, I put at the beginning of my file `require_once 'PHPExcelClasses/PHPExcel/Reader/IReadFilter.php'` and `require_once 'PHPExcelClasses/PHPExcel/Reader/IReader.php'` but it still can't find the class if I implement this interface, is there some other file I need to include? – Edward Tanguay Jan 12 '11 at 09:40
  • just had to put the `ChunkReadFilter` class above the main code, works now, thanks – Edward Tanguay Jan 12 '11 at 09:48
  • I tried that code in a test (posted above). Although it works on a small file (8K), it doesn't seem to get past the first chunk on a 3 MB file. – Edward Tanguay Jan 12 '11 at 10:22
  • It will work on any size file, but it won't necessarily be quick. Each "chunk" iteration still needs to read and parse the entire file, which does take time... the larger the "chunks" you can work with, the fewer iterations, the fewer times it needs to read/parse the file. It's a trade-off between memory usage and speed of execution. – Mark Baker Jan 12 '11 at 11:50
  • 1
    I've reduced the chunk size to 1 row but even then on the largest worksheet in the 27MB Excel file after 50 seconds I get `Out of memory (allocated 1538523136)`, I'm setting the memory limit to pretty much the max: `ini_set("memory_limit", "3700M");`. I'm using my last code example above (Addendum 3) so that I know it is not calculating the cells but just giving me the raw value. Is there any other way I can prevent the use of so much memory so that it can at least read one row at a time? – Edward Tanguay Jan 12 '11 at 13:12
  • I don't know what's eating all your memory. The whole file is parsed in memory; but that shouldn't take 3.7GB. And the PHPExcel object that it builds should be very tiny when it only contains 1 worksheet/row. I've been running tests on a 64k row x 17 column file (23MB in size, though filesize isn't overly meaningful as a measure of required memory). It takes about 9 minutes to read each "chunk" of 256 rows, but does so in less than 512M (my memory_limit setting). If the technique that should work (and that works in my local test setup) isn't working, I have absolutely no idea what you can do. – Mark Baker Jan 12 '11 at 13:33
  • A worksheet with dimensions `25000 x GP` (in a 27MB file) after 50 seconds gets the error: `Out of memory (allocated 1538523136)`. – Edward Tanguay Jan 12 '11 at 13:34
  • A worksheet with dimensions `40000 x GA` (in a 84MB file) after 90 seconds gets the error: `(null)Entity: line 2: parser error : out of memory error in /.../PHPExcel/Reader/Excel2007.php on line 561` – Edward Tanguay Jan 12 '11 at 13:36
  • A worksheet with dimensions `970 x EU` (in the 84MB file) is read in fine in 03:50. – Edward Tanguay Jan 12 '11 at 13:46
  • So only those two large worksheets get memory errors, I looked through them and they don't seem to have anything out of the ordinary in their cells, just: numbers, text, and some formulas e.g. CONCATENATE, VLOOKUP, and SUM, but I explicitly don't execute those anyway as shown in the "Addendum 3" code above. – Edward Tanguay Jan 12 '11 at 13:53
  • ok, thanks for your help, I'm going to now process these with VBA and C# to see if they have any problems, perhaps there is some kind of bad data in these worksheets which causes abnormal memory use in other libraries as well... – Edward Tanguay Jan 12 '11 at 13:56
  • You could try a simple for loop through the rows with a getCell() call, rather than using the iterators: there have been a couple of problems reported with the row/column iterators; and I've noticed that the row iterator works on a range of row 1 to the getHighestRow() value for the worksheet. It could be that this is forcing the creation of empty cells to replace those that weren't created by the load process. – Mark Baker Jan 12 '11 at 14:11
  • If you don't set $cellIterator->setIterateOnlyExistingCells(false) you'll improve speed and memory usage. With that line, the iterator is processing all rows and cells from the worksheet (even if they weren't loaded). – Mark Baker Jan 12 '11 at 16:09
  • 1
    Executing $objPHPExcel->disconnectWorksheets(); unset($objPHPExcel); after the "foreach" loop should also help free up any memory leaks from the PHPExcel object in each "chunk" iteration. – Mark Baker Jan 12 '11 at 16:16
  • Great, I'll try that with setIterateOnlyExistingCells, but will it still look past blank rows and columns? – Edward Tanguay Jan 12 '11 at 16:16
  • 1
    Interesting: The two large files (27MB and 84MB) are .xlsx files. I saved the 27MB file as a Excel2004 (Mac) and was able to read it in with Excel5 format no problem. I'll try this with the 84MB file as well. – Edward Tanguay Jan 12 '11 at 16:16
  • $cellIterator->setIterateOnlyExistingCells(true) will still read all the cells that have been loaded. It just won't read cells that don't exist. setIterateOnlyExistingCells(true) is the default behaviour – Mark Baker Jan 12 '11 at 16:18
  • As we all know, xlsx is a zip archive containing XML, right? I have a 3 MB xlsx, but if we rename 1.xlsx to 1.zip and unzip it, we see a directory of XML files. Going to xl/worksheets, I see an 18 MB file there, which PHPExcel loads via simplexml_load_file... The chunk filter has no effect with xlsx, because such large XML files need to be read with XMLReader or some other streaming XML reader. – Andrey Vorobyev Dec 08 '14 at 17:21
  • You're right that chunk reading won't prevent the entire XML file being loaded by the load process because SimpleXML always loads the entire file; but chunk reading does restrict the size of the PHPExcel object that is populated from that XML file, so it does have an effect – Mark Baker Dec 08 '14 at 17:28

Currently, to read .xlsx, .csv and .ods files the best option is spreadsheet-reader (https://github.com/nuovo/spreadsheet-reader), because it can read the files without loading them entirely into memory. For the .xls extension it has limitations, because it uses PHPExcel for reading.
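A sketch of what streaming with spreadsheet-reader looks like (untested here; see the project's README for the exact setup and required includes). Rows arrive one at a time, so memory use stays flat regardless of file size; the rowIsEmpty() helper and the save_row_to_database() call are hypothetical names introduced for illustration:

```php
<?php
// Streaming read with nuovo/spreadsheet-reader: the reader is an Iterator,
// so a plain foreach yields one row (as an array of cell values) at a time.

/** Pure helper: true when a row contains no non-empty cells. */
function rowIsEmpty(array $row) {
    foreach ($row as $cell) {
        if ($cell !== null && $cell !== '') {
            return false;
        }
    }
    return true;
}

if (class_exists('SpreadsheetReader')) {
    $reader = new SpreadsheetReader('big_file.xlsx');
    foreach ($reader->Sheets() as $index => $name) {
        $reader->ChangeSheet($index);        // switch to each worksheet in turn
        foreach ($reader as $row) {          // one row at a time, low memory
            if (!rowIsEmpty($row)) {
                // save_row_to_database($row);  // hypothetical handler
            }
        }
    }
}
```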

Leonardo Delfino
  • You will probably want to head over to https://github.com/box if you run into this problem https://github.com/nuovo/spreadsheet-reader/issues/59 with nuovo. At the time of writing, the box tool didn't exist yet. – Tebe Mar 21 '18 at 11:29

This is the ChunkReadFilter.php :

<?php
Class ChunkReadFilter implements PHPExcel_Reader_IReadFilter {

    private $_startRow = 0;
    private $_endRow = 0;

    /**  Set the list of rows that we want to read  */
    public function setRows($startRow, $chunkSize) {
        $this->_startRow = $startRow;
        $this->_endRow = $startRow + $chunkSize;
    }

    public function readCell($column, $row, $worksheetName = '') {

        //  Only read the heading row, and the rows that are configured in $this->_startRow and $this->_endRow 
        if (($row == 1) || ($row >= $this->_startRow && $row < $this->_endRow)) {

            return true;
        }
        return false;
    }

}
?>

And this is index.php, with a basic (though not perfect) implementation at the end of the file.

<?php

require_once './Classes/PHPExcel/IOFactory.php';
require_once 'ChunkReadFilter.php';

class Excelreader {

    /**
     * This function is used to read data from excel file in chunks and insert into database
     * @param string $filePath
     * @param integer $chunkSize
     */
    public function readFileAndDumpInDB($filePath, $chunkSize) {
        echo("Loading file " . $filePath . " ....." . PHP_EOL);
        /**  Create a new Reader of the type that has been identified  * */
        $objReader = PHPExcel_IOFactory::createReader(PHPExcel_IOFactory::identify($filePath));

        $spreadsheetInfo = $objReader->listWorksheetInfo($filePath);

        /**  Create a new Instance of our Read Filter  * */
        $chunkFilter = new ChunkReadFilter();

        /**  Tell the Reader that we want to use the Read Filter that we've Instantiated  * */
        $objReader->setReadFilter($chunkFilter);
        $objReader->setReadDataOnly(true);
        //$objReader->setLoadSheetsOnly("Sheet1");
        //get header column name
        $chunkFilter->setRows(0, 1);
        echo("Reading file " . $filePath . PHP_EOL . "<br>");
        $totalRows = $spreadsheetInfo[0]['totalRows'];
        echo("Total rows in file " . $totalRows . " " . PHP_EOL . "<br>");

        /**  Loop to read our worksheet in "chunk size" blocks  * */
        /**  $startRow is set to 1 initially because we always read the headings in row #1  * */
        for ($startRow = 1; $startRow <= $totalRows; $startRow += $chunkSize) {
            echo("Loading WorkSheet for rows " . $startRow . " to " . ($startRow + $chunkSize - 1) . PHP_EOL . "<br>");
            $i = 0;
            /**  Tell the Read Filter, the limits on which rows we want to read this iteration  * */
            $chunkFilter->setRows($startRow, $chunkSize);
            /**  Load only the rows that match our filter from $inputFileName to a PHPExcel Object  * */
            $objPHPExcel = $objReader->load($filePath);
            $sheetData = $objPHPExcel->getActiveSheet()->toArray(null, true, true, false);

            $startIndex = ($startRow == 1) ? $startRow : $startRow - 1;
            //dumping in database
            if (!empty($sheetData) && $startRow < $totalRows) {
                /**
                 * $this->dumpInDb(array_slice($sheetData, $startIndex, $chunkSize));
                 */

                echo "<table border='1'>";
                foreach ($sheetData as $key => $value) {
                    $i++;
                    if ($value[0] != null) {
                        echo "<tr><td>id:$i</td><td>{$value[0]} </td><td>{$value[1]} </td><td>{$value[2]} </td><td>{$value[3]} </td></tr>";
                    }
                }
                echo "</table><br/><br/>";
            }
            $objPHPExcel->disconnectWorksheets();
            unset($objPHPExcel, $sheetData);
        }
        echo("File " . $filePath . " has been uploaded successfully in database" . PHP_EOL . "<br>");
    }

    /**
     * Insert data into database table
     * (illustrative; not tested against a real database, only the public method above is)
     * @param Array $sheetData
     * @return boolean
     * @throws Exception
     */
    protected function dumpInDb($sheetData) {

        $con = DbAdapter::getDBConnection();
        $query = "INSERT INTO employe (name, address) VALUES ";

        for ($i = 1; $i < count($sheetData); $i++) {
            // Each value tuple must be comma-separated; escape via mysqli
            $query .= "('" . mysqli_real_escape_string($con, $sheetData[$i][0]) . "',"
                    . "'" . mysqli_real_escape_string($con, $sheetData[$i][1]) . "'),";
        }

        $query = rtrim($query, ",");
        $query .= " ON DUPLICATE KEY UPDATE name=VALUES(name), address=VALUES(address)";
        if (mysqli_query($con, $query)) {
            mysqli_close($con);
            return true;
        } else {
            // Capture the error before closing the connection
            $error = mysqli_error($con);
            mysqli_close($con);
            throw new Exception($error);
        }
    }

    /**
     * This function returns list of files corresponding to given directory path
     * @param String $dataFolderPath
     * @return Array list of file
     */
    protected function getFileList($dataFolderPath) {
        if (!is_dir($dataFolderPath)) {
            throw new Exception("Directory " . $dataFolderPath . " does not exist");
        }
        $root = scandir($dataFolderPath);
        $fileList = array();
        foreach ($root as $value) {
            if ($value === '.' || $value === '..') {
                continue;
            }
            if (is_file("$dataFolderPath/$value")) {
                $fileList[] = "$dataFolderPath/$value";
                continue;
            }
        }
        return $fileList;
    }

}

$inputFileName = './prueba_para_batch.xls';
$excelReader = new Excelreader();
$excelReader->readFileAndDumpInDB($inputFileName, 500);
Andres Paladines