
I am using the function CSVToArray().

<script type="text/javascript">
    // ref: http://stackoverflow.com/a/1293163/2343
    // This will parse a delimited string into an array of
    // arrays. The default delimiter is the comma, but this
    // can be overridden in the second argument.
    function CSVToArray( strData, strDelimiter ){
        // Check to see if the delimiter is defined. If not,
        // then default to comma.
        strDelimiter = (strDelimiter || ",");
 
        // Create a regular expression to parse the CSV values.
        var objPattern = new RegExp(
            (
                // Delimiters.
                "(\\" + strDelimiter + "|\\r?\\n|\\r|^)" +
 
                // Quoted fields.
                "(?:\"([^\"]*(?:\"\"[^\"]*)*)\"|" +
 
                // Standard fields.
                "([^\"\\" + strDelimiter + "\\r\\n]*))"
            ),
            "gi"
            );
 
 
        // Create an array to hold our data. Give the array
        // a default empty first row.
        var arrData = [[]];
 
        // Create an array to hold our individual pattern
        // matching groups.
        var arrMatches = null;
 
 
        // Keep looping over the regular expression matches
        // until we can no longer find a match.
        while (arrMatches = objPattern.exec( strData )){
 
            // Get the delimiter that was found.
            var strMatchedDelimiter = arrMatches[ 1 ];
 
            // Check to see if the given delimiter has a length
            // (is not the start of string) and if it matches
            // field delimiter. If it does not, then we know
            // that this delimiter is a row delimiter.
            if (
                strMatchedDelimiter.length &&
                strMatchedDelimiter !== strDelimiter
                ){
 
                // Since we have reached a new row of data,
                // add an empty row to our data array.
                arrData.push( [] );
 
            }
 
            var strMatchedValue;
 
            // Now that we have our delimiter out of the way,
            // let's check to see which kind of value we
            // captured (quoted or unquoted).
            if (arrMatches[ 2 ]){
 
                // We found a quoted value. When we capture
                // this value, unescape any double quotes.
                strMatchedValue = arrMatches[ 2 ].replace(
                    new RegExp( "\"\"", "g" ),
                    "\""
                    );
 
            } else {
 
                // We found a non-quoted value.
                strMatchedValue = arrMatches[ 3 ];
 
            }
 
 
            // Now that we have our value string, let's add
            // it to the data array.
            arrData[ arrData.length - 1 ].push( strMatchedValue );
        }
 
        // Return the parsed data.
        return( arrData );
    }
 
</script>

The problem is I have a CSV file of size 40 MB with around 300,000 rows. After parsing the CSV into a string buffer with a comma delimiter (using Node.js), I pipe this buffer into this function, and it never reads beyond row 96,000.
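
Roughly, the calling code looks like this (just a sketch; the file path and callback are made up, not my exact server code, and it assumes CSVToArray from above is in scope):

    var fs = require('fs');

    fs.readFile('./uploads/data.csv', 'utf8', function (err, strBuffer) {
        if (err) throw err;
        // The whole 40 MB file is handed to CSVToArray in one call, so the
        // string and the resulting ~300,000-row array must fit in memory
        // at the same time.
        var arrData = CSVToArray(strBuffer, ',');
        console.log('rows parsed: ' + arrData.length);
    });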

This is a JS out-of-memory problem.

What is the right technique/algorithm to use for this function so that it can build an array out of a CSV file of any size?
For example, an algorithm typically used to split big files/data into smaller subfiles/chunks?

    There is no limit to the number of properties that an object can have. The limit for the Array [*length*](http://ecma-international.org/ecma-262/5.1/#sec-15.4.5.2) property is `2^32 - 1`, which is a long way beyond 96,000. That doesn't limit the highest index though. – RobG Aug 29 '14 at 02:51

1 Answer


MY SOLUTION:

(hope it helps those who have the same scenario)

  • I used formidable to upload the file (it's amazingly fast for large files).
  • I parse the file on the server into a string buffer while dividing it into separate chunks of data.
  • Each chunk is sent to a Node.js event (a powerful feature in Node) to be converted to a JSON array (using the function above, although I modified it a bit) separately, i.e. the chunks are processed in parallel.

This solved the issue and improved both speed and performance. A rough sketch of the chunking idea is shown below.
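
For illustration, here is a minimal sketch of that chunking idea (the file path, event name, and stream handling are assumptions rather than my exact code; it assumes CSVToArray from the question is in scope, and it splits chunks on line breaks, so it would break on quoted fields that contain newlines):

    var fs = require('fs');
    var EventEmitter = require('events').EventEmitter;

    var parser = new EventEmitter();

    // Each 'chunk' event receives a block of complete CSV lines, so only
    // one chunk (plus its parsed rows) has to live in memory at a time.
    parser.on('chunk', function (chunkText) {
        var rows = CSVToArray(chunkText, ',');
        // ...store rows in a DB, write them to disk, etc...
    });

    var leftover = '';
    fs.createReadStream('./uploads/data.csv', { encoding: 'utf8' })
        .on('data', function (piece) {
            // Emit only complete lines; keep the trailing partial line
            // for the next 'data' event.
            var text = leftover + piece;
            var lastNewline = text.lastIndexOf('\n');
            if (lastNewline === -1) {
                leftover = text;
                return;
            }
            leftover = text.slice(lastNewline + 1);
            parser.emit('chunk', text.slice(0, lastNewline + 1));
        })
        .on('end', function () {
            if (leftover.length) {
                parser.emit('chunk', leftover);
            }
        });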
