
Currently I have about 12 csv files, each having about 1.5 million records.

I'm using univocity-parsers as my csv reader/parser library.

Using univocity-parsers, I read each file and add all the records to an ArrayList with the addAll() method. When all 12 files are parsed and added to the list, my code prints the size of the ArrayList at the end.

for (int i = 0; i < 12; i++) {
    myList.addAll(parser.parseAll(getReader("file-" + i + ".csv")));
}

It works fine at first, but when I reach my 6th consecutive file it seems to take forever; the IntelliJ IDE output window never prints the ArrayList size, even after an hour, whereas before the 6th file it was rather fast.

If it helps, I'm running on a MacBook Pro (mid 2014) with OS X Yosemite.

This is a textbook exercise on fork/join concurrency.

snowpolar
  • Maybe http://stackoverflow.com/questions/21637293/memory-issue-when-reading-huge-csv-file-store-as-person-objects-write-into-mul?rq=1 –  Sep 13 '15 at 07:28
  • If you only care about the number of items, as suggested in your question, there's no need to store the whole content in memory. – qqilihq Sep 13 '15 at 07:33
  • @qqilihq Hi, that is only my first step; my second step is to infer some statistics from it, such as how many comedy books, etc. It was from a textbook concurrency processing exercise section. – snowpolar Sep 13 '15 at 07:35

3 Answers


I'm the creator of this library. If you want to just count rows, use a RowProcessor. You don't even need to count the rows yourself as the parser does that for you:

// Let's create our own RowProcessor to analyze the rows
static class RowCount extends AbstractRowProcessor {

    long rowCount = 0;

    @Override
    public void processEnded(ParsingContext context) {
        // this returns the number of the last valid record.
        rowCount = context.currentRecord();
    }
}

public static void main(String... args) throws FileNotFoundException {
    // let's measure the time roughly
    long start = System.currentTimeMillis();

    //Creates an instance of our own custom RowProcessor, defined above.
    RowCount myRowCountProcessor = new RowCount();

    CsvParserSettings settings = new CsvParserSettings();

    //Here you can select the column indexes you are interested in reading.
    //The parser will return values for the columns you selected, in the order you defined
    //By selecting no indexes here, no String objects will be created
    settings.selectIndexes(/*nothing here*/);

    //When you select indexes, the columns are reordered so they come in the order you defined.
    //By disabling column reordering, you will get the original row, with nulls in the columns you didn't select
    settings.setColumnReorderingEnabled(false);

    //We instruct the parser to send all rows parsed to your custom RowProcessor.
    settings.setRowProcessor(myRowCountProcessor);

    //Finally, we create a parser
    CsvParser parser = new CsvParser(settings);

    //And parse! All rows are sent to your custom RowProcessor (RowCount)
    //I'm using a 150MB CSV file with 3.1 million rows.
    parser.parse(new File("c:/tmp/worldcitiespop.txt"));

    //Nothing else to do. The parser closes the input and does everything for you safely. Let's just get the results:
    System.out.println("Rows: " + myRowCountProcessor.rowCount);
    System.out.println("Time taken: " + (System.currentTimeMillis() - start) + " ms");

}

Output

Rows: 3173959
Time taken: 1062 ms

Edit: I saw your comment regarding your need to use the actual data in the rows. In this case, process the rows in the rowProcessed() method of the RowProcessor class; that's the most efficient way to handle this.
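For example, here is a minimal sketch of such a processor. The comedy-counting logic and the genre column index are assumptions based on your comment; adjust them to your file layout:

static class ComedyCounter extends AbstractRowProcessor {

    long comedyCount = 0;

    @Override
    public void rowProcessed(String[] row, ParsingContext context) {
        // Hypothetical: assumes the genre is in the first column of each row.
        if (row.length > 0 && "comedy".equalsIgnoreCase(row[0])) {
            comedyCount++;
        }
    }
}

Register it with settings.setRowProcessor(...) exactly as in the example above; only the current row is ever held in memory.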

Edit 2:

If you want to just count rows, use getInputDimension from CsvRoutines:

    CsvRoutines csvRoutines = new CsvRoutines();
    InputDimension d = csvRoutines.getInputDimension(new File("/path/to/your.csv"));
    System.out.println(d.rowCount());
    System.out.println(d.columnCount());
Jeronimo Backes

In parseAll, the list is preallocated with a capacity of 10,000 elements:

/**
 * Parses all records from the input and returns them in a list.
 *
 * @param reader the input to be parsed
 * @return the list of all records parsed from the input.
 */
public final List<String[]> parseAll(Reader reader) {
    List<String[]> out = new ArrayList<String[]>(10000);
    beginParsing(reader);
    String[] row;
    while ((row = parseNext()) != null) {
        out.add(row);
    }
    return out;
}

If you have millions of records (lines in the file, I guess), this hurts performance and memory allocation, because the ArrayList will repeatedly double its capacity and copy its contents whenever it needs more space.

You could try to implement your own parseAll method like this:

public List<String[]> parseAll(Reader reader, int numberOfLines) {
    List<String[]> out = new ArrayList<String[]>(numberOfLines);
    parser.beginParsing(reader);
    String[] row;
    while ((row = parser.parseNext()) != null) {
        out.add(row);
    }
    return out;
}

And check if it helps.
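If you don't know the line count in advance, a hypothetical countLines helper could obtain it with a streaming pass (note this costs one extra read of each file, and the result must be cast to int for the list capacity):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Counts lines without loading the file, so the ArrayList can be sized once.
static long countLines(Path path) throws IOException {
    try (Stream<String> lines = Files.lines(path)) {
        return lines.count();
    }
}

Whether the extra pass pays off depends on your disk; as noted in the comment below, the real bottleneck here is likely heap space rather than list resizing.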

John Tracid
  • In fact, the preallocated number of lines is not really the problem here, as the ArrayList expands very quickly. The problem is that his VM doesn't have enough memory to keep everything in it. – Jeronimo Backes Oct 15 '15 at 11:06

The problem is that you are running out of memory. When this happens, the computer begins to crawl, since it starts swapping memory to disk and back.
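You can confirm the diagnosis from inside the JVM with a small sketch using the standard Runtime API:

// Print the JVM's heap ceiling; if ~18 million parsed rows need more than
// this, the process will thrash the garbage collector or start swapping.
long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
System.out.println("Max heap: " + maxHeapMb + " MB");

If the data genuinely must fit in memory, you can raise the ceiling with the -Xmx flag (e.g. java -Xmx4g ...), but for this task streaming is the better fix, as shown below.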

Reading the whole contents into memory is definitely not the best strategy to follow. And since you are only interested in calculating some statistics, you do not even need to use addAll() at all.

The objective in computer science is always to strike a balance between memory consumption and execution speed: you can trade memory for speed, or speed for memory savings.

So, loading whole files into memory is convenient, but it is not a real solution, not even in the future when computers ship with terabytes of memory.

public int getNumRecords(CsvParser parser, Reader reader, int start) {
    int toret = start;

    parser.beginParsing(reader);
    while (parser.parseNext() != null) {
        ++toret;
    }

    return toret;
}

As you can see, this function holds nothing in memory beyond the current row; you can call it in a loop over your CSV files and finish with the total count of rows. The next step is to create a class for all your statistics, substituting that int start with your object.

class Statistics {
    public Statistics() {
        numRows = 0;
        numComedies = 0;
    }

    public void countRow() {
        ++numRows;
    }

    public void countComedies() {
        ++numComedies;
    }

    public int getNumRows() {
        return numRows;
    }

    // more things...
    private int numRows;
    private int numComedies;
}

public int calculateStatistics(CsvParser parser, Reader reader, Statistics stats) {
    parser.beginParsing(reader);

    String[] row;
    while ((row = parser.parseNext()) != null) {
        stats.countRow();
        // inspect row here to update the other counters, e.g. stats.countComedies()
    }

    return stats.getNumRows();
}
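Putting it together for the 12 files from the question (a sketch; getReader is the helper from your own code):

Statistics stats = new Statistics();

for (int i = 0; i < 12; i++) {
    // A fresh parser per file keeps the parsing state simple.
    CsvParser parser = new CsvParser(new CsvParserSettings());
    calculateStatistics(parser, getReader("file-" + i + ".csv"), stats);
}

System.out.println("Total rows: " + stats.getNumRows());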

Hope this helps.

Baltasarq