
My application currently uses CSVParser to parse CSV files and persist the data to a database. It loads the entire CSV into memory and takes a long time to persist, sometimes even timing out. I have seen mixed recommendations on the site about using the univocity parser. Please advise on the best approach to process large amounts of data in less time.
Thank you.

Code:

    int numRecords = csvParser.parse(fileBytes);

    public int parse(InputStream ins) throws ParserException {
        long parseTime = System.currentTimeMillis();
        fireParsingBegin();
        ParserEngine engine = null;
        try {
            engine = (ParserEngine) getEngineClass().newInstance();
        } catch (Exception e) {
            throw new ParserException(e.getMessage());
        }
        engine.setInputStream(ins);
        engine.start();
        int count = parse(engine);
        fireParsingDone();
        long seconds = (System.currentTimeMillis() - parseTime) / 1000;
        System.out.println("Time taken is " + seconds);
        return count;
    }


    protected int parse(ParserEngine engine) throws ParserException {
        int count = 0;
        while (engine.next()) { // values String array in the engine is populated with cell data
            if (stopParsing) {
                break;
            }

            Object o = parseObject(engine); // create individual TOs
            if (o != null) {
                count++; // count is increased after every TO is formed
                fireObjectParsed(o, engine); // put it into the BO/collection and do validation preparations
            } else {
                return count;
            }
        }
        return count;
    }

StarFish
  • There are different ways to read a file which performance is commented in [this other SO question](https://stackoverflow.com/questions/4716503/reading-a-plain-text-file-in-java/40597140#40597140). – Serg M Ten Oct 29 '18 at 15:13
  • Depends on the application. I would think that in most situations the bottleneck would be pushing the data to persistence rather than reading from a CSV file. Given that the file is huge, you may want to only partially load the CSV data into memory to ensure that you are not memory bound. – flakes Oct 29 '18 at 15:14
  • “It loads the entire csv into memory” ← That is the cause of your problem. Don’t do that. Parse each line after reading it. The whole point of InputStreams and Readers is having manageable amounts of data in memory. – VGR Oct 29 '18 at 15:51
  • Thank you for the response. I have updated the question with my code. We are converting the file into fileBytes and calling parse(byte bytes[]). Do I need to change my implementation here? Any sample code that you can refer to? – StarFish Oct 29 '18 at 16:12
  • Is there a way to send file bytes in chunks in java for parsing? – StarFish Oct 29 '18 at 16:29
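
Following up on the comments above, a minimal sketch (the file path is a placeholder) of reading the file line by line with a BufferedReader instead of converting it to a byte array first, so only one row is held in memory at a time:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class StreamingParseSketch {
        public static void main(String[] args) throws IOException {
            try (BufferedReader reader = Files.newBufferedReader(
                    Paths.get("/path/to/your.csv"), StandardCharsets.UTF_8)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // parse and persist this single line, then let it go out of scope
                    // so the whole file never sits in memory at once
                }
            }
        }
    }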

3 Answers


univocity-parsers is your best bet for loading the CSV file; you probably won't be able to hand-code anything faster. The problems you are having possibly come from two things:

1 - loading everything in memory. That's generally a bad design decision, but if you do that, make sure to have enough memory allocated for your application. Give it more memory using the flags -Xms8G and -Xmx8G, for example (see the launch example below the list).

2 - you are probably not batching your insert statements.
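
For example, assuming the application is launched from a jar (the jar name below is a placeholder):

    java -Xms8G -Xmx8G -jar your-application.jar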

My suggestion is to try this (using univocity-parsers):

    // configure the input format
    CsvParserSettings settings = new CsvParserSettings();

    // get an iterator
    CsvParser parser = new CsvParser(settings);
    Iterator<String[]> it = parser.iterate(new File("/path/to/your.csv"), "UTF-8").iterator();

    //connect to the database and create an insert statement
    Connection connection = getYourDatabaseConnectionSomehow();
    final int COLUMN_COUNT = 2;
    PreparedStatement statement = connection.prepareStatement("INSERT INTO some_table(column1, column2) VALUES (?,?)"); 

    //run batch inserts of 1000 rows per batch
    int batchSize = 0;
    while (it.hasNext()) {
        //get next row from parser and set values in your statement
        String[] row = it.next(); 
        for(int i = 0; i < COLUMN_COUNT; i++){ 
            if(i < row.length){
                statement.setObject(i + 1, row[i]);
            } else { //row in input is shorter than COLUMN_COUNT
                statement.setObject(i + 1, null);   
            }
        }

        //add the values to the batch
        statement.addBatch();
        batchSize++;

        //once 1000 rows made into the batch, execute it
        if (batchSize == 1000) {
            statement.executeBatch();
            batchSize = 0;
        }
    }
    // the last batch probably won't have 1000 rows.
    if (batchSize > 0) {
        statement.executeBatch();
    }

This should execute pretty quickly, and you won't need even 100 MB of memory to run it.

For the sake of clarity, I didn't use any try/catch/finally block to close any resources here. Your actual code must handle that.
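
For example, a minimal sketch of that resource handling, reusing the names from the snippet above (getYourDatabaseConnectionSomehow() remains a placeholder):

    try (Connection connection = getYourDatabaseConnectionSomehow();
         PreparedStatement statement = connection.prepareStatement(
                 "INSERT INTO some_table(column1, column2) VALUES (?,?)")) {
        // ... the parsing and batching loop shown above goes here ...
    } // both resources are closed here, even if an exception is thrown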

Hope it helps.

Jeronimo Backes
  • Thank you Jeronimo. The application is already using -Xms8G and Xmx8G. I will try using the batch implementation you suggested. Thank you much again for the inputs. – StarFish Oct 30 '18 at 13:20
  • Hi Jeronimo, I looked at the code and we are using CSVParser and ParserObserver, and it takes 1 sec for each row in a csv file to parse and validate. But for a file which has 120k records it takes about 1 hr plus to finish uploading to the database because it is always processed in a serial way. Can you suggest ways to implement this in parallel. – StarFish Nov 10 '18 at 22:09
  • Also to add, my application is using -Xms8G and -Xmx24G – StarFish Nov 10 '18 at 22:14
  • Univocity parsers doesn't have a parserobserver class. Are you using the right lib? – Jeronimo Backes Nov 11 '18 at 00:02
  • You should process 120k records in less than 2 seconds (csv) and need another 10 seconds max to insert it all in the database. – Jeronimo Backes Nov 11 '18 at 00:03
  • Thank you for your prompt reply. The application is not using the Univocity parser. The code I posted in this question has the parsing and validation logic. This is where most of the processing time is going. Looking for something that we can tweak with the same logic that can improve performance? We earlier had the batch size as 30,000 and changing that to 1000 has given us significant improvement. Any other suggestions please? – StarFish Nov 11 '18 at 01:06
  • Try using univocity-parsers for parsing your file. If you created your own in-house parser, it is very likely slow. Parsing csv is not as simple as it initially looks. – Jeronimo Backes Nov 11 '18 at 03:15
  • Instead of changing the parser, Jeronimo, is there a way I can kick off multiple batches in parallel after we load in memory so that there is parallel processing happening? What's the approach? Any sample code that can help? My batch size is 1000; for a 120k file can we have like 12 batches kicked off? Also, I can update the code with the parser and validation logic if you can suggest any other improvements. Thanks much again – StarFish Nov 11 '18 at 13:08
  • How long does your current parser take to load everything in memory? Concurrent batches against the same table will be slower than a single process writing to the table alone. The database will likely make the other processes wait, or queue each process data to then execute the insertion sequentially. You might get some speed by writing into different tables. But then you'll have much more complexity at the database level. – Jeronimo Backes Nov 11 '18 at 16:43
  • The loading into memory is immediate, but the parsing and validation of each row makes up the total processing time, which is where the bottleneck is. Persisting to the db also happens in a few secs. – StarFish Nov 11 '18 at 17:09
  • We need to write it to one output table. On the db end, we changed the primary key cache on the table to 10000, which was shown to improve performance: ID_DET BIGINT GENERATED ALWAYS AS IDENTITY ( START WITH 1, INCREMENT BY 1, NO CYCLE, MINVALUE 1, MAXVALUE 9223372036854775807, CACHE 10000, NO ORDER ), – StarFish Nov 11 '18 at 17:12
  • So it seems your parser is the only bottleneck left. Any reason in particular you can't use a lib instead? Did you run any test to see if using a lib can improve things? – Jeronimo Backes Nov 11 '18 at 17:19
  • which lib are you referring to? – StarFish Nov 11 '18 at 17:32
  • @StarFish a csv parsing lib such as univocity parsers. – Jeronimo Backes Nov 11 '18 at 17:51
  • Thanks Jeronimo. I will try looking at how to implement this with the existing framework. One question that I have: is there a way I can kick off multiple batches in parallel after we load into memory so that there is parallel processing happening? What's the approach? You did mention that concurrent batches against the same table will be slower than a single process writing to the table alone, but we have ETL processes that write concurrently to one table, which is not an issue on the db end. Looking for sample code/an approach to help implement the parallel kick-off after loading into memory. Please advise. – StarFish Nov 12 '18 at 16:58
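
A rough sketch of the parallel kick-off asked about here, assuming the rows are already parsed into memory and reusing the getYourDatabaseConnectionSomehow() placeholder from the answer above; as noted in the comments, concurrent inserts into a single table may still be serialized by the database, so measure before adopting this:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ParallelBatchInsertSketch {

        static final String SQL = "INSERT INTO some_table(column1, column2) VALUES (?,?)";

        // rows: records already parsed into memory; batchSize: e.g. 1000; threads: e.g. 4
        public static void insertInParallel(List<String[]> rows, int batchSize, int threads) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            List<Future<?>> futures = new ArrayList<>();

            for (int from = 0; from < rows.size(); from += batchSize) {
                final List<String[]> chunk = rows.subList(from, Math.min(from + batchSize, rows.size()));
                futures.add(pool.submit(() -> {
                    // each task gets its own connection and statement so nothing is shared between threads
                    try (Connection con = getYourDatabaseConnectionSomehow();
                         PreparedStatement stmt = con.prepareStatement(SQL)) {
                        for (String[] row : chunk) {
                            stmt.setObject(1, row.length > 0 ? row[0] : null);
                            stmt.setObject(2, row.length > 1 ? row[1] : null);
                            stmt.addBatch();
                        }
                        stmt.executeBatch();
                    }
                    return null;
                }));
            }

            for (Future<?> f : futures) {
                f.get(); // rethrows any failure from the insert tasks
            }
            pool.shutdown();
        }

        static Connection getYourDatabaseConnectionSomehow() {
            // placeholder, as in the answer above
            throw new UnsupportedOperationException("provide a real connection here");
        }
    }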

Use the Commons CSV Library by Apache.
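
For example, a minimal sketch assuming a UTF-8 file with a header row (the path and column index are placeholders):

    import java.io.IOException;
    import java.io.Reader;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    import org.apache.commons.csv.CSVFormat;
    import org.apache.commons.csv.CSVParser;
    import org.apache.commons.csv.CSVRecord;

    public class CommonsCsvSketch {
        public static void main(String[] args) throws IOException {
            try (Reader reader = Files.newBufferedReader(
                    Paths.get("/path/to/your.csv"), StandardCharsets.UTF_8);
                 CSVParser parser = CSVFormat.DEFAULT.withFirstRecordAsHeader().parse(reader)) {
                for (CSVRecord record : parser) {
                    String firstColumn = record.get(0);
                    // map each record to your object and add it to a JDBC batch,
                    // as in the batching answer above
                }
            }
        }
    }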

Jerin Joseph

Streaming with Apache Commons IO:

    try (LineIterator it = FileUtils.lineIterator(theFile, "UTF-8")) {
        while (it.hasNext()) {
            String line = it.nextLine();
            // do something with line
        }
    }

Sergey Nemchinov