
I've profiled my application, and one of my biggest bottlenecks at the moment is the String.split method. It's taking up 21% of my runtime, and the other big contributors aren't parts I can streamline any more than they already are. It also seems like all of the newly created String objects are putting pressure on the garbage collector, although I'm less sure about that.

I'm reading in a gzipped file of comma-separated values containing financial data. The number of fields in each row varies depending on what kind of record it is, and the size of each field varies too. What's the fastest way to read the data in while creating the fewest intermediate objects?

I saw this thread, but none of the answers gives any evidence that OpenCSV is any faster than String.split, and they all focus on using an external library rather than writing new code. I'm also very concerned about memory overhead, because I spend another 20% or so of the total runtime doing garbage collection. I would like to just return views of the string in question, but it looks like that's not possible anymore (String.substring has copied the backing array rather than sharing it since Java 7u6).


3 Answers


A quicker way is to use a plain StringTokenizer. It doesn't have the regex overhead of split(), and it's in the JDK.
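A minimal sketch of what that might look like (the sample record is made up; note the empty-field caveat in the comment):

```java
import java.util.StringTokenizer;

public class TokenizerDemo {
    public static void main(String[] args) {
        String line = "2015-06-01,TRADE,ACME,102.50,1000"; // made-up record

        // No regex compilation, unlike String.split(",").
        // Caveat: StringTokenizer treats runs of delimiters as one,
        // so empty fields ("a,,b") are silently skipped.
        StringTokenizer tokens = new StringTokenizer(line, ",");
        while (tokens.hasMoreTokens()) {
            System.out.println(tokens.nextToken());
        }
    }
}
```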


If you do not want to use a library, then an alternative to StringTokenizer would be to write a simple state machine to parse your CSV. Tokenizers can have problems with commas embedded in fields. CSV is a reasonably simple format, so it is not difficult to build a state machine to handle it. If you know exactly what the format of the input file is, then you can simplify it even further since you will not have to deal with any possibilities not present in your specific file.
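For illustration, here is a minimal sketch of such a state machine. It assumes fields may be double-quoted to embed commas and that there are no escaped quotes inside quoted fields; a parser for your specific format may be simpler still:

```java
import java.util.ArrayList;
import java.util.List;

public class CsvStateMachine {
    private enum State { FIELD, QUOTED }

    // Parses one line into fields, reusing a single StringBuilder
    // to cut down on intermediate String garbage.
    public static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        State state = State.FIELD;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            switch (state) {
                case FIELD:
                    if (c == ',') {
                        fields.add(current.toString());
                        current.setLength(0);
                    } else if (c == '"' && current.length() == 0) {
                        state = State.QUOTED;  // quote opens a field
                    } else {
                        current.append(c);
                    }
                    break;
                case QUOTED:
                    if (c == '"') {
                        state = State.FIELD;   // closing quote
                    } else {
                        current.append(c);     // commas are literal here
                    }
                    break;
            }
        }
        fields.add(current.toString());
        return fields;
    }
}
```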

Numeric data could potentially be converted directly to int on the fly, without having to hold a large number of strings in memory simultaneously.
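For example, a field known to be a plain decimal integer could be parsed straight out of the character buffer without allocating a substring. This helper could be dropped into the class above (parseIntField and its bounds-based signature are illustrative; it assumes well-formed digits with an optional leading minus):

```java
// Parse a decimal int directly from the line buffer, avoiding
// the intermediate String that substring() would create.
static int parseIntField(CharSequence s, int start, int end) {
    int i = start, sign = 1;
    if (s.charAt(i) == '-') { sign = -1; i++; }
    int value = 0;
    for (; i < end; i++) {
        value = value * 10 + (s.charAt(i) - '0');
    }
    return sign * value;
}
```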


Use uniVocity-parsers to parse your CSV file. It is a suite of parsers for tabular text formats, and its CSV parser is the fastest CSV parser for Java (as you can see here and here). Disclosure: I am the author of this library. It's open source and free (Apache 2.0 license).
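A short sketch of reading a gzipped CSV row by row with this library (the file name "data.csv.gz" and UTF-8 charset are assumptions; adjust to your data):

```java
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;

public class UnivocityExample {
    public static void main(String[] args) throws Exception {
        CsvParserSettings settings = new CsvParserSettings();
        settings.setLineSeparatorDetectionEnabled(true); // handle \n or \r\n
        CsvParser parser = new CsvParser(settings);

        // data.csv.gz is a placeholder path for your gzipped input.
        try (Reader reader = new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(Paths.get("data.csv.gz"))),
                StandardCharsets.UTF_8)) {
            parser.beginParsing(reader);
            String[] row;
            while ((row = parser.parseNext()) != null) {
                // process row; the field count can vary per record type
            }
        }
    }
}
```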

We used the architecture provided by this framework to build a custom parser for MySQL dump files for this project. We managed to parse a 42GB dump file in 15 minutes (1+ billion rows).

It should solve your problem.
