19

I have a Java server app that downloads a CSV file and parses it. The parsing can take from 5 to 45 minutes, and happens each hour. This method is the bottleneck of the app, so it's not premature optimization. The code so far:

        client.executeMethod(method);
        InputStream in = method.getResponseBodyAsStream(); // this is http stream

        String line;
        String[] record;

        reader = new BufferedReader(new InputStreamReader(in), 65536);

        try {
            // read the header line
            line = reader.readLine();
            // some code
            while ((line = reader.readLine()) != null) {
                 // more code

                 line = line.replaceAll("\"\"", "\"NULL\"");

                 // Now remove all of the quotes
                 line = line.replaceAll("\"", "");     


             if (!line.startsWith("ERROR")) {
                   //bla bla 
                    continue;
                 }

                 record = line.split(",");
                 //more error handling
                 // build the object and put it in HashMap
         }
         //exceptions handling, closing connection and reader

Is there any existing library that would help me speed things up? Can I improve the existing code?

sudmong
Lukasz Madon
  • How big is the file? And have you tried profiling your code? This will give you your bottleneck and a clear idea where to improve. I would not be surprised if it is your network being the prime issue. Also have a look at http://commons.apache.org/sandbox/csv/ instead of building the parser yourself. – joostschouten Jul 28 '11 at 10:13
  • I'm profiling right now and I'm aware that a big chunk of time is caused by the network connection. I want to improve parsing first, because improving the network would require changing the architecture. (My estimation is that faster parsing can improve the loading time by 10-15%.) – Lukasz Madon Jul 28 '11 at 10:21
  • Sounds fair. Do use a csv parser as these are optimized already and you are bound to run into escaping and i18n issues down the road which you don't want to worry about. Good luck. – joostschouten Jul 28 '11 at 10:24
  • Seems that the current method is fast enough; it takes 2 seconds to parse one file, so in total parsing all the files takes less than 1% of the time ;/ – Lukasz Madon Jul 28 '11 at 10:40
  • See also http://stackoverflow.com/questions/3908012/parsing-csv-in-java – Raedwald Apr 03 '14 at 13:15
  • No idea if it's applicable for your domain, but I would consider shell scripting for basic processing (e.g. `replaceAll` in your code). **SED** / **AWK** are actually designed for this stuff, don't think you can get much faster than that. JVM would then 'only' need to parse the pre-processed data. But again, this might not be applicable. – Jan Groth Nov 27 '14 at 22:16

9 Answers

20

Apache Commons CSV

Have you seen Apache Commons CSV?
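As a rough sketch of what that might look like for the code in the question (assuming the commons-csv artifact is on the classpath; exact method names vary slightly between versions), the library handles quoting and embedded commas in a single pass, replacing the `replaceAll`/`split` work:

```java
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class CommonsCsvSketch {
    // Parse the HTTP response stream in one pass; quoting and embedded
    // commas are handled by the library rather than replaceAll/split.
    static void parse(InputStream in) throws IOException {
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
        try (CSVParser parser = CSVFormat.DEFAULT.withFirstRecordAsHeader().parse(reader)) {
            for (CSVRecord record : parser) {
                // build the object and put it in the HashMap, as before
            }
        }
    }
}
```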

Caveat On Using split

Bear in mind that on older JVMs (before Java 7u6) split returned substrings sharing the original string's backing array, meaning the original line object was not eligible for garbage collection while a reference to any of those substrings existed. Making a defensive copy with new String(...) helped in that case. (Java bug report)

It also is not reliable in grouping escaped CSV columns containing commas
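A quick illustration of that second caveat (plain Java, no library assumed):

```java
public class SplitCaveat {
    public static void main(String[] args) {
        // A quoted field containing a comma is torn apart by a naive split:
        String line = "1,\"Smith, John\",42";
        String[] parts = line.split(",");
        // 4 parts instead of the 3 logical columns
        System.out.println(parts.length); // prints 4
    }
}
```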

OneCricketeer
Jeff Foster
    `String.split()` uses `String.substring()`, which hasn't returned views in a long, long time (https://stackoverflow.com/questions/33893655/string-substring-making-a-copy-of-the-underlying-char-value) – Malt Dec 14 '20 at 07:40
13

opencsv

Take a look at opencsv.

This blog post, opencsv is an easy CSV parser, has example usage.
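A minimal sketch of that usage (assuming the opencsv 2.x line, where the package is `au.com.bytecode.opencsv`; the newer 5.x releases move it to `com.opencsv` and declare an extra checked exception on `readNext`):

```java
import au.com.bytecode.opencsv.CSVReader; // package name in the 2.x line

import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

public class OpenCsvSketch {
    static void parse(InputStream in) throws IOException {
        CSVReader reader = new CSVReader(new InputStreamReader(in));
        try {
            String[] record;
            while ((record = reader.readNext()) != null) {
                // record[] already has quotes stripped and embedded commas kept
            }
        } finally {
            reader.close();
        }
    }
}
```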

Basil Bourque
flash
6

The problem with your code is that it uses replaceAll and split, which are very costly operations. You should definitely consider using a CSV parser/reader that does a one-pass parse.

There is a benchmark on github

https://github.com/uniVocity/csv-parsers-comparison

that unfortunately was run under Java 6. The numbers are slightly different under Java 7 and 8. I'm trying to get more detailed data for different file sizes, but it's a work in progress;

see https://github.com/arnaudroger/csv-parsers-comparison
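For reference, the uniVocity parser that tops that benchmark can be driven roughly like this (a sketch assuming the univocity-parsers artifact is available; the class names below are from its documented API):

```java
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.List;

public class UnivocitySketch {
    static List<String[]> parse(InputStream in) {
        CsvParserSettings settings = new CsvParserSettings();
        settings.setHeaderExtractionEnabled(true); // skip the header line
        CsvParser parser = new CsvParser(settings);
        // single pass over the stream; no per-line replaceAll/split
        return parser.parseAll(new InputStreamReader(in));
    }
}
```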

user3996996
5

Apart from the suggestions made above, I think you can try improving your code with threading and concurrency.

Following is a brief analysis and a suggested solution:

  1. From the code it seems that you are reading the data over the network (most likely via the apache-common-httpclient lib).
  2. You need to make sure that the bottleneck you describe is not in the data transfer over the network.
  3. One way to check is to just dump the data to a file (without parsing) and see how long that takes. This will tell you how much time is actually spent parsing (compared to your current observation).
  4. Now have a look at how the java.util.concurrent package is used. Some of the links you can use are (1, 2).
  5. The tasks that you are doing in the loop can be executed in threads.
  6. Using a thread pool and concurrency will greatly improve your performance.

Though the solution involves some effort, in the end it will surely help you.
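The steps above can be sketched as a simple pipeline, pure JDK: one thread stays on the network I/O while a small pool does the per-line parsing. Here `parseLine` is a hypothetical stand-in for the real record-building logic, and the input is an in-memory stand-in for the HTTP stream:

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelParseSketch {

    // Stand-in for the real per-line parsing / record building.
    static String[] parseLine(String line) {
        return line.split(",");
    }

    public static void main(String[] args) throws Exception {
        // In the real app this reader wraps the HTTP response stream.
        BufferedReader reader = new BufferedReader(
                new StringReader("key,value\na,1\nb,2\nc,3\n"));
        reader.readLine(); // skip the header line

        ExecutorService pool = Executors.newFixedThreadPool(4);
        Map<String, String[]> records = new ConcurrentHashMap<>();
        List<Future<?>> pending = new ArrayList<>();

        String line;
        while ((line = reader.readLine()) != null) {
            final String l = line;
            // The reading thread stays on I/O; parsing happens in the pool.
            pending.add(pool.submit(() -> {
                String[] rec = parseLine(l);
                records.put(rec[0], rec); // concurrent map instead of a plain HashMap
            }));
        }
        for (Future<?> f : pending) {
            f.get(); // wait for all parse tasks to finish
        }
        pool.shutdown();

        System.out.println(records.size()); // prints 3
    }
}
```

Note that if the network is the real bottleneck, the pool will mostly sit idle; measure first, as point 3 suggests.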

Santosh
2

opencsv

You should have a look at OpenCSV. I would expect that they have performance optimizations.

Basil Bourque
Kai
  • We had a very bad experience with opencsv. We found it to be both slow and buggy, ended up wasting half a day and replacing it altogether. – Guy May 09 '12 at 19:27
  • ok... you may want to add more details to make this information relevant. What problems did you have? Which version did you use? Which other framework did you choose? I'm just wondering because I've seen it in more than one project where it did a good job. – Kai May 09 '12 at 20:24
  • The main problem was that it was returning the wrong number of fields (i.e. I got a 2-field String[] on a 10-field line) for certain lines. I never got to understand why it happened, but I am guessing it relates somehow to bad UTF-8 parsing. I replaced it with my own read-line-by-line, String.split-each-line code (I realize there are memory considerations here), which ended up running between 15%-30% faster. I was using opencsv v2.3 (Java). – Guy May 13 '12 at 08:02
1

A little late here, but there are now a few benchmarking projects for CSV parsers. Your selection will depend on the exact use case (i.e. raw data vs. data binding, etc.).

ThomasRS
1

Quirk-CSV


The new kid on the block. It uses Java annotations and is built on apache-csv, which is one of the faster libraries out there for CSV parsing.

This library is also thread safe, so if you want to re-use the CSVProcessor, you can and should.

Example:

Pojo

@CSVReadComponent(type = CSVType.NAMED)
@CSVWriteComponent(type = CSVType.ORDER)
public class Pojo {
    @CSVWriteBinding(order = 0)
    private String name;

    @CSVWriteBinding(order = 1)
    @CSVReadBinding(header = "age")
    private Integer age;

    @CSVWriteBinding(order = 2)
    @CSVReadBinding(header = "money")
    private Double money;

    @CSVReadBinding(header = "name")
    public void setA(String name) {
        this.name = name;
    }

    @Override
    public String toString() {
        return "Name: " + name + System.lineSeparator() + "\tAge: " + age + System.lineSeparator() + "\tMoney: "
                + money;
    }
}

Main

import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.*;


public class SimpleMain {
    public static void main(String[] args) {
        String csv = "name,age,money" + System.lineSeparator() + "Michael Williams,34,39332.15";

        CSVProcessor processor = new CSVProcessor(Pojo.class);
        List<Pojo> list = new ArrayList<>();
        try {
            list.addAll(processor.parse(new StringReader(csv)));
            list.forEach(System.out::println);

            System.out.println();

            StringWriter sw = new StringWriter();
            processor.write(list, sw);
            System.out.println(sw.toString());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Since this is built on top of apache-csv, you can use the powerful CSVFormat tool. Let's say the delimiter for the CSV is a pipe (|) instead of a comma (,); you could, for example:

CSVFormat csvFormat = CSVFormat.DEFAULT.withDelimiter('|');
List<Pojo> list = processor.parse(new StringReader(csv), csvFormat);

Another benefit is that inheritance is also considered.

There are also other examples covering reading/writing non-primitive data.

Austin Poole
0

For speed you do not want to use replaceAll, and you don't want to use regex either. What you basically always want to do in critical cases like this is write a state-machine, character-by-character parser. I've done that, rolling the whole thing into an Iterable function. It takes in the stream and parses it without saving it all out or caching it, so if you can abort early, that will work fine as well. It should also be short enough and well coded enough to make it obvious how it works.

public static Iterable<String[]> parseCSV(final InputStream stream) throws IOException {
    return new Iterable<String[]>() {
        @Override
        public Iterator<String[]> iterator() {
            return new Iterator<String[]>() {
                static final int UNCALCULATED = 0;
                static final int READY = 1;
                static final int FINISHED = 2;
                int state = UNCALCULATED;
                ArrayList<String> value_list = new ArrayList<>();
                StringBuilder sb = new StringBuilder();
                String[] return_value;

                public void end() {
                    end_part();
                    return_value = new String[value_list.size()];
                    value_list.toArray(return_value);
                    value_list.clear();
                }

                public void end_part() {
                    value_list.add(sb.toString());
                    sb.setLength(0);
                }

                public void append(int ch) {
                    sb.append((char) ch);
                }

                public void calculate() throws IOException {
                    boolean inquote = false;
                    while (true) {
                        int ch = stream.read();
                        switch (ch) {
                            default: //regular character.
                                append(ch);
                                break;
                            case -1: //read has reached the end.
                                if ((sb.length() == 0) && (value_list.isEmpty())) {
                                    state = FINISHED;
                                } else {
                                    end();
                                    state = READY;
                                }
                                return;
                            case '\r': //carriage return: if unquoted, ignore it and let '\n' end the line (avoids empty records on CRLF input).
                                if (inquote) {
                                    append(ch);
                                }
                                break;
                            case '\n': //end of line.
                                if (inquote) {
                                    append(ch);
                                } else {
                                    end();
                                    state = READY;
                                    return;
                                }
                                break;
                            case ',': //comma
                                if (inquote) {
                                    append(ch);
                                } else {
                                    end_part();
                                }
                                break;
                            case '"': //quote.
                                inquote = !inquote;
                                break;
                        }
                    }
                }

                @Override
                public boolean hasNext() {
                    if (state == UNCALCULATED) {
                        try {
                            calculate();
                        } catch (IOException ex) {
                            throw new UncheckedIOException(ex); // don't silently swallow I/O errors (java.io.UncheckedIOException, Java 8+)
                        }
                    }
                    return state == READY;
                }

                @Override
                public String[] next() {
                    if (state == UNCALCULATED) {
                        try {
                            calculate();
                        } catch (IOException ex) {
                            throw new UncheckedIOException(ex); // don't silently swallow I/O errors
                        }
                    }
                    state = UNCALCULATED;
                    return return_value;
                }
            };
        }
    };
}

You would typically process this quite simply:

for (String[] csv : parseCSV(stream)) {
    //<deal with parsed csv data>
}

The beauty of the API at the call site is worth the rather cryptic-looking function.

Tatarize
0

Apache Commons CSV ➙ 12 seconds for a million rows

Is there any existing library that would help me to speed up things?

Yes, the Apache Commons CSV project works very well in my experience.

Here is an example app that uses the Apache Commons CSV library to write and read rows of 24 columns: an integer sequential number, an Instant, and the rest random UUID objects.

For 10,000 rows, the writing and the reading each take about half a second. The reading includes reconstituting the Integer, Instant, and UUID objects.

My example code lets you toggle on or off the reconstituting of objects. I ran both with a million rows. This creates a file of 850 megs. I am using Java 12 on a MacBook Pro (Retina, 15-inch, Late 2013), 2.3 GHz Intel Core i7, 16 GB 1600 MHz DDR3, Apple built-in SSD.

For a million rows, ten seconds for reading plus two seconds for parsing:

  • Writing: PT25.994816S
  • Reading only: PT10.353912S
  • Reading & parsing: PT12.219364S

Source code is a single .java file. It has a write method and a read method, both called from a main method.

I opened a BufferedReader by calling Files.newBufferedReader.

package work.basil.example;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVPrinter;
import org.apache.commons.csv.CSVRecord;

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.time.Duration;
import java.time.Instant;
import java.util.UUID;

public class CsvReadingWritingDemo
{
    public static void main ( String[] args )
    {
        CsvReadingWritingDemo app = new CsvReadingWritingDemo();
        app.write();
        app.read();
    }

    private void write ()
    {
        Instant start = Instant.now();
        int limit = 1_000_000; // 10_000  100_000  1_000_000
        Path path = Paths.get( "/Users/basilbourque/IdeaProjects/Demo/csv.txt" );
        try (
                Writer writer = Files.newBufferedWriter( path, StandardCharsets.UTF_8 );
                CSVPrinter printer = new CSVPrinter( writer , CSVFormat.RFC4180 );
        )
        {
            printer.printRecord( "id" , "instant" , "uuid_01" , "uuid_02" , "uuid_03" , "uuid_04" , "uuid_05" , "uuid_06" , "uuid_07" , "uuid_08" , "uuid_09" , "uuid_10" , "uuid_11" , "uuid_12" , "uuid_13" , "uuid_14" , "uuid_15" , "uuid_16" , "uuid_17" , "uuid_18" , "uuid_19" , "uuid_20" , "uuid_21" , "uuid_22" );
            for ( int i = 1 ; i <= limit ; i++ )
            {
                printer.printRecord( i , Instant.now() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() , UUID.randomUUID() );
            }
        } catch ( IOException ex )
        {
            ex.printStackTrace();
        }
        Instant stop = Instant.now();
        Duration d = Duration.between( start , stop );
        System.out.println( "Wrote CSV for limit: " + limit );
        System.out.println( "Elapsed: " + d );
    }

    private void read ()
    {
        Instant start = Instant.now();

        int count = 0;
        Path path = Paths.get( "/Users/basilbourque/IdeaProjects/Demo/csv.txt" );
        try (
                Reader reader = Files.newBufferedReader( path , StandardCharsets.UTF_8) ;
        )
        {
            CSVFormat format = CSVFormat.RFC4180.withFirstRecordAsHeader();
            CSVParser parser = CSVParser.parse( reader , format );
            for ( CSVRecord csvRecord : parser )
            {
                if ( true ) // Toggle parsing of the string data into objects. Turn off (`false`) to see strictly the time taken by Apache Commons CSV to read & parse the lines. Turn on (`true`) to get a feel for real-world load.
                {
                    Integer id = Integer.valueOf( csvRecord.get( 0 ) ); // Annoying zero-based index counting.
                    Instant instant = Instant.parse( csvRecord.get( 1 ) );
                    for ( int i = 3 - 1 ; i <= 24 - 1 ; i++ ) // Columns 3 through 24 hold the UUIDs. Subtract one for annoying zero-based index counting.
                    {
                        UUID uuid = UUID.fromString( csvRecord.get( i ) );
                    }
                }
                count++;
                if ( count % 1_000 == 0 )  // Every so often, report progress.
                {
                    //System.out.println( "# " + count );
                }
            }
        } catch ( IOException e )
        {
            e.printStackTrace();
        }

        Instant stop = Instant.now();
        Duration d = Duration.between( start , stop );
        System.out.println( "Read CSV for count: " + count );
        System.out.println( "Elapsed: " + d );
    }
}
Basil Bourque