
I want to parse a large CSV file as quickly and efficiently as possible.

Currently, I am using the openCSV library to parse my CSV file, but it takes approximately 10 seconds to parse a CSV file that has 10,776 records with 24 headings, and I want to parse CSV files with millions of records.

<dependency>
  <groupId>com.opencsv</groupId>
  <artifactId>opencsv</artifactId>
  <version>4.1</version>
</dependency>

I am parsing with the openCSV library using the code snippet below.

public <T> List<T> convertStreamToObject(InputStream inputStream, Class<T> clazz) throws IOException {
        HeaderColumnNameMappingStrategy<T> ms = new HeaderColumnNameMappingStrategy<>();
        ms.setType(clazz);

        // try-with-resources closes the reader (and the underlying stream) even if parsing fails
        try (Reader reader = new InputStreamReader(inputStream)) {
            CsvToBean<T> cb = new CsvToBeanBuilder<T>(reader)
                    .withType(clazz)
                    .withMappingStrategy(ms)
                    .withSkipLines(0)
                    .withSeparator('|')
                    .withFieldAsNull(CSVReaderNullFieldIndicator.EMPTY_SEPARATORS)
                    .withThrowExceptions(true)
                    .build();
            return cb.parse();
        }
    }

I am looking for suggestions for another way to parse a CSV file with millions of records in less time.

--- updated the answer ----

try (Reader reader = new InputStreamReader(in);
     CSVParser csvParser = new CSVParser(reader, CSVFormat.DEFAULT
             .withFirstRecordAsHeader()
             .withDelimiter('|')
             .withIgnoreHeaderCase()
             .withTrim())) {
    List<CSVRecord> recordList = csvParser.getRecords();
    for (CSVRecord csvRecord : recordList) {
        csvRecord.get("headername");
    }
}
David
  • Try `BufferedInputStreamReader` – K.Nicholas Jun 05 '19 at 05:29
  • @K.Nicholas I’m very sure that openCSV is smart enough to enable buffering one way or another if needed. – Holger Jun 05 '19 at 09:32
  • @Holger - not me that asked the question. – K.Nicholas Jun 05 '19 at 12:31
  • 2
    @K.Nicholas but you are the one who supposed to use `BufferedInputStreamReader`, which doesn’t gain anything, unless you assume that openCSV fails to enable buffering on its own. I [just looked it up](https://sourceforge.net/p/opencsv/source/ci/master/tree/src/main/java/com/opencsv/CSVReader.java#l108), `this.br = (reader instanceof BufferedReader ? (BufferedReader) reader : new BufferedReader(reader));`, so the OP doesn’t need to test with any buffered stream or reader, openCSV does already do that… – Holger Jun 05 '19 at 13:23
  • @Holger - great, helpful answer. What do you think about processing 10K records in 10 seconds? Probably 1K records per second for a single thread is reasonable, though it seems a little slow. The general answer for David then is to either test another library or try to handle the slow parts in parallel. Standard stuff. – K.Nicholas Jun 05 '19 at 13:29
  • 1
    @K.Nicholas what is better, letting the OP try something that’s predictably no solution, or no answer at all? I don’t know, whether a better performance is possible in the OP’s case and where the bottleneck lies. That’s what profiling tools are for. Perhaps, it’s not the I/O but the Reflection magic that converts the CSV lines to instances of the `Class` argument. Perhaps, a different library performs better. Not enough information to answer that. The only thing that can be said for sure, is that additional buffering won’t help. – Holger Jun 05 '19 at 13:34
  • @Holger, yeah, got that. Like I said, not my question. Perhaps you should have left the exercise of looking up the buffering aspect to the OP. If you want to spend all day responding to off-the-cuff suggestions, help yourself. – K.Nicholas Jun 05 '19 at 13:38
  • @K.Nicholas I think, there are enough exercises left to the OP. If the OP wants to go the route of implementing a CSV parser manually, it could use pattern matching based on [this answer](https://stackoverflow.com/a/52062570/2711488). But due to the amount of corner cases and testing overhead, the usual advice is to try existing libraries first… – Holger Jun 05 '19 at 13:51
  • @Holger - exactly. – K.Nicholas Jun 05 '19 at 13:51
  • Thanks, Holger and K.Nicholas, for the detailed discussion on this. Can you please suggest a different library that you are aware of to improve performance for parsing a large CSV file? Just to add: I want to keep the records in the same order as they appear in the CSV. – David Jun 05 '19 at 23:23
  • @K.Nicholas You either mean `BufferedInputStream` or `BufferedReader`. There is no such class as `BufferedInputStreamReader`. – user207421 Jun 06 '19 at 04:22
  • @OP You can read millions of lines per second with `BufferedReader`, so the library you are using must be the bottleneck here. – user207421 Jun 06 '19 at 04:24
  • Possible duplicate of [Fast CSV parsing](https://stackoverflow.com/questions/6857248/fast-csv-parsing) – Basil Bourque Jun 06 '19 at 05:05
  • 1
    I added [an Answer](https://stackoverflow.com/a/56471153/642706) to [this original](https://stackoverflow.com/q/6857248/642706) of your duplicate Question. I used *Apache Commons CSV* to write and read/parse a million rows. The rows were similar to what you describe: 24 columns of an integer, an `Instant`, and 22 `UUID` columns as canonical hex strings. It takes 10 seconds to merely read the 850 MB file, and another two to parse the cell values back to objects. Doing ten thousand took about half a second versus the 10 seconds you reported, a 20-fold speedup. – Basil Bourque Jun 06 '19 at 05:09
  • Thank you @BasilBourque, I have tried the Apache Commons CSV library and the performance is amazing. I am able to parse 10k records in 300 ms. I have updated the question with the answer as well; can you please have a look? One question: for parsing, do we have to call csvRecord.get("HeadingName") after getting the data, or is there an annotation like openCSV's to bind CSV data to a bean? Also, if I want a field value to be null when the separators are empty, do I have to add a check like csvRecord.get("HeadingName").isEmpty() ? null : csvRecord.get("HeadingName")? – David Jun 06 '19 at 07:16
  • @David Regarding how Stack Overflow works: (a) You should post your own Answer rather than put a solution inside your Question. (b) To ask another question, post another Question. – Basil Bourque Jun 06 '19 at 14:37
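As a rough sanity check of the BufferedReader throughput claim in the comments, here is a minimal stdlib-only sketch. It is not a real CSV parser: the naive `'|'` split ignores quoting and escaping, and the in-memory sample merely stands in for the actual file.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;

class SplitBenchmark {

    // Naive '|'-delimited splitter; the -1 limit keeps trailing empty fields.
    static List<String> splitLine(String line) {
        return Arrays.asList(line.split("\\|", -1));
    }

    public static void main(String[] args) throws Exception {
        // Build a small in-memory sample standing in for the real file.
        StringBuilder sb = new StringBuilder("id|name|value\n");
        for (int i = 0; i < 100_000; i++) {
            sb.append(i).append("|x|").append('\n');
        }

        long start = System.nanoTime();
        int rows = 0;
        try (BufferedReader br = new BufferedReader(new StringReader(sb.toString()))) {
            String line;
            while ((line = br.readLine()) != null) {
                splitLine(line);
                rows++;
            }
        }
        long ms = (System.nanoTime() - start) / 1_000_000;
        System.out.println(rows + " rows in " + ms + " ms");
    }
}
```

If raw reading and splitting like this is fast but the library call is slow, the bottleneck is likely in the bean mapping (reflection), not the I/O, which matches Holger's point above.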

1 Answer


Answer

try (Reader reader = new InputStreamReader(in);
     CSVParser csvParser = new CSVParser(reader, CSVFormat.DEFAULT
             .withFirstRecordAsHeader()
             .withDelimiter('|')
             .withIgnoreHeaderCase()
             .withTrim())) {
    List<CSVRecord> recordList = csvParser.getRecords();
    for (CSVRecord csvRecord : recordList) {
        csvRecord.get("headername");
    }
}
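For millions of records, it may be worth iterating the parser lazily instead of calling `getRecords()`, which materializes every record in memory at once. A sketch, assuming Apache Commons CSV on the classpath; the header name `headername` and the empty-to-null handling (mirroring openCSV's `EMPTY_SEPARATORS` option discussed in the comments) are illustrative:

```java
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

class StreamingCsvExample {

    static void streamRecords(InputStream in) throws Exception {
        try (Reader reader = new InputStreamReader(in);
             CSVParser csvParser = new CSVParser(reader, CSVFormat.DEFAULT
                     .withFirstRecordAsHeader()
                     .withDelimiter('|')
                     .withIgnoreHeaderCase()
                     .withTrim())) {
            // CSVParser is Iterable<CSVRecord>: records are read one at a time,
            // in file order, so memory usage stays flat even for millions of rows.
            for (CSVRecord csvRecord : csvParser) {
                String value = csvRecord.get("headername");
                // Treat empty fields as null, roughly like openCSV's EMPTY_SEPARATORS.
                if (value != null && value.isEmpty()) {
                    value = null;
                }
                // ... map value onto your bean here ...
            }
        }
    }
}
```

Note that unlike `getRecords()`, the records are only valid while iterating; copy out any values you need to keep.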
David