
I have a plain data file on a traditional file system. The file is 4 GB in size and looks like this:

ID, Name, Role, Dept
1, Patrick, 2, 3
2, Emanuel, 1, 5
3, Mike, 1, 5

I have to transform the file into:

ID, Name, Role, Dept
1, Patrick, Operator, Grinding
2, Emanuel, Assistant, HR
3, Mike, Assistant, HR

The above data needs to be saved to another file.

What is the best approach to avoid an OutOfMemoryError?

I understand that I need to use something like Scanner to read the file, but how do I store the intermediate output (the transformed data) in a Map-like object, which would keep growing enormously as each row is added?

Parijat Bose
  • Read the file line by line and write to the new file line by line at the same time. That way you don't need to store more than a line at a time in memory (see the sketch after these comments). – assylias Mar 09 '18 at 12:38
  • Why do you want to put it in a map ? Do you want to arrange the entries by Role ? – Alexander Petrov Mar 09 '18 at 12:40
  • You can use a [RandomAccessFile](https://docs.oracle.com/javase/7/docs/api/java/io/RandomAccessFile.html) to read your file in parts, make the manipulation and store. You can also split your file into smaller files and use multithreading to achieve your goal. – Skynet Mar 09 '18 at 12:41
  • https://stackoverflow.com/questions/2356137/read-large-files-in-java may this link help you – Jay Shankar Gupta Mar 09 '18 at 12:42
  • I just typed an answer that involves creating a 2nd file from a 1st file line by line. It's nice and short and uses the proper Java 8 features: https://stackoverflow.com/questions/49193228/counting-number-of-words-start-with-uppercase-letter-in-strings-java/49193955#49193955 – Dreamspace President Mar 09 '18 at 12:54
  • @assylias - The way you mentioned will result in an enormous number of I/O hits, which is not a good approach at all. – Parijat Bose Mar 13 '18 at 09:49
  • @ParijatBose given you use some sensible buffering-mechanism you will not have more (physical) IOs with assylias' approach than when trying to write the whole file at once. And it is the way to go if you are worried about memory-consumption. – piet.t Mar 13 '18 at 14:00
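
A minimal sketch of the line-by-line approach suggested in the comments, assuming the code-to-name mappings fit comfortably in memory and the CSV contains no quoted fields (the file names, class name and mapping values below are only illustrative):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

public class StreamingTransform {

    public static void main(String[] args) throws IOException {
        Path input = Paths.get("employees.csv");              // hypothetical input path
        Path output = Paths.get("employees_transformed.csv"); // hypothetical output path

        // Small lookup tables; only these live on the heap, never the 4 GB file.
        Map<String, String> roles = new HashMap<>();
        roles.put("1", "Assistant");
        roles.put("2", "Operator");

        Map<String, String> depts = new HashMap<>();
        depts.put("3", "Grinding");
        depts.put("5", "HR");

        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {

            String header = reader.readLine(); // copy the header row unchanged
            writer.write(header);
            writer.newLine();

            String line;
            while ((line = reader.readLine()) != null) {
                if (line.trim().isEmpty()) {
                    continue; // skip blank lines
                }
                String[] cols = line.split(",\\s*");
                cols[2] = roles.getOrDefault(cols[2], cols[2]); // e.g. 2 -> Operator
                cols[3] = depts.getOrDefault(cols[3], cols[3]); // e.g. 5 -> HR
                writer.write(String.join(", ", cols));
                writer.newLine();
            }
        }
    }
}

Reading and writing through buffered streams keeps memory usage constant regardless of file size, and the buffering means the per-line writes do not translate into one physical I/O each.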

2 Answers


If you really need to use Java, then try to load it into an H2 database:

CREATE TABLE TEST AS SELECT * FROM CSVREAD('test.csv');

You can then use SQL to transform the table and write it to another CSV:

CALL CSVWRITE('test2.csv', 'SELECT * FROM TEST2');

More info: http://www.h2database.com/html/tutorial.html#csv
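
If you want to drive this from Java rather than the H2 console, a rough sketch over plain JDBC might look like the following (assuming the H2 driver is on the classpath; the database file name and the TEST2 transformation are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class H2CsvImport {

    public static void main(String[] args) throws SQLException {
        // A file-based H2 database keeps the imported table on disk rather than on the heap.
        try (Connection conn = DriverManager.getConnection("jdbc:h2:./csvdemo", "sa", "");
             Statement st = conn.createStatement()) {

            // Import the CSV into a table.
            st.execute("CREATE TABLE TEST AS SELECT * FROM CSVREAD('test.csv')");

            // TEST2 would be produced by whatever SQL maps the Role/Dept codes to names,
            // e.g. a join against small lookup tables created beforehand.
            // st.execute("CREATE TABLE TEST2 AS SELECT ... FROM TEST ...");

            // Export the transformed table to a new CSV.
            st.execute("CALL CSVWRITE('test2.csv', 'SELECT * FROM TEST2')");
        }
    }
}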

activout.se
  • You are not serious. To involve a whole database for such a simple task :) – Alexander Petrov Mar 09 '18 at 12:42
  • And H2 is not even that good at importing large files. Over a certain size I always got out-of-memory errors. – takacsot Sep 11 '19 at 20:38
  • @takacsot Give more memory to your Java process then :) – activout.se Sep 12 '19 at 09:20
  • I could not (hmm, I do not want to) add more, especially since SQLite handles that without any issue or significant memory usage :) It is simply an implementation mistake, which is kind of a "blocker" for me to use it in production. I would be happy if someone would revisit that code in H2. – takacsot Sep 12 '19 at 11:44

Use a proper (and fast) CSV parser. With univocity-parsers the entire process should take a few seconds.

First create a RowProcessor that will receive each row parsed from the input, transform it and write the result to a given output.

public RowProcessor createProcessor(final File output, final String encoding){
    CsvWriterSettings outputSettings = new CsvWriterSettings();
    //configure the CSV writer - format and other settings.

    //create a writer for the output you want, with the given settings and encoding.
    final CsvWriter writer = new CsvWriter(output, encoding, outputSettings);

    return new com.univocity.parsers.common.processor.RowProcessor(){
        private Map<String, String> roleMap;
        private Map<String, String> deptMap;

        @Override
        public void processStarted(ParsingContext context) {
            roleMap = buildMapOfRoles();
            deptMap = buildMapOfDepartments();
        }

        @Override
        public void rowProcessed(String[] row, ParsingContext context) {
            row[2] = roleMap.get(row[2]);
            row[3] = deptMap.get(row[3]);

            writer.writeRow(row);
        }

        @Override
        public void processEnded(ParsingContext context) {
            writer.close();
        }
    };
}

Then run the parser with this:

String encoding = "UTF-8";
File input = new File("/path/to/input.csv");
File output = new File("/path/to/output.csv");

RowProcessor processor = createProcessor(output, encoding);
CsvParserSettings parserSettings = new CsvParserSettings();
parserSettings.setProcessor(processor);
//configure the parser settings as needed.

//then run the parser. It will submit all rows to the processor created above.
new CsvParser(parserSettings).parse(input, encoding);

All rows will be submitted to your processor, which writes each transformed row directly to the output.

Here is my amazing implementation of buildMapOfRoles and buildMapOfDepartments:

private Map<String, String> buildMapOfRoles(){
    Map<String,String> out = new HashMap<>();
    out.put("2", "Operator");
    out.put("1", "Assistant");
    return out;
}

private Map<String, String> buildMapOfDepartments(){
    Map<String,String> out = new HashMap<>();
    out.put("3", "Grinding");
    out.put("5", "HR");
    return out;
}

This will produce the exact output you expect. Hope this helps.

Disclaimer: I'm the author of this library. It's open source and free (Apache 2.0 license)

Jeronimo Backes