
Introduction

I am building a process to merge a few large sorted CSV files. I am currently looking into using Univocity to do this. The way I set up the merge is to use beans that implement the Comparable interface.

Given

The simplified file looks like this:

id,data
1,aa
2,bb
3,cc

The bean looks like this (getters and setters omitted):

import com.univocity.parsers.annotations.Parsed;

public class Address implements Comparable<Address> {

    @Parsed
    private int id;
    @Parsed
    private String data;        

    @Override
    public int compareTo(Address o) {
        return Integer.compare(this.getId(), o.getId());
    }
}

The comparator looks like this:

import java.util.Comparator;

public class AddressComparator implements Comparator<Address> {

    @Override
    public int compare(Address a, Address b) {
        if (a == null)
            throw new IllegalArgumentException("argument object a cannot be null");
        if (b == null)
            throw new IllegalArgumentException("argument object b cannot be null");
        return Integer.compare(a.getId(), b.getId());
    }
}

As I do not want to read all the data into memory, I want to read the top record of each file and execute some compare logic. Here is my simplified example:

import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import com.univocity.parsers.common.processor.BeanListProcessor;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;

public class App {
    
    private static final String INPUT_1 = "src/test/input/address1.csv";
    private static final String INPUT_2 = "src/test/input/address2.csv";
    private static final String INPUT_3 = "src/test/input/address3.csv";
    
    public static void main(String[] args) throws FileNotFoundException {       
        BeanListProcessor<Address> rowProcessor = new BeanListProcessor<Address>(Address.class);
        CsvParserSettings parserSettings = new CsvParserSettings();
        parserSettings.setRowProcessor(rowProcessor);       
        parserSettings.setHeaderExtractionEnabled(true);
        CsvParser parser = new CsvParser(parserSettings);       
        
        List<FileReader> readers = new ArrayList<>();
        readers.add(new FileReader(new File(INPUT_1)));
        readers.add(new FileReader(new File(INPUT_2)));
        readers.add(new FileReader(new File(INPUT_3)));
        
        // This parses all rows, but I am only interested in getting 1 row as a bean.
        for (FileReader fileReader : readers) {
            parser.parse(fileReader);
            List<Address> beans = rowProcessor.getBeans();
            for (Address address : beans) {
                System.out.println(address.toString());
            }           
        }
        
        // want to have a map with the reader and the first bean object
        // Map<FileReader, Address> topRecordofReader = new HashMap<>();
        Map<FileReader, String[]> topRecordofReader = new HashMap<>();
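        // note: the FileReaders above were already consumed by the first loop,
        // so they would need to be reopened before this second pass would read anything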
        for (FileReader reader : readers) {
            parser.beginParsing(reader);
            String[] row;
            while ((row = parser.parseNext()) != null) {
               System.out.println(row[0]); 
               System.out.println(row[1]); 
               topRecordofReader.put(reader, row);
               // all done, only want to get first row
               break;        
            }
        }       
    }   
}

Question

Given above example, how do I parse in such a way that it iterates over each row and returns a bean per row, instead of parsing the whole file?

I am looking for something like this (this non-working code is just to indicate the kind of solution I am looking for):

for (FileReader fileReader : readers) {
    parser.beginParsing(fileReader);
    Address bean;
    while ((bean = parser.parseNextRecord()) != null) {
        topRecordofReader.put(fileReader, bean);
    }
}
– Sander_M

1 Answer


There are two approaches to read iteratively instead of loading everything into memory. The first one is to use a BeanProcessor instead of a BeanListProcessor:

settings.setRowProcessor(new BeanProcessor<Address>(Address.class) {
    @Override
    public void beanProcessed(Address address, ParsingContext context) {
        // your code to process each parsed object here!
    }
});
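
With that processor set, calling parse still runs through the whole input, but each Address is handed to the callback as soon as its row is read, so only one bean needs to be held at a time. A minimal usage sketch, assuming the settings object above and the INPUT_1 constant from the question:

    // Every parsed row is converted to an Address and passed straight to
    // beanProcessed(...); no list of beans accumulates in memory.
    CsvParser parser = new CsvParser(settings);
    parser.parse(new File(INPUT_1));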

To read beans iteratively without a callback (and to perform some other common processes), we created a CsvRoutines class (which extends from AbstractRoutines; more examples here: https://www.univocity.com/pages/univocity_parsers_routines.html):

    File input = new File("/path/to/your.csv")

    CsvParserSettings parserSettings = new CsvParserSettings();
    //...configure the parser

    // You can also use TSV and Fixed-width routines
    CsvRoutines routines = new CsvRoutines(parserSettings); 
    for (Address address : routines.iterate(Address.class, input, "UTF-8")) {
        //process your bean
    }
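
For the original goal of merging several large sorted files, the iterators returned by iterate can drive a classic k-way merge: keep each file's current smallest bean in a priority queue and, after emitting the overall smallest, pull the next bean from the file it came from. A minimal sketch, assuming the Address bean from the question and placeholder file names (creating one CsvRoutines per file is the conservative choice while several iterators are open at once):

    import java.io.File;
    import java.util.Iterator;
    import java.util.PriorityQueue;

    import com.univocity.parsers.csv.CsvParserSettings;
    import com.univocity.parsers.csv.CsvRoutines;

    public class SortedFileMerge {

        // Pairs a file's current smallest bean with the iterator it came from.
        private static class Entry {
            final Address bean;
            final Iterator<Address> source;

            Entry(Address bean, Iterator<Address> source) {
                this.bean = bean;
                this.source = source;
            }
        }

        public static void main(String[] args) {
            CsvParserSettings settings = new CsvParserSettings();
            settings.setHeaderExtractionEnabled(true);

            // Placeholder inputs; each file must already be sorted by id.
            String[] inputs = {"address1.csv", "address2.csv", "address3.csv"};

            // Order entries by the beans' natural order (Address implements Comparable).
            PriorityQueue<Entry> heap = new PriorityQueue<>((a, b) -> a.bean.compareTo(b.bean));

            // Seed the heap with the first bean of each file.
            for (String input : inputs) {
                Iterator<Address> it = new CsvRoutines(settings)
                        .iterate(Address.class, new File(input), "UTF-8").iterator();
                if (it.hasNext()) {
                    heap.add(new Entry(it.next(), it));
                }
            }

            // Repeatedly emit the globally smallest bean and refill from its file.
            while (!heap.isEmpty()) {
                Entry smallest = heap.poll();
                System.out.println(smallest.bean); // write to the merged output here
                if (smallest.source.hasNext()) {
                    heap.add(new Entry(smallest.source.next(), smallest.source));
                }
            }
        }
    }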

Hope this helps!

– Jeronimo Backes
  • Answered by (correct me if I am wrong) the lead developer on Univocity, I love this community. Thank you very much for this detailed and great answer. I am using Univocity parsers more and more in my projects for they are so easy to configure. I am looking forward to coding my little "merge big files fast" project with the use of Univocity. – Sander_M Jun 27 '16 at 22:36
  • Your parser is great, but you should add the above example of HOW to get a Java bean to the help doc where you describe how to build a Java bean: https://www.univocity.com/pages/java_beans.html (Also, what you call Java beans are actually POJOs...) – Nick Nov 04 '18 at 23:50
  • 1
    @nick The page you linked is all about the annotations that can be used by all parsers we make, not only the univocity-parsers. The example showing how to iterate over beans is presented here: https://www.univocity.com/pages/univocity_parsers_routines.html. Also, a POJO is a Java object that doesn't have a requirement to use of particular annotations in order to be compatible with a framework. – Jeronimo Backes Nov 05 '18 at 03:14
  • Is `CsvRoutines` thread-safe for re-use? – TheRealChx101 Sep 27 '21 at 21:52