1

I am working on CSV parser requirement and I am using supercsv parser library. My CSV file can have 25 columns(separated by tab(|)) and up to 100k rows with additional header row.

I would like to ignore white-space only lines and lines containing less than 25 columns.

I am using IcvBeanReader with name mappings(to set csv values to pojo) and field processors(to handle validations) for reading a file.

I am assuming that Supercsv IcvBeanReader will skip white space lines by default. But how to handle if a row contains less than 25 column numbers?

James Z
  • 12,209
  • 10
  • 24
  • 44
Kiran
  • 45
  • 6
  • What is a lakh? Why are you writing the word tab but the pipe character (|)? – Mad Physicist Jan 23 '16 at 02:16
  • This looks like a simpler version of [this question](http://stackoverflow.com/a/11698684/1068649), but instead of reading the lines with incorrect columns, you're ignoring them – James Bassett Jan 24 '16 at 23:03

2 Answers2

1

You can easily do this by writing your own Tokenizer.

For example, the following Tokenizer will have the same behaviour as the default one, but will skip over any lines that don't have the correct number of columns.

public class SkipBadColumnCountTokenizer extends Tokenizer {

    private final int expectedColumns;

    private final List<Integer> ignoredLines = new ArrayList<>();

    public SkipBadColumnCountTokenizer(Reader reader, 
            CsvPreference preferences, int expectedColumns) {
        super(reader, preferences);
        this.expectedColumns = expectedColumns;
    }

    @Override
    public boolean readColumns(List<String> columns) throws IOException {
        boolean moreInputExists;
        while ((moreInputExists = super.readColumns(columns)) && 
            columns.size() != this.expectedColumns){
            System.out.println(String.format("Ignoring line %s with %d columns: %s", getLineNumber(), columns.size(), getUntokenizedRow()));
            ignoredLines.add(getLineNumber());
        }

        return moreInputExists;

    }

    public List<Integer> getIgnoredLines(){
        return this.ignoredLines;
    }
}

And a simple test using this Tokenizer...

@Test
public void testInvalidRows() throws IOException {

    String input = "column1,column2,column3\n" +
            "has,three,columns\n" +
            "only,two\n" +
            "one\n" +
            "three,columns,again\n" +
            "one,too,many,columns";

    CsvPreference preference = CsvPreference.EXCEL_PREFERENCE;
    int expectedColumns = 3;
    SkipBadColumnCountTokenizer tokenizer = new SkipBadColumnCountTokenizer(
        new StringReader(input), preference, expectedColumns);

    try (ICsvBeanReader beanReader = new CsvBeanReader(tokenizer, preference)) {
        String[] header = beanReader.getHeader(true);
        TestBean bean;
        while ((bean = beanReader.read(TestBean.class, header)) != null){
            System.out.println(bean);
        }
        System.out.println(String.format("Ignored lines: %s", tokenizer.getIgnoredLines()));
    }

}

Prints the following output (notice how it's skipped all of the invalid rows):

TestBean{column1='has', column2='three', column3='columns'}
Ignoring line 3 with 2 columns: only,two
Ignoring line 4 with 1 columns: one
TestBean{column1='three', column2='columns', column3='again'}
Ignoring line 6 with 4 columns: one,too,many,columns
Ignored lines: [3, 4, 6]
James Bassett
  • 9,458
  • 4
  • 35
  • 68
  • Thanks a lot.its working fine int expectedColumns = 25; Tokenizer tokenizer = new SkipBadColumnCountTokenizer(new FileReader(file), new CsvPreference.Builder('"', '|', "\n").build(), expectedColumns); beanReader = new CsvBeanReader(tokenizer, new CsvPreference.Builder('"', '|', "\n").build()); Is there any way to capture row numbers containing bad columns ? – Kiran Jan 25 '16 at 10:56
  • Sure, you can call `getLineNumber()` in the Tokenizer - either log it or collect it into a List and retrieve it at the end. I've updated the example above to demonstrate logging. – James Bassett Jan 25 '16 at 12:51
  • Thanks again.this method is returning boolean and also its overridden method of Tokenizer . Could you please help how to capture these bad columns record numbers to a List . I need to have that list at TestInvalidRows(which you have written above) class level . Once again I am very happy for your support since after looking at your custom CSV exception handler in other post ..I managed to handle field level validation errors without throwing exceptions. – Kiran Jan 25 '16 at 13:43
  • No worries, I've just updated the example Tokenizer to demonstrate one way of capturing the line numbers as well. – James Bassett Jan 25 '16 at 23:24
  • Thanq very much for your support on this :) – Kiran Jan 26 '16 at 05:21
0

(1) If the selection must be done by your Java program using Super CSV, then (and I quote) "you'll have to use CsvListReader". In particular: listReader.length()

See this Super CSV page for details.

(2) If you can perform the selection by preprocessing the CSV file, then you might wish to consider a suitable command-line tool (or tools, depending on how complicated the CSV format is). If the delimiter of the CSV file does not occur within any field, then awk would suffice. For example, if the assumption is satisfied, and if the delimiter is |, then the relevant awk filter could be as simple as:

awk -F'|' 'NF == 25 {print}'

If the CSV file format is too complex for a naive application of awk, then you may wish to convert the complex format to a simpler one; often TSV has much to recommend it.

peak
  • 105,803
  • 17
  • 152
  • 177