Line breaks in field treated as end of line while parsing csv file

Question

In a CSV file, I have a record that renders like this:

,"SKYY SPA MARTINI

 2 oz. SKYY Vodka
 Fresh cucumber
 Fresh mint
 Splash of simple syrup

 Muddle cucumber & mint with syrup.
 Add SKYY Vodka and shake with ice. 
 Strain into a chilled martini glass. 
 Garnish with a fresh mint sprig and cucumber slice.",

with each line ending in an LF (line feed) character.

I thought this would be treated as a single string and that the line breaks inside it wouldn't be treated as new lines, but that isn't the case, and it is breaking my script. Is there a way to have the reader only have line breaks parsed if they're not flanked by quotes? This is my current code; I couldn't find a setting for the tokenizer that would allow me to do this.

    // instantiate description line mapper
    DelimitedLineTokenizer lineTokenizer = new DelimitedLineTokenizer();
    DefaultLineMapper<LCBOProduct> lineMapper = new DefaultLineMapper<>();

    lineMapper.setLineTokenizer(lineTokenizer);
    lineMapper.setFieldSetMapper(fieldSetMapper);

    // set description line mapper
    reader.setLineMapper(lineMapper);

    return reader;

`Is there a way to have the reader only have line breaks parsed if they're not flanked by quotes?` - I believe the convention is to use a double quote around the entire text for that column. I don't see the double quotes in your posted example. If your parser doesn't support this then maybe try CSVParser. — camickr, Aug 26 '18 at 00:38
Parsing CSV has some "interesting" corner cases. There is OpenCSV and Apache's CSVParser. Use one of these, they are well debugged. Also, the data you show is NOT CSV. At a minimum, if you have a multiline field it must be in quotes. If not, how would you distinguish a delimiting comma from a comma in the data? — Jim Garrison, Aug 26 '18 at 00:59
My apologies, I forgot to add the quotes that do exist for strings in this file. Also when debugging, it does state the quotation character as " — canadiancreed, Aug 26 '18 at 15:03

score 1 · Accepted Answer · answered Aug 26 '18 at 01:42

Inspired by this CSV regex post, I have written a quick-and-dirty method for doing this:

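import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class QuotedCsvSplit {

    public static void main(String[] args) {
        String data = "\"BEEP\",\"BOOP\",\"TWO SHOTS\rOF VODKA\"\r\"BOOP\",\"BEEP\",\"LEMON\rWEDGES\"";

        String quote = "\"";
        String splitter = "\r";
        String delimiter = ",";

        parse(data, delimiter, quote, splitter);
    }

    public static void parse(String data, String delimiter, String quote, String splitter) {
        // Match the splitter only when an even number of quote characters
        // follows it, i.e. when the splitter sits outside a quoted field.
        String regex = Pattern.quote(splitter)
                + "(?=(?:[^" + quote + "]*" + quote + "[^" + quote + "]*" + quote + ")*[^" + quote + "]*$)";

        String[] lines = data.split(regex, -1);

        List<String[]> records = new ArrayList<>();

        for (String line : lines) {
            // Plain split: a delimiter inside a quoted field is NOT protected here.
            records.add(line.split(Pattern.quote(delimiter), -1));
        }

        for (String[] record : records) {
            for (String field : record) {
                System.out.println("FIELD: " + field); // do whatever
            }
        }
    }
}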

Of course, given how large some CSV files can be, you will probably need to read the file incrementally into a StringBuilder and then call myStringBuilder.toString().split(regex, -1) inside the parse method.
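As a rough illustration of the incremental approach, here is a minimal sketch that avoids running the regex over the whole file. It reads one physical line at a time and only closes a record once the quotes are balanced. The class name QuotedRecordReader and the method readRecords are made up for this example, and it assumes that quotes inside a field are doubled (""), so that counting quote characters is enough to tell whether a field is still open. Note that BufferedReader.readLine() strips the original line terminator, so embedded line breaks come back as \n regardless of what the file used.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Spring or any library.
class QuotedRecordReader {

    /** Reads logical records, joining physical lines while a quoted field is still open. */
    static List<String> readRecords(Reader source, char quote) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean insideQuotes = false;

        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Each quote character toggles the state; a doubled quote
                // toggles twice and so leaves the state unchanged.
                for (int i = 0; i < line.length(); i++) {
                    if (line.charAt(i) == quote) {
                        insideQuotes = !insideQuotes;
                    }
                }
                current.append(line);
                if (insideQuotes) {
                    // The line break belongs to the field, so keep it.
                    current.append('\n');
                } else {
                    records.add(current.toString());
                    current.setLength(0);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // An unterminated quote leaves a partial record; keep it rather than drop data.
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }
}
```

Calling QuotedRecordReader.readRecords(new FileReader(path), '"') would then hand back one string per logical record, each of which can be split into fields separately.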

This is likely not the Spring way of doing things. But as Jim Garrison commented, this is an edge case, and I'm not sure whether Spring has a way of solving it.

A more complex regex may be required if the records start using other nasty characters (commas, quotes, etc.). I don't know what the source of these records could be, but some sanitizing may be in order before splitting the file.
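If the regex does get out of hand, a small character-by-character splitter may be easier to reason about. The sketch below is one possible shape, not a drop-in replacement for a real CSV library: QuotedFieldSplitter and splitFields are names invented for this example, and it assumes that a literal quote inside a quoted field is written as a doubled quote (""). Under that assumption, delimiters and line breaks inside quotes are kept as data, and the surrounding quotes are stripped from the result.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Spring or any library.
class QuotedFieldSplitter {

    /** Splits one logical record into fields, honouring quoted sections. */
    static List<String> splitFields(String record, char delimiter, char quote) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean insideQuotes = false;

        for (int i = 0; i < record.length(); i++) {
            char c = record.charAt(i);
            if (insideQuotes) {
                if (c == quote) {
                    if (i + 1 < record.length() && record.charAt(i + 1) == quote) {
                        field.append(quote); // doubled quote means a literal quote
                        i++;
                    } else {
                        insideQuotes = false; // closing quote
                    }
                } else {
                    field.append(c); // delimiters and line breaks are data here
                }
            } else if (c == quote) {
                insideQuotes = true; // opening quote
            } else if (c == delimiter) {
                fields.add(field.toString());
                field.setLength(0);
            } else {
                field.append(c);
            }
        }
        fields.add(field.toString()); // last field has no trailing delimiter
        return fields;
    }
}
```

For anything beyond simple files, the libraries mentioned in the comments (OpenCSV, Apache's CSVParser) are probably the safer choice, since they already cover these corner cases.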
