
I am writing code to process a list of tar.gz files, each of which contains multiple CSV files. I have encountered the error below:

com.opencsv.exceptions.CsvMalformedLineException: Unterminated quoted field at end of CSV line. Beginning of lost text: [,,,,,,
]
    at com.opencsv.CSVReader.primeNextRecord(CSVReader.java:245)
    at com.opencsv.CSVReader.flexibleRead(CSVReader.java:598)
    at com.opencsv.CSVReader.readNext(CSVReader.java:204)
    at uk.ac.shef.inf.analysis.Test.readAllLines(Test.java:64)
    at uk.ac.shef.inf.analysis.Test.main(Test.java:42)

The code causing this problem is below; see Line B.

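import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

import com.opencsv.CSVParser;
import com.opencsv.CSVParserBuilder;
import com.opencsv.CSVReader;
import com.opencsv.CSVReaderBuilder;
import org.apache.commons.compress.archivers.ArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;

public class Test {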
    public static void main(String[] args) {
        try {
            Path source = Paths.get("/home/xxxx/Work/data/amazon/labelled/small/Books_5.json.1.tar.gz");
            InputStream fi = Files.newInputStream(source);
            BufferedInputStream bi = new BufferedInputStream(fi);
            GzipCompressorInputStream gzi = new GzipCompressorInputStream(bi);
            TarArchiveInputStream ti = new TarArchiveInputStream(gzi);
            CSVParser parser = new CSVParserBuilder().withStrictQuotes(true)
                    .withQuoteChar('"').withSeparator(',')
                    .withEscapeChar('|')           // Line A
                    .build();
            BufferedReader br = null;
            ArchiveEntry entry;
            entry = ti.getNextEntry();
            while (entry != null) {
                br = new BufferedReader(new InputStreamReader(ti)); // Read directly from tarInput
                System.out.format("\n%s\t\t  > %s", new Date(), entry.getName());
                try{
                    CSVReader reader = new CSVReaderBuilder(br).withCSVParser(parser)
                            .build();
                    List<String[]> r = readAllLines(reader);
                } catch (Exception ioe){
                    ioe.printStackTrace();
                }
                System.out.println(entry.getName());
                entry=ti.getNextEntry();        // Line B
            }
        }catch (Exception e){
            e.printStackTrace();
        }
    }

    private static List<String[]> readAllLines(CSVReader reader) {
        List<String[]> out = new ArrayList<>();
        int line=0;
        try{
            String[] lineInArray = reader.readNext();

            while(lineInArray!=null) {
                //System.out.println(Arrays.asList(lineInArray));
                out.add(lineInArray);
                line++;
                lineInArray=reader.readNext();
            }
        }catch (Exception e){
            System.out.println(line);
            e.printStackTrace();
        }
        System.out.println(out.size());
        return out;
    }
}

I have attached a screenshot of the line in the CSV file that caused this problem (see line 5213). I have also included a test tar.gz file here: https://drive.google.com/file/d/1qHfWiJItnE19-BFdbQ3s3Gek__VkoUqk/view?usp=sharing

[Screenshot of the CSV file around line 5213]

While debugging, I came up with two questions.

  • I think the issue is the \ character in the data file (line 5213 above), which is the escape character in Java. I verified this idea by adding Line A to my code above, and it works. However, I obviously don't want to hardcode this, as other characters in the data could cause the same issue. So my first question is: is there any way to tell Java to ignore escape characters? Something like the opposite of withEscapeChar('|')? UPDATE: the answer is to use '\0', thanks to the first comment below.
  • When debugging, I noticed that my program stops working on the next .csv file within the tar.gz file as soon as it hits the above exception. To explain what I mean: inside the tar.gz file linked above, there are two CSV files, _10.csv and _110.csv. The problematic line is in _10.csv. When my program hits that line, an exception is thrown and the program moves on to the next file, _110.csv (entry=ti.getNextEntry();). This file is actually fine, but the method readAllLines, which is supposed to read this next CSV file, throws the same exception immediately on the first line. I don't think my code is correct, especially the while loop: I suspect the input stream is still stuck at the previous position that caused the exception. But I don't know how to fix this. Help please?
Ziqi
  • Did you try another escape character, such as the NUL character `'\0'`, as recommended in a [similar question](https://stackoverflow.com/questions/6008395/opencsv-in-java-ignores-backslash-in-a-field-value)? – hc_dev Feb 03 '22 at 19:01
  • You'll also have to determine what character the provider of your data uses for escaping embedded quotes in a string. The standard way to handle embedded double-quotes in CSV is to use two double-quote characters in succession, i.e. `"String containing "" a double quote"`. This is not technically an escape character in the same manner as the Java backslash as it only applies to the double quote character and is not a general escape. – Jim Garrison Feb 03 '22 at 19:06
  • Most recommend the [`RFC4180Parser`](http://opencsv.sourceforge.net/apidocs/com/opencsv/RFC4180Parser.html) to solve the escaping-backslash issue, as explained in the DZone article [OpenCSV: Properly Handling Backslashes](https://dzone.com/articles/properly-handling-backslashes-using-opencsv). – hc_dev Feb 03 '22 at 19:06
  • @hc_dev I just tried, and it works, thanks! I updated my post. I'd still like to know the answer to the second question, as I think my while loop is wrong... – Ziqi Feb 03 '22 at 19:07
  • It doesn't look like you're closing (or using try-with-resources for ) the BufferedReader and/or the other input streams. You might need to do some maintenance on that reader in the catch clause before changing state on the 'parent' resource manager `ti` – Gus Feb 03 '22 at 19:16
  • @Gus not sure that is the case, when I added 'reader.close()' within the try block, I get 'java.lang.NullPointerException at org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream.read(GzipCompressorInputStream.java:299)', when moving to the next file inside the tar – Ziqi Feb 03 '22 at 22:03
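
The NullPointerException mentioned in the last comment is consistent with `reader.close()` also closing the shared tar/gzip stream underneath, since closing a reader closes whatever it wraps. One possible way around that, sketched here with the standard library only, is to put a small wrapper that ignores `close()` between the per-entry reader and the shared stream. `NonClosingInputStream`, `TrackingStream`, and `readFirstLine` are hypothetical names introduced for illustration; `TrackingStream` merely stands in for the archive stream so the effect can be observed.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class CloseShieldDemo {

    /** Hypothetical wrapper: forwards reads to the wrapped stream but ignores close(). */
    static final class NonClosingInputStream extends FilterInputStream {
        NonClosingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public void close() {
            // Deliberately empty: whoever opened the shared stream closes it.
        }
    }

    /** Stand-in for the shared archive stream; records whether close() was called. */
    static final class TrackingStream extends ByteArrayInputStream {
        boolean closed = false;

        TrackingStream(String text) {
            super(text.getBytes(StandardCharsets.UTF_8));
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Reads the first line via a reader that is closed on exit; optionally shields the shared stream. */
    static String readFirstLine(InputStream shared, boolean shield) {
        InputStream source = shield ? new NonClosingInputStream(shared) : shared;
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(source, StandardCharsets.UTF_8))) {
            return br.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // With the shield, closing the reader leaves the shared stream open.
        TrackingStream shielded = new TrackingStream("a,b\nc,d\n");
        readFirstLine(shielded, true);
        System.out.println("shielded stream closed: " + shielded.closed);

        // Without it, closing the reader closes the shared stream as well.
        TrackingStream unshielded = new TrackingStream("a,b\nc,d\n");
        readFirstLine(unshielded, false);
        System.out.println("unshielded stream closed: " + unshielded.closed);
    }
}
```

In the question's loop this would correspond to wrapping `ti` before handing it to `InputStreamReader`, so each per-entry reader can be closed safely. Apache Commons IO ships a similar ready-made wrapper, `CloseShieldInputStream`.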

1 Answer


Using RFC4180Parser worked for me.
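
A minimal sketch of what that can look like, assuming opencsv 5.x is on the classpath. `Rfc4180Demo` and `parse` are hypothetical names, and the sample input is made up for illustration. `RFC4180Parser` treats the backslash as ordinary data, so a field that ends in `\` right before its closing quote should no longer swallow the rest of the line. The sketch also builds a new parser for each input; that is only a precaution, based on the unverified assumption that a parser instance shared across readers might carry over partial multi-line state after a failed read.

```java
import com.opencsv.CSVReader;
import com.opencsv.CSVReaderBuilder;
import com.opencsv.RFC4180Parser;
import com.opencsv.RFC4180ParserBuilder;

import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class Rfc4180Demo {

    /** Hypothetical helper: parses CSV text with RFC4180Parser and returns all rows. */
    static List<String[]> parse(String csv) {
        // A fresh parser per input (assumption: avoids reusing any leftover parser state).
        RFC4180Parser parser = new RFC4180ParserBuilder().build();
        List<String[]> rows = new ArrayList<>();
        try (CSVReader reader = new CSVReaderBuilder(new StringReader(csv))
                .withCSVParser(parser)
                .build()) {
            String[] row;
            while ((row = reader.readNext()) != null) {
                rows.add(row);
            }
        } catch (Exception e) {
            // readNext() declares checked exceptions; rethrow unchecked to keep the sketch short.
            throw new IllegalStateException("Could not parse CSV", e);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Made-up sample: the quoted field in the first row ends with a backslash.
        String csv = "id,\"path\\\",x\n"
                   + "2,\"plain\",y\n";
        for (String[] row : parse(csv)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

In the question's loop, the equivalent change would be to build the parser and the `CSVReader` inside the `while` loop, once per tar entry, instead of sharing one `CSVParser` across all entries.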

Golu