I am writing code to process a list of tar.gz files, each of which contains multiple CSV files. I have encountered the error below:
com.opencsv.exceptions.CsvMalformedLineException: Unterminated quoted field at end of CSV line. Beginning of lost text: [,,,,,,
]
at com.opencsv.CSVReader.primeNextRecord(CSVReader.java:245)
at com.opencsv.CSVReader.flexibleRead(CSVReader.java:598)
at com.opencsv.CSVReader.readNext(CSVReader.java:204)
at uk.ac.shef.inf.analysis.Test.readAllLines(Test.java:64)
at uk.ac.shef.inf.analysis.Test.main(Test.java:42)
And the code causing this problem is below, on line B.
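import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

import com.opencsv.CSVParser;
import com.opencsv.CSVParserBuilder;
import com.opencsv.CSVReader;
import com.opencsv.CSVReaderBuilder;
import org.apache.commons.compress.archivers.ArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;

public class Test {
    public static void main(String[] args) {
        try {
            Path source = Paths.get("/home/xxxx/Work/data/amazon/labelled/small/Books_5.json.1.tar.gz");
            InputStream fi = Files.newInputStream(source);
            BufferedInputStream bi = new BufferedInputStream(fi);
            GzipCompressorInputStream gzi = new GzipCompressorInputStream(bi);
            TarArchiveInputStream ti = new TarArchiveInputStream(gzi);
            // One parser instance is built here and reused for every CSV in the archive
            CSVParser parser = new CSVParserBuilder().withStrictQuotes(true)
                    .withQuoteChar('"').withSeparator(',')
                    .withEscapeChar('|') // Line A
                    .build();
            BufferedReader br = null;
            ArchiveEntry entry;
            entry = ti.getNextEntry();
            while (entry != null) {
                br = new BufferedReader(new InputStreamReader(ti)); // Read directly from tarInput
                System.out.format("\n%s\t\t > %s", new Date(), entry.getName());
                try {
                    CSVReader reader = new CSVReaderBuilder(br).withCSVParser(parser)
                            .build();
                    List<String[]> r = readAllLines(reader);
                } catch (Exception ioe) {
                    ioe.printStackTrace();
                }
                System.out.println(entry.getName());
                entry = ti.getNextEntry(); // Line B
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    // Reads every record from the reader; on failure, prints how many records were read so far
    private static List<String[]> readAllLines(CSVReader reader) {
        List<String[]> out = new ArrayList<>();
        int line = 0;
        try {
            String[] lineInArray = reader.readNext();
            while (lineInArray != null) {
                //System.out.println(Arrays.asList(lineInArray));
                out.add(lineInArray);
                line++;
                lineInArray = reader.readNext();
            }
        } catch (Exception e) {
            System.out.println(line);
            e.printStackTrace();
        }
        System.out.println(out.size());
        return out;
    }
}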
I have also attached a screenshot of the actual line in the CSV file that caused this problem (look at line 5213). A test tar.gz file is available here: https://drive.google.com/file/d/1qHfWiJItnE19-BFdbQ3s3Gek__VkoUqk/view?usp=sharing
While debugging, I came up with two questions.
- I think the issue is the \ character in the data file (line 5213 above), which is the escape character in Java. I verified this idea by adding line A to my code above, and with that change the file parses. However, I don't want to hardcode a replacement escape character, because other characters in the data could cause the same issue. So my question 1 is: is there any way to tell the parser to ignore escape characters altogether, something like the opposite of withEscapeChar('|')? UPDATE: the answer is to use '\0' as the escape character, thanks to the first comment below.
- When debugging, I noticed that once the exception above is thrown, my program also fails on the next .csv file within the tar.gz file. To explain what I mean: the tar.gz file linked above contains two CSVs, _10.csv and _110.csv. The problematic line is in _10.csv. When my program hits that line, an exception is thrown and the program moves on to the next file, _110.csv (entry=ti.getNextEntry();). This file is actually fine, but the method readAllLines, which is supposed to read it, throws the same exception immediately on the first line. I don't think my code is correct, especially the while loop: I suspect the input stream is still stuck at the previous position that caused the exception, but I don't know how to fix this. So my question 2 is: how can I fix this?
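To make both questions concrete, here is a toy sketch that uses only the standard library. It is not opencsv: ToyParser, feed and endsUnterminated are hypothetical names made up for illustration, and the two in-memory "files" stand in for _10.csv and _110.csv. It shows (a) how a backslash treated as an escape character can hide a closing quote, and why '\0' avoids that, and (b) one possible explanation for the second failure that differs from my stream theory: if a parser object remembers that a quoted field is still open, reusing that same object for the next file would make the next file fail too. I have not confirmed that opencsv's CSVParser behaves this way; this is only an assumption to test.

```java
import java.util.Arrays;
import java.util.List;

class QuoteStateSketch {

    /** Toy stand-in for a CSV parser (NOT opencsv); it only tracks whether a quoted field is open. */
    static final class ToyParser {
        private final char escape;     // '\0' means "no escape character"
        private boolean insideQuotes;  // state that survives between lines, and between files

        ToyParser(char escape) {
            this.escape = escape;
        }

        /** Feeds one line and returns true if a quoted field is still open afterwards. */
        boolean feed(String line) {
            for (int i = 0; i < line.length(); i++) {
                char c = line.charAt(i);
                if (escape != '\0' && c == escape && i + 1 < line.length()) {
                    i++; // skip the escaped character, e.g. the quote in \"
                } else if (c == '"') {
                    insideQuotes = !insideQuotes;
                }
            }
            return insideQuotes;
        }
    }

    /** Returns true if the input ends while a quoted field is still open. */
    static boolean endsUnterminated(ToyParser parser, List<String> lines) {
        boolean open = false;
        for (String line : lines) {
            open = parser.feed(line);
        }
        return open;
    }

    public static void main(String[] args) {
        // Second line ends with \" so a backslash escape swallows the closing quote
        List<String> badFile = Arrays.asList("\"id\",\"text\"", "\"1\",\"ends with backslash\\\"");
        List<String> goodFile = Arrays.asList("\"id\",\"text\"", "\"2\",\"fine\"");

        ToyParser shared = new ToyParser('\\');
        System.out.println(endsUnterminated(shared, badFile));               // true: quote never closes
        System.out.println(endsUnterminated(shared, goodFile));              // true: leftover state carries over
        System.out.println(endsUnterminated(new ToyParser('\\'), goodFile)); // false: fresh parser per file
        System.out.println(endsUnterminated(new ToyParser('\0'), badFile));  // false: no escape character
    }
}
```

If the same thing happens in opencsv, the analogous change in my code would be to build a new CSVParser inside the while loop, one per archive entry, instead of sharing a single instance across entries. That is still a guess on my part.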