3

I have data as follows

ID1,ID2,FIRST_NAME,LAST_NAME,BIRTH_DATE,HA1,HA2,HA3,STATUS,DT
99,13863926H,MAL"COLMHS,ABBOT,1997-04-09,AMKC,RR,RR  ,DE,
89,12973388H,"SAGAR,TARLE",ABDAT,1997-11-02,RNDC,RR,RR  ,DE,
71,88JunkTest,Howdy,Doody,1985-11-02,RNDC,HA,HACLASSTYPE  ,DE,2019-12-25

I am trying to parse the csv using open CSV where in my CSV first name can contain double quotes(MAL"COLMHS) or double quotes with a comma ("SAGAR,TARLE") or first name without a double quote.

So using .withIgnoreQuotations(true) I can parse first row(MAL"COLMHS) but not able to find the solution to parse 2nd row.

I tried with the solutions with multiple StackOverflow links but not able to solve them.

I know my CSV file is inconsistent but there are too many of such records present in CSV file from the client and its hard to make it consistent manually so trying to search automated solution.

 List<Results> beans = new CsvToBeanBuilder<Results>(new FileReader(file.getAbsolutePath()))
                            .withType(Results.class)
                            .withIgnoreQuotations(true)
                            .build().parse();

ERROR

java.lang.RuntimeException: Error parsing CSV line: 3. [3491903139,12973388H,SAGAR,TARLE,ABDAT,1997-11-02,RNDC,RR,RR  ,DE,]
    at com.opencsv.bean.CsvToBean.parse(CsvToBean.java:366)
    at com.apds.partner.nycdoc.main.NycDocApplication.main(NycDocApplication.java:81)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.springframework.boot.devtools.restart.RestartLauncher.run(RestartLauncher.java:49)
Caused by: com.opencsv.exceptions.CsvRequiredFieldEmptyException: Number of data fields does not match number of headers.
    at com.opencsv.bean.HeaderColumnNameMappingStrategy.verifyLineLength(HeaderColumnNameMappingStrategy.java:110)
    at com.opencsv.bean.AbstractMappingStrategy.populateNewBean(AbstractMappingStrategy.java:313)
    at com.opencsv.bean.concurrent.ProcessCsvLine.processLine(ProcessCsvLine.java:132)
    at com.opencsv.bean.concurrent.ProcessCsvLine.run(ProcessCsvLine.java:85)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
*****

Edit: I tried with SuperCSV also but same issue

stackUser
  • 545
  • 3
  • 9
  • 21
  • You should post the code you tried so we can tell where is wrong – Sterconium Nov 11 '19 at 11:14
  • Please check the updated code – stackUser Nov 11 '19 at 11:28
  • So "SAGAR,TARLE" would be parse to FIRST_NAME with value as SAGAR,TARLE? Let me know if my understanding is correct. – hagarwal Nov 11 '19 at 11:42
  • You need to fix your csv. You can't ignore double quotes and use them to enclose fields at the same time. – Oleg Nov 11 '19 at 11:44
  • @hagarwal yes, "SAGAR,TARLE" would be parsed to FIRST_NAME with value as SAGAR, TARLE or – stackUser Nov 11 '19 at 11:56
  • Probably, this post https://stackoverflow.com/questions/41948442/parse-csv-with-opencsv-with-double-quotes-inside-a-quoted-field may help you. Good luck! – Dmitriy S Nov 11 '19 at 12:26
  • By the way, the best solution would be to ask your client to provide a well-formed CSV file, I believe. You will have one less problem while dealing with this file, customer will have one less bug, and the world's total technical debt will decrease by a bit. :D – Xobotun Nov 11 '19 at 14:13

2 Answers2

0

You just have a malformed csv file. According to RFC-4180, section 2.5

If fields are not enclosed with double quotes, then double quotes may not appear inside the fields.

and section 2.7

If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote.

I've looked up this question, try replacing all single double-quotes with double double-quotes. And don't forget to wrap all the field within the double-quotes.

In your example, 99,13863926H,"MAL""COLMHS",ABBOT,1997-04-09,AMKC,RR,RR ,DE, should work.

UPD: Well, if you do not want to edit manually to make it RFC-compliant I suggest you running this regex: ^(?:\d*,[^,]*,)([^"]\w+(?:"\w+)+)(?:,) against your file to check how many of the wrong records there are.

You may want to use the only capturing group to extract the malformed name and escape it correctly, then write the changes back to the file and re-read it with the parser of your choice.

Community
  • 1
  • 1
Xobotun
  • 1,121
  • 1
  • 18
  • 29
0

I think that the real problem here is that your CSV file is non-conformant.

The first data line has 10 fields, one of which contains an unbalanced double-quote.

  • If you don't ignore double quotes, then the first data line is not parsable.

  • If you do ignore double quotes, then the second data line has 11 fields.

Basically, the first line is malformed. It should say this:

 99,13863926H,"MAL""COLMHS",ABBOT,1997-04-09,AMKC,RR,RR  ,DE,

I don't think there is a good way to fix this, apart from rejecting the malformed input:

  • If the problem is bad data, get a human being to fix the (hand created) input file or the data source that the input file is extracted from.

  • If the problem is in the program that is extracting the data and generating the CSV, then fix that.

If you really want to parse this input as-is, you will need to implement your own CSV parser by hand to do the job. OpenCSV won't handle this input, and nor will any other standards-based parser.

Stephen C
  • 698,415
  • 94
  • 811
  • 1,216