I'm looking for a Java implementation of a CSV (comma-separated values) parser that properly handles Unicode data, e.g. UTF-8 CSV files with Chinese text. I suppose such a parser should internally use code-point-related methods while iterating, comparing, etc. An Apache 2 license or similar would work best.
-
http://sourceforge.net/projects/javacsv/ try it – Bozho Dec 23 '09 at 18:17
-
Most CSV parsers should handle 16-bit characters. Are you saying you need 32-bit character support? – Peter Lawrey Dec 23 '09 at 18:19
-
I tried a couple of parsers, including an in-house one from another project. They all seem to split fields internally by 1) reading a line and 2) walking over the line with charAt() and appending to a temporary char buffer. I have UTF-8 files with Chinese text where some symbols are encoded with 3 bytes, so that doesn't work. Even a leading BOM seems to be handled incorrectly by many parsers. – Igor Romanov Dec 23 '09 at 19:33
-
"Proper handling" of UTF-8 should not be an issue if it's real UTF-8, because decoding is already handled by Java (InputStreamReader with an explicit charset) and is not something the parser should care about. The question is quite old; maybe it's high time to accept an answer? – Danubian Sailor Sep 01 '14 at 13:31
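To illustrate the point in the comment above, here is a minimal sketch, assuming the input really is UTF-8. The class `Utf8ReaderDemo` and the helper `readFirstLine` are hypothetical names made up for this example. It shows that a 3-byte UTF-8 sequence arrives in the parser as a single Java `char`, so `charAt()` is not the problem by itself, and that the decoder leaves a leading BOM in the text as U+FEFF, which the caller has to drop.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class Utf8ReaderDemo {

    // Hypothetical helper: decodes the stream as UTF-8, returns its first line,
    // and drops a leading byte order mark if one is present.
    static String readFirstLine(InputStream in) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line = reader.readLine();
            // Java's UTF-8 decoder keeps the BOM as the character U+FEFF.
            if (line != null && !line.isEmpty() && line.charAt(0) == '\uFEFF') {
                line = line.substring(1);
            }
            return line;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // BOM, two Chinese characters (3 bytes each in UTF-8), then ",abc".
        byte[] bytes = "\uFEFF\u4E2D\u6587,abc\n".getBytes(StandardCharsets.UTF_8);
        String line = readFirstLine(new ByteArrayInputStream(bytes));
        System.out.println(line.length());     // 6: each 3-byte sequence is one char
        System.out.println(line.indexOf(',')); // 2
    }
}
```

Characters outside the Basic Multilingual Plane still occupy two `char` values (a surrogate pair), but since the comma and the quote are plain ASCII, a parser that copies `char` values in order keeps such pairs intact.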
3 Answers
I don't believe in reinventing the wheel. So I do not want to write my own parser and go through the same headaches someone else did.
I personally like the CSV Parser from Ostermiller. They also have a Maven Repository if interested.
You can also check out OpenCSV. There is already a Stack Overflow question about parsing Unicode with it.

-
This one looks good, and it is even stated explicitly to support Chinese, but I think it's GPL, which is something I cannot use for my work. – Igor Romanov Dec 23 '09 at 19:51
It's pretty easy to write yourself. Open the file with a FileInputStream and an InputStreamReader that uses UTF-8. Wrap it in a BufferedReader so you can iterate through it using readLine(). Get each line as a String. Use regular expressions to split it into fields.
The only tricky part is constructing the regexes so they don't treat commas that are enclosed within quotes as field delimiters.
The approach above is a bit inefficient, but fast enough for most apps. If you have real performance requirements, you'll need something that iterates through characters. I wrote one a few years ago that uses a state machine, and it worked OK.
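The character-iterating variant could look roughly like the sketch below. This is not the parser mentioned in the answer; the class `SimpleCsv`, the method `parseLine`, and the three state names are assumptions made for illustration. It reads the file through a UTF-8 InputStreamReader as described above and treats `""` inside a quoted field as an escaped quote.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class SimpleCsv {

    // QUOTE_IN_QUOTED means: we just saw a quote while inside a quoted field,
    // so the next char decides whether it was an escape ("") or the closing quote.
    private enum State { UNQUOTED, QUOTED, QUOTE_IN_QUOTED }

    // Hypothetical helper: splits one line into fields with a small state machine.
    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        State state = State.UNQUOTED;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            switch (state) {
                case UNQUOTED:
                    if (c == ',') { fields.add(field.toString()); field.setLength(0); }
                    else if (c == '"') { state = State.QUOTED; }
                    else { field.append(c); }
                    break;
                case QUOTED:
                    if (c == '"') { state = State.QUOTE_IN_QUOTED; }
                    else { field.append(c); } // commas are ordinary text here
                    break;
                case QUOTE_IN_QUOTED:
                    if (c == '"') { field.append('"'); state = State.QUOTED; }
                    else if (c == ',') {
                        fields.add(field.toString());
                        field.setLength(0);
                        state = State.UNQUOTED;
                    } else { field.append(c); state = State.UNQUOTED; }
                    break;
            }
        }
        fields.add(field.toString()); // the last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to a UTF-8 encoded CSV file.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new FileInputStream(args[0]), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(parseLine(line));
            }
        }
    }
}
```

Because the sketch works line by line, a quoted field that contains a line break is split incorrectly; handling that case means feeding the state machine from the Reader directly instead of from readLine().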

-
That's more straightforward than I can afford without having bad dreams at night :-) I'm now looking for a ready-to-use library. – Igor Romanov Dec 23 '09 at 19:37
-
This is actually *not* straightforward. The simple case can be handled with regexes, but when you get into fields that themselves contain commas or the (optional) quote delimiters, regex will not work. Regex is a fine tool for certain jobs, but it is not a substitute for a well-written parser. – Kevin Day Dec 24 '09 at 03:43
-
I think it will work; it will just be a bit more complex. Google turns up a good regexp to use right away, see here for example: http://www.programmersheaven.com/user/Jonathan/blog/73-Splitting-CSV-with-regex/ – Igor Romanov Dec 24 '09 at 11:08
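For reference, one commonly used pattern of this kind splits on a comma only when an even number of quote characters follows it on the line. The sketch below is an assumption for illustration and is not necessarily the regex from the linked post; the class `RegexCsvSplit` and the method `splitLine` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class RegexCsvSplit {

    // A comma followed by an even number of quotes up to the end of the line
    // lies outside any quoted field.
    private static final Pattern COMMA_OUTSIDE_QUOTES =
            Pattern.compile(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*$)");

    // Hypothetical helper: splits one line and unwraps quoted fields.
    static List<String> splitLine(String line) {
        List<String> fields = new ArrayList<>();
        // Limit -1 keeps trailing empty fields such as "a,,".
        for (String raw : COMMA_OUTSIDE_QUOTES.split(line, -1)) {
            if (raw.length() >= 2 && raw.startsWith("\"") && raw.endsWith("\"")) {
                // Drop the surrounding quotes and turn "" back into ".
                raw = raw.substring(1, raw.length() - 1).replace("\"\"", "\"");
            }
            fields.add(raw);
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(splitLine("a,\"b,c\",d"));
    }
}
```

This handles embedded commas and doubled quotes on a single line, but the lookahead rescans the rest of the line for every comma, and a quoted field that spans several lines still breaks it, which is the kind of case the previous comment warns about.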