Unicode aware CSV parser in C++

Question

This might be a duplicate of "C CSV API for unicode" or "How can I read and parse CSV files in C++?", but not exactly. The first one discusses a C library that could work but needs some code modification. The second one doesn't say much about Unicode support. I would rather open a new question than pollute the existing ones.

Since I am not an expert in i18n and Unicode encodings, I wonder whether such a library exists for C++ out of the box.

Currently, my best effort is to call Python's csv parser from C++, which is pretty slow.

You can probably find something "off the shelf" on github... Or you can just write your own in 30 minutes. — Nicolas Louis Guillemot, Mar 30 '14 at 22:13
The C++ language treats Unicode like it is a bad case of body odor. Nothing that isn't fixable, using the ICU library is boilerplate. The usual hangup is to want to stop using ICU as soon as possible, just using it to convert the file/text and get back to friendly Mr Rogers neighborhood of std::string or const char* as quick as you can. Big Mistake. No real point in using C++ for text conversion anyway, Python will do it just as quickly. File conversion is I/O bound, it doesn't need a fast language. — Hans Passant, Mar 30 '14 at 22:29
@HansPassant Well, I can't choose another language. I am writing a CSV loader for sqlite3 and it has to be built into a .so file. The simple CSV loader that comes with sqlite3 doesn't support i18n correctly, so I have to implement one myself. — Kan Li, Mar 31 '14 at 04:08
You can choose ICU. Have you done so? Where did you get stuck? — Hans Passant, Mar 31 '14 at 09:04
@HansPassant, where I get stuck is that I don't want to go into the details of the CSV file format by reading some boring RFC standards, nor do I want to handle all sorts of dialects (as in Python's csv package), nor the subtleties when Unicode characters kick in. Well, I could read the docs to fully understand all of these details, but I would like to see if there exists prior work that I can re-use. Being lazy is the virtue of a programmer. — Kan Li, Mar 31 '14 at 10:22
By Unicode, do you mean UTF-8? If so, handling UTF-8 is not much different from handling text as usual. — 200_success, Apr 06 '14 at 12:39

score 0 · Answer 1 · edited May 23 '17 at 12:14

From a CSV perspective it actually doesn't matter at all, because all traditional CSV control characters (comma, double quote, carriage return, line feed) are in the lower ASCII range, and their code points are the same in ASCII and in Unicode. You should rather ask, "how do I read or write file content as a stream of Unicode characters" in your particular programming language. You can then interpret that character stream as whatever format you want, including CSV.
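To illustrate the point, here is a minimal sketch, assuming the input is UTF-8 and that one complete record is already in memory. The function name `split_csv_record` is hypothetical and not taken from any library. It scans byte by byte, which is safe for UTF-8 because every byte of a multi-byte sequence is 0x80 or higher and so can never be mistaken for a comma or a double quote. It does not handle a byte-order mark, UTF-16 input, or finding record boundaries in a file.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: splits one CSV record (UTF-8 bytes in a std::string)
// into fields. Quoted fields may contain delimiters, line breaks, and
// doubled quotes ("") that stand for one literal quote.
std::vector<std::string> split_csv_record(const std::string& record,
                                          char delim = ',') {
    // Byte-wise scanning is only safe for an ASCII delimiter: in UTF-8 every
    // byte of a multi-byte sequence is >= 0x80, so it never equals one.
    assert(static_cast<unsigned char>(delim) < 0x80);

    std::vector<std::string> fields;
    std::string current;
    bool in_quotes = false;

    for (std::size_t i = 0; i < record.size(); ++i) {
        const char c = record[i];
        if (in_quotes) {
            if (c == '"') {
                if (i + 1 < record.size() && record[i + 1] == '"') {
                    current += '"';     // doubled quote -> literal quote
                    ++i;
                } else {
                    in_quotes = false;  // closing quote
                }
            } else {
                current += c;           // any other byte, ASCII or not
            }
        } else if (c == '"') {
            in_quotes = true;           // opening quote
        } else if (c == delim) {
            fields.push_back(current);  // end of field
            current.clear();
        } else {
            current += c;               // non-ASCII bytes pass through as-is
        }
    }
    fields.push_back(current);          // last field (possibly empty)
    return fields;
}
```

The returned fields are still UTF-8 byte strings. If the data arrives in another encoding, converting it to UTF-8 first (for example with ICU, as suggested in the comments) would keep this kind of parser unchanged.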
