0

Having used SQL Server Bulk insert of CSV file with inconsistent quotes (CsvToOtherDelimiter option) as my basis, I discovered a few weirdnesses with the RemoveCSVQuotes part [it chopped the last char from quoted strings that contained a comma!]. So.. rewrote that bit (maybe a mistake?)

One wrinkle is that the client has asked 'what about data like this?'

""17.5179C,""

I assume if I wanted to keep using the CsvToOtherDelimiter solution, I'd have to amend the RegExp...but it's WAY beyond me... what's the best approach?

To clarify: we are using C# to pre-process the file into a pipe-delimited format prior to running a bulk insert using a format file. Speed is pretty vital.

Community
  • 1
  • 1
kpollock
  • 3,899
  • 9
  • 42
  • 61

3 Answers3

0

To convert a CSV string into a list of elements, you could write a program that keeps track of state (in quotes or out of quotes) as it processes the string one character at a time, and emits the elements it finds. The rules for quoting in CSV are weird, so you'll want to make sure you have plenty of test data.

The state machine could go like this:

  1. scan until quote (go to 2) or comma (go to 3)
  2. if the next character is a quote, add only one of the two quotes to the field and return to 1. Otherwise, go to 4 (or report an error if the quote isn't the first character in the field).
  3. emit the field, go to 1
  4. scan until quote (go to 5)
  5. if the next character is a quote, add only one of the two quotes to the field and return to 4. Otherwise, emit the field, scan for a comma, and go to 1.

This should correctly scan stuff like:

  • hello, world, 123, 456
  • "hello world", 123, 456
  • "He said ""Hello, world!""", "and I said hi"
  • ""17.5179C,"" (correctly reports an error, since there should be a separator between the first quoted string "" and the second field 17.5179C).

Another way would be to find some existing library that does it well. Surely, CSV is common enough that such a thing must exist?

edit:

You mention that speed is vital, so I wanted to point out that (so long as the quoted strings aren't allowed to include line returns...) each line may be processed independently in parallel.

redtuna
  • 4,586
  • 21
  • 35
  • I will have to confirm, but I think the client wants ""17.5179C,"" to end up as "17.5179C," on the database. – kpollock Mar 26 '13 at 16:43
  • Since there are many variants, it's possible that this makes sense for your client. I recommend you get a clear spec (or at least a lot of examples of what they want). It should be easy to adapt the algorithm I put up there to use "" as delimiters for quoted strings instead of ". – redtuna Mar 26 '13 at 16:47
  • @kpollock Or you could just use the CSV parser built into the .NET framework (http://msdn.microsoft.com/en-us/library/microsoft.visualbasic.fileio.textfieldparser.aspx) and get it right the first time. – Tergiver Mar 26 '13 at 18:33
  • @Tergiver nice, I didn't know about that parser. Of course, it's only useful here if it can be made to agree exactly with the dialect that your client wants to read. – redtuna Mar 26 '13 at 19:47
  • Looking around I also see there's some existing code there: http://www.codeproject.com/Articles/9258/A-Fast-CSV-Reader – redtuna Mar 26 '13 at 19:55
0

The accepted answer from your link starts with:

You are going to need to preprocess the file, period.

Why not transform your csv to xml? Then you would be able to verify your data against an xsd before storing into a database.

Alina B.
  • 1,256
  • 8
  • 18
0

I ended up using the csv parser that I don't know we had already (comes as part of our code generation tool) - and noting that ""17.5179C,"" is not valid and will cause errors.

kpollock
  • 3,899
  • 9
  • 42
  • 61