What you have here is an unusual dialect of CSV.
Although there is no formalised standard for CSV, there are broadly two approaches to quotes:
- Quotes are not special. That is, `7" single, 12" album` is two items: `7" single` and `12" album`. In this dialect, items containing `,` are problematic.
- Quotes are special. That is, `"you, me","me you"` is two items: `you, me` and `me you`. In this dialect, you can put quotes around an item so that it can contain a `,`. However, it makes items containing `"` problematic, as you have found.
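To make the two dialects concrete, here is a minimal sketch using Python's standard `csv` module. The function names `split_plain` and `split_quoted` are mine, purely for illustration, and I'm assuming you don't want the space after each comma kept in the first dialect.

```python
import csv

def split_plain(line):
    # Dialect 1: quotes are ordinary characters (QUOTE_NONE).
    # skipinitialspace drops the space that follows each comma.
    return next(csv.reader([line], quoting=csv.QUOTE_NONE, skipinitialspace=True))

def split_quoted(line):
    # Dialect 2: quotes are special (the csv module's default behaviour).
    return next(csv.reader([line]))

print(split_plain('7" single, 12" album'))   # ['7" single', '12" album']
print(split_quoted('"you, me","me you"'))    # ['you, me', 'me you']
```

Feeding `you, me` to `split_plain` gives two items, which is exactly the comma problem described for the first dialect.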
The typical answer to the `"` problem in the second approach is to escape quotes. So the item `7" single` would appear in the CSV as `"7\" single"`. This of course means that `\` becomes a problem, but that's easily solved the same way: `AC\DC 7" single` appears in the CSV as `"AC\\DC 7\" single"`. (Some dialects double the quote instead, writing `"7"" single"`.)
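As a rough illustration, Python's `csv` reader can consume the backslash-escaped form if you tell it about the escape character. The helper name `parse_escaped` is hypothetical; `escapechar` and `doublequote` are real parameters of `csv.reader`.

```python
import csv

def parse_escaped(line):
    # Treat backslash as the escape character and switch off
    # the quote-doubling convention that the module uses by default.
    return next(csv.reader([line], escapechar='\\', doublequote=False))

# Raw string, so the backslashes below are exactly what the file would contain.
items = parse_escaped(r'"AC\\DC 7\" single","7\" single"')
for item in items:
    print(item)
```

The first item comes back as `AC\DC 7" single` and the second as `7" single`.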
If you can adopt one of these conventional approaches, then do so. You can then either use an existing CSV library or roll your own. Although a regex can consume these formats, in my opinion it's not the clearest way to write code that consumes CSV: I've found that a more explicit state machine (e.g. a `switch (state)` statement) is nice and clear.
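Here is a sketch of what I mean by an explicit state machine, for the quotes-are-special dialect with backslash escapes. Python has no `switch`, so an `if`/`elif` chain over the state stands in for it. The state names and `parse_line` are my own invention, and I've chosen to reject text between a closing quote and the next comma; you may prefer different rules.

```python
from enum import Enum, auto

class State(Enum):
    FIELD_START = auto()   # at the beginning of an item
    UNQUOTED = auto()      # inside an item that did not start with a quote
    QUOTED = auto()        # inside a quoted item
    ESCAPED = auto()       # just saw a backslash inside a quoted item
    AFTER_QUOTE = auto()   # just saw the closing quote

def parse_line(line: str) -> list[str]:
    fields, current = [], []
    state = State.FIELD_START
    for ch in line:
        if state is State.FIELD_START:
            if ch == '"':
                state = State.QUOTED
            elif ch == ',':
                fields.append('')          # empty item
            else:
                current.append(ch)
                state = State.UNQUOTED
        elif state is State.UNQUOTED:
            if ch == ',':
                fields.append(''.join(current))
                current = []
                state = State.FIELD_START
            else:
                current.append(ch)
        elif state is State.QUOTED:
            if ch == '\\':
                state = State.ESCAPED
            elif ch == '"':
                state = State.AFTER_QUOTE
            else:
                current.append(ch)
        elif state is State.ESCAPED:
            current.append(ch)             # take the escaped character literally
            state = State.QUOTED
        elif state is State.AFTER_QUOTE:
            if ch != ',':
                raise ValueError("unexpected character after closing quote")
            fields.append(''.join(current))
            current = []
            state = State.FIELD_START
    if state in (State.QUOTED, State.ESCAPED):
        raise ValueError("unterminated quoted item")
    fields.append(''.join(current))        # the last item has no trailing comma
    return fields
```

Every decision about what a character means sits in one visible branch, so changing a rule means editing one place.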
If you can't change your input format, the puzzle you have to solve is this: when you encounter a `"`, is it a metacharacter (part of a pair of quotes surrounding an item), or is it a real character that's part of the item?
As owner of the format, it's up to you to decide what the rule is. Perhaps a `"` should only be considered a metacharacter if it's next to a `,`. But even that causes problems if you allow a mixture of quoted and unquoted items:
"A Town Called Malice", The Jam, 7", £6.99
Here the `"` after `7` is next to a `,`, yet it is a real character (an inch mark), not a metacharacter. So you must come up with your own rules that work in your domain, and write explicit code to handle that situation. One approach is to pre-process the input into canonical CSV so that it's again suitable for a conventional CSV parser.
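Purely as an example of such a pre-processor, here is a sketch built on one possible rule set. These rules are my assumption, not something your data guarantees: a `"` opens an item only when it is the item's first non-space character; inside a quoted item, a `"` closes it only when followed by optional spaces and then a `,` or the end of the line; any other `"` is a real character. The names `split_loose` and `to_canonical` are hypothetical.

```python
def split_loose(line: str) -> list[str]:
    """Split one line using the assumed domain rules described above."""
    items = []
    i, n = 0, len(line)
    while True:
        while i < n and line[i] == ' ':        # skip spaces before the item
            i += 1
        if i < n and line[i] == '"':           # quoted item
            i += 1
            buf = []
            while i < n:
                if line[i] == '"':
                    j = i + 1
                    while j < n and line[j] == ' ':
                        j += 1
                    if j == n or line[j] == ',':
                        i = j                  # this quote closes the item
                        break
                buf.append(line[i])            # otherwise it is a real character
                i += 1
            else:
                raise ValueError("unterminated quoted item")
            items.append(''.join(buf))
        else:                                  # unquoted item: quotes are literal
            j = line.find(',', i)
            if j == -1:
                j = n
            items.append(line[i:j].rstrip())
            i = j
        if i >= n:
            return items
        i += 1                                 # step over the comma

def to_canonical(line: str) -> str:
    """Re-emit the line with every item quoted and backslash-escaped."""
    def quote(item: str) -> str:
        return '"' + item.replace('\\', '\\\\').replace('"', '\\"') + '"'
    return ','.join(quote(item) for item in split_loose(line))

print(to_canonical('"A Town Called Malice", The Jam, 7", £6.99'))
```

With these rules the example line splits into `A Town Called Malice`, `The Jam`, `7"` and `£6.99`. They still guess wrong on a quoted item that itself contains `",`, so treat this as a starting point rather than a general solution.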