We receive files in a number of different formats: CSV, TSV, or other flat files using more exotic delimiters (`|`, `;`, etc.). These files may also use text qualifiers in a range of styles: every field qualified versus only fields containing the delimiter qualified, and different qualifier characters (`'`, `"`, etc.).
I have written a tool that is able to successfully identify delimiters in the file using a frequency analysis technique not unlike the Python sniffer class mentioned here: How should I detect which delimiter is used in a text file?
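For context, the frequency-analysis approach can be sketched roughly as follows. This is a minimal illustration, not my actual tool; the candidate list and scoring heuristic are assumptions:

```python
from collections import Counter

CANDIDATES = [',', '\t', '|', ';']

def sniff_delimiter(sample_lines):
    """Pick the candidate delimiter whose per-line count is most consistent.

    A real delimiter tends to appear the same number of times on
    every row; noise characters do not.
    """
    scores = {}
    for delim in CANDIDATES:
        counts = [line.count(delim) for line in sample_lines]
        if not counts or max(counts) == 0:
            continue
        most_common_count, freq = Counter(counts).most_common(1)[0]
        if most_common_count == 0:
            continue
        # Score by (fraction of lines agreeing, fields per line).
        scores[delim] = (freq / len(counts), most_common_count)
    if not scores:
        return None
    return max(scores, key=lambda d: scores[d])
```

For example, `sniff_delimiter(["a,b,c", "d,e,f"])` picks `,` because it occurs exactly twice on every line, while the other candidates never appear.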
I'm now attempting to extend the tool to support text-qualified files. The difficulty is that frequency analysis alone is insufficient to identify text qualifiers, because many flavors of CSV only wrap fields that contain the delimiter; a file with 10k rows might contain only 2 occurrences of the qualifier in the whole file.
My current approach is to scan the file looking for delimiter-text qualifier pairs (e.g. ,' and ',) and then compare them to other potential pairs (e.g. ," and ",) and select the most frequently occurring.
Can anyone offer a more robust alternative? A key constraint is that I must support files in any of the many flavors of CSV that can be created; my goal is to handle as many cases as possible without user intervention.