0

We receive files in a number of different formats - CSV, TSV, or other flat files using more exotic delimiters (|, ; etc). These files may also use text qualifiers, again in a range of formats (every field qualified vs only those containing a delimiter qualified, different chars used ', " etc).

I have written a tool that is able to successfully identify delimiters in the file using a frequency analysis technique not unlike the Python sniffer class mentioned here: How should I detect which delimiter is used in a text file?

I'm now attempting to extend the tool to support text qualified files. The difficulty here is that frequency analysis is insufficient to identify text qualifiers, as many forms of CSV will only wrap fields containing the delimiter with text qualifiers, so for example a file with 10k rows might only have 2 occurrences of the text qualifier in the whole file.

My current approach is to scan the file looking for delimiter-text qualifier pairs (e.g. ,' and ',) and then compare them to other potential pairs (e.g. ," and ",) and select the most frequently occurring.

Can anyone offer a more robust alternative? A key constraint to the problem is that I must support files in any of the many different flavors of CSV that can be created. My goal is to support as many cases as possible without user intervention.

Community
  • 1
  • 1
Root_Kabal
  • 53
  • 1
  • 11

2 Answers2

2

You could try some pattern matching with regular expressions. Since you already know the delimiter you could try a list of common qualifier such as " or '. If those fail you could try to parse the lines with incorrect column alignment for inputs that would match the pattern and produce the expected column alignment.

var delimiter = ",";
var qualifiers = new[] { "\"", "'" };
var input = @"""Hello, World"", Hello, World";
var pattern = @"(?<={1}).*{0}.*(?={1})";

foreach(var p in qualifiers.Select(q => string.Format(pattern, delimiter, q)))
{
    Regex.Match(input, p);
}
Dustin Kingen
  • 20,677
  • 7
  • 52
  • 92
  • Thanks, I will look into this further. I thought that Regex's could be the way to go, but was struggling to think of the right expression as there's several cases to consider - if the first field is text qualified, it would be "...", for example, rather than ,"...", and this is flipped for a text qualified final field. Your answer could be just the springboard I need though. – Root_Kabal May 30 '13 at 08:12
0

Can you extract special characters with their index using regex? On index you can get the sequence.

Amit
  • 882
  • 6
  • 13