Can't seem to find this anywhere on stackoverflow so here it goes:

I have a file, and I want to discover whether it is pipe (|) or comma (,) separated. I also want to tell whether the text qualifier is a quote (") or nothing. Does anyone have any C# functions that do this? Thanks!

Badmiral
  • **Discover** what delimiter is used? What heuristic did you have in mind? – Oded May 07 '12 at 17:53
  • Basically search through a string, try to parse it, and put the delimiter into some char or string – Badmiral May 07 '12 at 18:01
  • Do you know anything about the data, such as the number of items per row? – Servy May 07 '12 at 18:01
  • Do you mean for any arbitrary file? What do you **know** about these files? – Oded May 07 '12 at 18:02
  • Pick a delimiter and count how many times it occurs in a significant number of rows. If it always occurs the same number of times as the number of columns, that's probably your delimiter. If the other delimiter gives you the same result, you're screwed. If neither delimiter gives this result, you need to apply more assumptions. – Igby Largeman May 07 '12 at 18:25

3 Answers

This is off the top of my head, and it assumes that every row in the file has the same number of columns and that you have a list of characters that are possible delimiters.

char[] delims = { '|', ',' /* , ... */ };

Take a subset of the lines, or the whole file if it is small enough, and store them in a string array.

string[] lines = text.Split(new char[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);

Loop through the delimiters, inserting the count of split entries using that delimiter into an array of ints:

int[] counts = lines.Select(s => s.Split(currentDelimiter).Length).ToArray();

Use your own method to check that all the counts are equal to each other and are all greater than 1. If they are, the delimiter you are on is the one to use.
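The steps above can be put together roughly as follows. This is only a sketch under the answer's own assumptions (every row has the same number of columns, and no delimiter appears inside a quoted field); the class and method names `DelimiterGuesser` and `Guess` are made up for illustration.

```csharp
using System;
using System.Linq;

public static class DelimiterGuesser
{
    // Returns the first candidate that splits every line into the same
    // number of fields (more than one), or null if no candidate qualifies.
    // Caveat: a plain Split ignores text qualifiers, so a delimiter inside
    // a quoted field will throw the counts off.
    public static char? Guess(string text, char[] candidates)
    {
        string[] lines = text.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
        if (lines.Length == 0)
            return null;

        foreach (char candidate in candidates)
        {
            int[] counts = lines.Select(s => s.Split(candidate).Length).ToArray();
            if (counts[0] > 1 && counts.All(c => c == counts[0]))
                return candidate;
        }
        return null;
    }
}
```

Calling `DelimiterGuesser.Guess(text, new[] { '|', ',' })` tries the pipe first. If both candidates happen to pass, the order of the array decides, so put the more likely delimiter first.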

Derreck Dean
  • Way too many assumptions there. The OP has not given _nearly_ enough details for an answer to be formulated - only guesses. – Oded May 07 '12 at 18:07
  • Many comma/pipe delimited lists won't have the same number of items in each row, and you also need to account for the fact that some of the delimiters could be inside of string qualifiers, which would be a problem for your count. – Servy May 07 '12 at 18:08
  • Good point, @Servy. This could be a duplicate of http://stackoverflow.com/questions/761932/how-should-i-detect-which-delimiter-is-used-in-a-text-file – Derreck Dean May 07 '12 at 18:14

For delimiter-separated files such as this I find the TextFieldParser to be a very useful tool. (It lives in the Microsoft.VisualBasic.FileIO namespace; you can reference the Microsoft.VisualBasic assembly to use it in a C# app.)

The general strategy that I would use, since according to you there is a fixed number of columns per file, is to pick a delimiter and keep reading lines until one line has a different number of columns than the previous line. When that happens, switch to the other delimiter (I'm not sure what you want to do if both are invalid). You may also want to throw out a delimiter if it isn't found at all on the first line. Using the TextFieldParser with HasFieldsEnclosedInQuotes set to true, you can properly handle fields that are enclosed in quotes (it will still work just fine if no quotes are used). This will be much easier than trying to handle quotes manually with regular string manipulation.
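A minimal sketch of that strategy might look like the code below. The names `ParserDelimiterProbe`, `IsConsistent`, and `Detect` are hypothetical, and treating a `MalformedLineException` as "wrong delimiter" is an assumption of this sketch rather than something the parser prescribes. It reads from a string for brevity; a file path or stream works the same way.

```csharp
using System;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class ParserDelimiterProbe
{
    // True if every row parses into the same number of fields (more than one)
    // when the given delimiter is used and quoted fields are honoured.
    public static bool IsConsistent(string text, string delimiter)
    {
        using (var parser = new TextFieldParser(new StringReader(text)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(delimiter);
            parser.HasFieldsEnclosedInQuotes = true;

            int expected = -1;
            try
            {
                while (!parser.EndOfData)
                {
                    string[] fields = parser.ReadFields();
                    if (fields == null)
                        break;
                    if (expected == -1)
                        expected = fields.Length;
                    else if (fields.Length != expected)
                        return false; // column count changed: reject this delimiter
                }
            }
            catch (MalformedLineException)
            {
                // Assumption: a line the parser cannot read means a wrong guess.
                return false;
            }
            return expected > 1;
        }
    }

    // Returns the first candidate that parses consistently, or null if none does.
    public static string Detect(string text, params string[] candidates)
    {
        foreach (string candidate in candidates)
        {
            if (IsConsistent(text, candidate))
                return candidate;
        }
        return null;
    }
}
```

For example, `ParserDelimiterProbe.Detect(text, ",", "|")` tries the comma first and falls back to the pipe. A null result corresponds to the "both are invalid" case, which still needs a decision from you.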

Servy

Get the first line (or the second, if the first is a header with field names).

Then you can use regex to check the possible formats. i.e.

 Regex rePipesAndQualifier = new Regex("\"[^|\"]*\"\\|");

If rePipesAndQualifier.Matches(yourFileLine) returns several non-empty matches, then you know it uses pipes as separators and has text qualifiers.

Write some more regexes to check for comma-delimited lines, with and without a qualifier.
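As a rough sketch of how those checks could be generalized, the helper below counts quoted fields that are immediately followed by a given delimiter. `QualifierRegexProbe` and `CountQuotedFields` are made-up names, and the pattern is an assumption about what a qualified field looks like (it does not handle doubled quotes inside a field).

```csharp
using System.Text.RegularExpressions;

public static class QualifierRegexProbe
{
    // Counts occurrences of a double-quoted field directly followed by the
    // delimiter, e.g. "abc"| for a pipe or "abc", for a comma.
    public static int CountQuotedFields(string line, char delimiter)
    {
        // Regex.Escape makes the pipe literal; otherwise it would mean alternation.
        string pattern = "\"[^\"]*\"" + Regex.Escape(delimiter.ToString());
        return Regex.Matches(line, pattern).Count;
    }
}
```

Comparing `CountQuotedFields(line, '|')` with `CountQuotedFields(line, ',')` hints at which delimiter goes with the qualifier. If both return zero, the line probably has no text qualifier, and the delimiter has to be decided some other way.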

It depends a little bit on what you expect to get (all fields qualified, or only strings qualified) and what you know (whether the delimiters appear at the beginning and end or only in the middle, the number of fields, and so on). That's why I cannot give you an exact solution.

JotaBe
  • A pipe-delimited file can have fields with commas, and a comma delimited file can have fields with pipes. The existence of either [alone] tells you nothing. – Servy May 07 '12 at 18:03
  • If there can be a mix of everything and you don't have extra info, use a crystal ball. Seriously, there must be something that you know in advance. – JotaBe May 07 '12 at 18:05
  • Yes, and that's why we have asked the OP what he knows, or what he wants to base the decision off of, rather than just picking something ourselves which we won't know will work. – Servy May 07 '12 at 18:06
  • For a meaningful algorithm to be suggested, one does indeed need extra information above what the OP posted. As @Servy commented, you have answered without having any such information. – Oded May 07 '12 at 18:06
  • You know you have a file with an equal number of columns in each row; other than that you don't know anything: it's either pipe or comma delimited, it may have a text qualifier or not, and you know each row has the same number of columns – Badmiral May 07 '12 at 18:12
  • Please update your question with some samples: if a row has a delimiter, will all the rows use the same delimiter? Do you know the number of fields in advance? One regex could handle both kinds of delimiters, as well as dispose of the quote qualifiers if they exist. – JotaBe May 08 '12 at 09:34
  • @Servy. My answer is a guideline for how he can solve the problem. Of course it's not the exact solution, as I don't have all the needed information. When I write this kind of "follow this approach" answer and I'm given extra info, I edit and improve it. I also encourage the OP to edit his question to include the missing info, so that the question is useful to other people without their having to dig the extra info out of the comments. – JotaBe May 08 '12 at 09:40