1

I am building a tool that will accept a CSV or tab-delimited file, which will then be parsed and the data databased.

The uploaded file can be CSV or tab-delimited.

I came up with a workable solution (below) for detecting what format the file might be in and would like to know if there is a better way to solve this and/or how any of you out there have solved the same problem.

Thanks

<?php

$csv_comma='Fruit,Color
Apple,"Red,Green"
Tomato,"Red,Green"
Banana,Yellow
Tangerine,Orange
';


$csv_semi_colon='Fruit;Color
Apple;"Red,Green"
Tomato;"Red,Green"
Banana;Yellow
Tangerine;Orange
';


$tab_delimited='Fruit   Color
Apple   Red,Green
Tomato  Red,Green
Banana  Yellow
Tangerine   Orange';


$fileArr = array($csv_comma,$csv_semi_colon,$tab_delimited);


foreach($fileArr as $file){

   if(preg_match('/^(.+),(.+)/',trim($file))){

       echo "CSV with comma separator";

   }

   if(preg_match('/^(.+);(.+)/',trim($file))){

       echo "CSV with semi colon separator";

    }


   if(preg_match('/^(.+)\t(.+)/',trim($file))){

       echo "Tab delimited";

   }
}
Slinky
  • 5,662
  • 14
  • 76
  • 130
  • Any heuristic will give wrong results. You just cannot guess 100% right what separator was used – zerkms May 30 '13 at 12:18
  • 2
    [Here](http://stackoverflow.com/a/762307/20670)'s how Python does it. – Tim Pietzcker May 30 '13 at 12:20
  • Why not just try parsing the csv with `fgetcsv` with `, or ; ... etc` as your delimeter? If it fails, it must not be a valid csv. – brbcoding May 30 '13 at 12:21
  • Or why not just have three radio buttons in your upload form, where the user can specify which of the delimiters are used in the file they are uploading ...? – CBroe May 30 '13 at 12:26
  • It would be better if you read only the first line of the file and matching against it rather than matching the whole file - it will be a lot faster and will use less system resource usage. – tftd May 30 '13 at 13:14

2 Answers2

1

Well csv has this prety much implemented. Default for csv is , but with sep= you could specify an other seperator.

You could just implement this as csv. So you have a default of , but if the sep is defined you use that.

You file could look like:

apple, orange, tomato

or

sep=;
apple; orange; tomato

So if the first line starts with sep, it is an "option" line otherwise there are values. For tab you do sep=\t

Now users can define there own seperator and no guessing any more


After some comments of CBroe of easy to use for the user there could be some changes. csv only accepts one charachter as septerator so that system could be use like the above. cvs editor (like excel) will handle that for the user

If the user uses the tab it won't be a csv file but a .txt (for example). So you could change the default according to the file given.

Also I want to add, already pointed out in the comments, if you want to guess you will hit a point where it will occure it is wrong.

I don't know the setup of the files but csv lines need to be the same length (according to my memory). So what you could do is read out the first x lines. and use every seperator.

After that you check which lines lengths are the same, most likely that is your seperator (again guessing)

MKroeders
  • 7,562
  • 4
  • 24
  • 39
  • That would require modifying the CSV file though - maybe not convenient or even possible. – CBroe May 30 '13 at 12:25
  • Open a csv file with a text editor. You can also make a .txt file with this setup. – MKroeders May 30 '13 at 12:26
  • _"Open a csv file with a text editor."_ - wow, why didn't I think of that? `` - seriously, for reasons of convenience (_"What, I have to modify my file before uploading?"_) that might not be an option, and also not so tech-savy users might mess up the file while trying to do what they are asked ... so I'm not saying your answer is _wrong_ or not a possible solution - but just that it might create a bigger problem than it actually solves. – CBroe May 30 '13 at 12:29
  • As I said default of csv is `,`. If you change that value you are already messing with the csv. Also automaticly the csv will place that sep on top of the file. because csv can only handle 1 char as sperator, i never saw a tab. So if tab is used someone is already messing around with a text file. – MKroeders May 30 '13 at 12:32
  • _"Also automaticly the csv will place that sep on top of the file."_ - no, CSV is a _format_ and not some active component that modifies itself. If at all, it will be the application _generating_ that CSV file that will place that `sep` on top of the file - but I don't know all possible applications, and I bet neither do you ... so no guarantee it will be there. - _"So if tab is used someone is already messing around with a text file."_ - nonsense, many tools offer a choice of what separator to use when creating CSV output. – CBroe May 30 '13 at 12:35
  • How is tab not one character? (`\t` is just a convention for writing it in an environment where a "real" tab character can not be used or might lead to side effects, for example in PHP code when declaring a text literal - but as the real character, tab _is_ one character IMHO.) – CBroe May 30 '13 at 12:52
  • @Hendriq: "Default of csv is `,`" - not in Europe, for example. Here it's `;`, so if you try to open a CSV file with a German Excel version that was written with a US Excel version (or vice versa), you're in for a surprise... – Tim Pietzcker May 30 '13 at 13:24
  • @TimPietzcker In regards to the different versions of Excel, the OP could very well add a find & replace function. – Funk Forty Niner May 30 '13 at 13:42
  • @Fred: How would you do the replace command for the following European CSV file: `"1;2";3,4;"5;6,7"`? :) – Tim Pietzcker May 30 '13 at 13:44
  • @TimPietzcker Hm, you have a point there. I guess the OP will have to give explicit instructions to the people sending over the files, or put in a function to see if the files do contain those characters and return an error message. – Funk Forty Niner May 30 '13 at 13:48
  • 1
    @Fred: Or do what I proposed in my comment to the question above. – Tim Pietzcker May 30 '13 at 13:53
0

You can use this kind of pattern to check the csv structure and determine the separator:

if (preg_match('^(?:("[^"]++"|[^,;\t\n]++)(?<sep>[,\t;])(?1)(?:\n|$))++$', $csv_comma, $match))
    print_r($match['sep']);
Casimir et Hippolyte
  • 88,009
  • 5
  • 94
  • 125