16

how to find out if csv file fields are tab delimited or comma delimited. I need php validation for this. Can anyone plz help. Thanks in advance.

SowmyAnil
  • 177
  • 1
  • 1
  • 3

15 Answers15

37

It's too late to answer this question but hope it will help someone.

Here's a simple function that will return a delimiter of a file.

function getFileDelimiter($file, $checkLines = 2){
        $file = new SplFileObject($file);
        $delimiters = array(
          ',',
          '\t',
          ';',
          '|',
          ':'
        );
        $results = array();
        $i = 0;
         while($file->valid() && $i <= $checkLines){
            $line = $file->fgets();
            foreach ($delimiters as $delimiter){
                $regExp = '/['.$delimiter.']/';
                $fields = preg_split($regExp, $line);
                if(count($fields) > 1){
                    if(!empty($results[$delimiter])){
                        $results[$delimiter]++;
                    } else {
                        $results[$delimiter] = 1;
                    }   
                }
            }
           $i++;
        }
        $results = array_keys($results, max($results));
        return $results[0];
    }

Use this function as shown below:

$delimiter = getFileDelimiter('abc.csv'); //Check 2 lines to determine the delimiter
$delimiter = getFileDelimiter('abc.csv', 5); //Check 5 lines to determine the delimiter

P.S I have used preg_split() instead of explode() because explode('\t', $value) won't give proper results.

UPDATE: Thanks for @RichardEB pointing out a bug in the code. I have updated this now.

Jay Bhatt
  • 5,601
  • 5
  • 40
  • 62
  • $line=.... should be replaced with: if(!$line = $file->fgets()){break;} If $checkLines exceeds the csv file's line count then there will be an error. – Richard EB Dec 08 '15 at 10:27
  • No problem, except your fix still throws a 'Cannot read from file' exception when $checkLines exceeds the number of csv lines. If you want to build the check into the while loop, it should be: while($file->valid() && $i < $checkLines) – Richard EB Dec 09 '15 at 12:32
  • 'Works for me' isn't a good argument against incorrect boolean logic (e.g., your error handling config may be different). Your while expression uses OR not AND, and so 'end of file' will be ignored when $i is below $checkLines. E.g., in CSV file of 1 row, $i will be allowed to count to three rows regardless of the EOF being reached, just because (2<=2)=TRUE – Richard EB Dec 10 '15 at 10:07
  • What you have done will not break the loop if the end of the file has been reached before the 'user specified lines' have been reached, e.g., if you run your code as-is on a CSV file with only 1 line, then PHP throws the error 'Cannot read from file'. This will be my last comment on this, don't worry (I saw your previous angry comments!). – Richard EB Dec 10 '15 at 10:32
  • 1
    @RichardEB Sorry mate. Didn't meant to be rude. Apologies for the angry comment. I am just having a bad day. I have updated my answer now. Thanks for pointing this out. I was testing a newer version of this code which didn't had this problem and could not find out what was wrong. – Jay Bhatt Dec 10 '15 at 10:43
  • Use `$file = new \SplFileObject($file);` while working with namespaces . – Ashwani Panwar Dec 30 '15 at 10:44
  • no, it's never too late for good answer, I'm looking for it right now – vladkras Jan 29 '19 at 09:26
  • There's no easy way to do this. Jay's answer is fine and fast, but note `$results = array_keys($results, max($results));` When several delimiters share the same count, the above line will only return the first one found in the $delimiters array. – Hugues D Nov 15 '19 at 01:46
13

Here's what I do.

  1. Parse the first 5 lines of a CSV file
  2. Count the number of delimiters [commas, tabs, semicolons and colons] in each line
  3. Compare the number of delimiters in each line. If you have a properly formatted CSV, then one of the delimiter counts will match in each row.

This will not work 100% of the time, but it is a decent starting point. At minimum, it will reduce the number of possible delimiters (making it easier for your users to select the correct delimiter).

/* Rearrange this array to change the search priority of delimiters */
$delimiters = array('tab'       => "\t",
                'comma'     => ",",
                'semicolon' => ";"
                );

$handle = file( $file );    # Grabs the CSV file, loads into array

$line = array();            # Stores the count of delimiters in each row

$valid_delimiter = array(); # Stores Valid Delimiters

# Count the number of Delimiters in Each Row
for ( $i = 1; $i < 6; $i++ ){
foreach ( $delimiters as $key => $value ){
    $line[$key][$i] = count( explode( $value, $handle[$i] ) ) - 1;
}
}


# Compare the Count of Delimiters in Each line
foreach ( $line as $delimiter => $count ){

# Check that the first two values are not 0
if ( $count[1] > 0 and $count[2] > 0 ){
    $match = true;

    $prev_value = '';
    foreach ( $count as $value ){

        if ( $prev_value != '' )
            $match = ( $prev_value == $value and $match == true ) ? true : false;

        $prev_value = $value;
    }

} else { 
    $match = false;
}

if ( $match == true )    $valid_delimiter[] = $delimiter;

}//foreach

# Set Default delimiter to comma
$delimiter = ( $valid_delimiter[0] != '' ) ? $valid_delimiter[0] : "comma";


/*  !!!! This is good enough for my needs since I have the priority set to "tab"
!!!! but you will want to have to user select from the delimiters in $valid_delimiter
!!!! if multiple dilimiter counts match
*/

# The Delimiter for the CSV
echo $delimiters[$delimiter]; 
Dream Ideation
  • 153
  • 1
  • 6
  • 1
    Works nicely. I used it as a function before parsing my files. The return value of it,**$delimiters[$delimiter]** I used it in **str_getcsv($value, $delimeter);** – Mohammed Joraid Jan 05 '15 at 06:58
12

There is no 100% reliable way to detemine this. What you can do is

  • If you have a method to validate the fields you read, try to read a few fields using either separator and validate against your method. If it breaks, use another one.
  • Count the occurrence of tabs or commas in the file. Usually one is significantly higher than the other
  • Last but not least: Ask the user, and allow him to override your guesses.
relet
  • 6,819
  • 2
  • 33
  • 41
7

I'm just counting the occurrences of the different delimiters in the CSV file, the one with the most should probably be the correct delimiter:

//The delimiters array to look through
$delimiters = array(
    'semicolon' => ";",
    'tab'       => "\t",
    'comma'     => ",",
);

//Load the csv file into a string
$csv = file_get_contents($file);
foreach ($delimiters as $key => $delim) {
    $res[$key] = substr_count($csv, $delim);
}

//reverse sort the values, so the [0] element has the most occured delimiter
arsort($res);

reset($res);
$first_key = key($res);

return $delimiters[$first_key]; 
Jacta
  • 507
  • 6
  • 17
Thomas Lang
  • 491
  • 3
  • 13
5

In my situation users supply csv files which are then entered into an SQL database. They may save an Excel Spreadsheet as comma or tab delimited files. A program converting the spreadsheet to SQL needs to automatically identify whether fields are tab separated or comma

Many Excel csv export have field headings as the first line. The heading test is unlikely to contain commas except as a delimiter. For my situation I counted the commas and tabs of the first line and use that with the greater number to determine if it is csv or tab

keir
  • 51
  • 1
  • 1
2

Thanks for all your inputs, I made mine using your tricks : preg_split, fgetcsv, loop, etc.

But I implemented something that was surprisingly not here, the use of fgets instead of reading the whole file, way better if the file is heavy!

Here's the code :

ini_set("auto_detect_line_endings", true);
function guessCsvDelimiter($filePath, $limitLines = 5) {
    if (!is_readable($filePath) || !is_file($filePath)) {
        return false;
    }

    $delimiters = array(
        'tab'       => "\t",
        'comma'     => ",",
        'semicolon' => ";"
    );

    $fp = fopen($filePath, 'r', false);
    $lineResults = array(
        'tab'       => array(),
        'comma'     => array(),
        'semicolon' => array()
    );

    $lineIndex = 0;
    while (!feof($fp)) {
        $line = fgets($fp);

        foreach ($delimiters as $key=>$delimiter) {
            $lineResults[$key][$lineIndex] = count (fgetcsv($fp, 1024, $delimiter)) - 1;
        }

        $lineIndex++;
        if ($lineIndex > $limitLines) break;
    }
    fclose($fp);

    // Calculating average
    foreach ($lineResults as $key=>$entry) {
        $lineResults[$key] = array_sum($entry)/count($entry);
    }

    arsort($lineResults);
    reset($lineResults);
    return ($lineResults[0] !== $lineResults[1]) ? $delimiters[key($lineResults)] : $delimiters['comma'];
}
Cyril N.
  • 38,875
  • 36
  • 142
  • 243
2

I used @Jay Bhatt's solution for finding out a csv file's delimiter, but it didn't work for me, so I applied a few fixes and comments for the process to be more understandable.

See my version of @Jay Bhatt's function:

function decide_csv_delimiter($file, $checkLines = 10) {

    // use php's built in file parser class for validating the csv or txt file
    $file = new SplFileObject($file);

    // array of predefined delimiters. Add any more delimiters if you wish
    $delimiters = array(',', '\t', ';', '|', ':');

    // store all the occurences of each delimiter in an associative array
    $number_of_delimiter_occurences = array();

    $results = array();

    $i = 0; // using 'i' for counting the number of actual row parsed
    while ($file->valid() && $i <= $checkLines) {

        $line = $file->fgets();

        foreach ($delimiters as $idx => $delimiter){

            $regExp = '/['.$delimiter.']/';
            $fields = preg_split($regExp, $line);

            // construct the array with all the keys as the delimiters
            // and the values as the number of delimiter occurences
            $number_of_delimiter_occurences[$delimiter] = count($fields);

        }

       $i++;
    }

    // get key of the largest value from the array (comapring only the array values)
    // in our case, the array keys are the delimiters
    $results = array_keys($number_of_delimiter_occurences, max($number_of_delimiter_occurences));


    // in case the delimiter happens to be a 'tab' character ('\t'), return it in double quotes
    // otherwise when using as delimiter it will give an error,
    // because it is not recognised as a special character for 'tab' key,
    // it shows up like a simple string composed of '\' and 't' characters, which is not accepted when parsing csv files
    return $results[0] == '\t' ? "\t" : $results[0];
}

I personally use this function for helping automatically parse a file with PHPExcel, and it works beautifully and fast.

I recommend parsing at least 10 lines, for the results to be more accurate. I personally use it with 100 lines, and it is working fast, no delays or lags. The more lines you parse, the more accurate the result gets.

NOTE: This is just a modifed version of @Jay Bhatt's solution to the question. All credits goes to @Jay Bhatt.

Aaron Marton
  • 131
  • 1
  • 5
1

When I output a TSV file I author the tabs using \t the same method one would author a line break like \n so that being said I guess a method could be as follows:

<?php
$mysource = YOUR SOURCE HERE, file_get_contents() OR HOWEVER YOU WISH TO GET THE SOURCE;
 if(strpos($mysource, "\t") > 0){
   //We have a tab separator
 }else{
   // it might be CSV
 }
?>

I Guess this may not be the right manner, because you could have tabs and commas in the actual content as well. It's just an idea. Using regular expressions may be better, although I am not too clued up on that.

SimonDowdles
  • 2,026
  • 4
  • 24
  • 36
1

you can simply use the fgetcsv(); PHP native function in this way:

function getCsvDelimeter($file)
{
    if (($handle = fopen($file, "r")) !== FALSE) {
        $delimiters = array(',', ';', '|', ':'); //Put all that need check

        foreach ($delimiters AS $item) {
            //fgetcsv() return array with unique index if not found the delimiter
            if (count(fgetcsv($handle, 0, $item, '"')) > 1) {
                $delimiter = $item;

                break;
            }
        }
    }

    return (isset($delimiter) ? $delimiter : null);
}
0

Aside from the trivial answer that c sv files are always comma-separated - it's in the name, I don't think you can come up with any hard rules. Both TSV and CSV files are sufficiently loosely specified that you can come up with files that would be acceptable as either.

A\tB,C
1,2\t3

(Assuming \t == TAB)

How would you decide whether this is TSV or CSV?

Douglas Leeder
  • 52,368
  • 9
  • 94
  • 137
  • 2
    The c can stand for 'character', which makes , ; or \t all valid options. – julesj Jul 02 '12 at 15:52
  • @julesj A quick search suggests very few using the c to mean character, and most meaning comma. Although even then the separator isn't always a comma, just to complicate things. – Douglas Leeder Jul 02 '12 at 19:41
-1

You also can use fgetcsv (http://php.net/manual/en/function.fgetcsv.php) passing it a delimiter parameter. If the function returns false it means that the $delimiter parameter wasn't the right one

sample to check if the delimiter is ';'

if (($data = fgetcsv($your_csv_handler, 1000, ';')) !== false) { $csv_delimiter = ';'; }
Rommy
  • 497
  • 1
  • 4
  • 16
-1

How about something simple?

function findDelimiter($filePath, $limitLines = 5){
    $file = new SplFileObject($filePath);
    $delims = $file->getCsvControl();
    return $delims[0];
}
chaoskreator
  • 889
  • 1
  • 17
  • 39
  • 4
    This doesn't work because getCsvControl() only returns what's been set manually using the setCsvControl() function. It doesn't do any magic guesswork. – ethan May 05 '15 at 18:29
-2

This is my solution. Its works if you know how many columns you expect. Finally, the separator character is the $actual_separation_character

$separator_1=",";
$separator_2=";";
$separator_3="\t";
$separator_4=":";
$separator_5="|";

$separator_1_number=0;
$separator_2_number=0;
$separator_3_number=0;
$separator_4_number=0;
$separator_5_number=0;

/* YOU NEED TO CHANGE THIS VARIABLE */
// Expected number of separation character ( 3 colums ==> 2 sepearation caharacter / row )
$expected_separation_character_number=2;  


$file = fopen("upload/filename.csv","r");
while(! feof($file)) //read file rows
{
    $row= fgets($file);

    $row_1_replace=str_replace($separator_1,"",$row);
    $row_1_length=strlen($row)-strlen($row_1_replace);

    if(($row_1_length==$expected_separation_character_number)or($expected_separation_character_number==0)){
    $separator_1_number=$separator_1_number+$row_1_length;
    }

    $row_2_replace=str_replace($separator_2,"",$row);
    $row_2_length=strlen($row)-strlen($row_2_replace);

    if(($row_2_length==$expected_separation_character_number)or($expected_separation_character_number==0)){
    $separator_2_number=$separator_2_number+$row_2_length;
    }

    $row_3_replace=str_replace($separator_3,"",$row);
    $row_3_length=strlen($row)-strlen($row_3_replace);

    if(($row_3_length==$expected_separation_character_number)or($expected_separation_character_number==0)){
    $separator_3_number=$separator_3_number+$row_3_length;
    }

    $row_4_replace=str_replace($separator_4,"",$row);
    $row_4_length=strlen($row)-strlen($row_4_replace);

    if(($row_4_length==$expected_separation_character_number)or($expected_separation_character_number==0)){
    $separator_4_number=$separator_4_number+$row_4_length;
    }

    $row_5_replace=str_replace($separator_5,"",$row);
    $row_5_length=strlen($row)-strlen($row_5_replace);

    if(($row_5_length==$expected_separation_character_number)or($expected_separation_character_number==0)){
    $separator_5_number=$separator_5_number+$row_5_length;
    }

} // while(! feof($file))  END
fclose($file);

/* THE FILE ACTUAL SEPARATOR (delimiter) CHARACTER */
/* $actual_separation_character */

if ($separator_1_number==max($separator_1_number,$separator_2_number,$separator_3_number,$separator_4_number,$separator_5_number)){$actual_separation_character=$separator_1;}
else if ($separator_2_number==max($separator_1_number,$separator_2_number,$separator_3_number,$separator_4_number,$separator_5_number)){$actual_separation_character=$separator_2;}
else if ($separator_3_number==max($separator_1_number,$separator_2_number,$separator_3_number,$separator_4_number,$separator_5_number)){$actual_separation_character=$separator_3;}
else if ($separator_4_number==max($separator_1_number,$separator_2_number,$separator_3_number,$separator_4_number,$separator_5_number)){$actual_separation_character=$separator_4;}
else if ($separator_5_number==max($separator_1_number,$separator_2_number,$separator_3_number,$separator_4_number,$separator_5_number)){$actual_separation_character=$separator_5;}
else {$actual_separation_character=";";}

/* 
if the number of columns more than what you expect, do something ...
*/

if ($expected_separation_character_number>0){
if ($separator_1_number==0 and $separator_2_number==0 and $separator_3_number==0 and $separator_4_number==0 and $separator_5_number==0){/* do something ! more columns than expected ! */}
}
-3

If you have a very large file example in GB, head the first few line, put in a temporary file. Open the temporary file in vi

head test.txt > te1
vi te1
user45270
  • 69
  • 2
  • 4
-5

Easiest way I answer this is open it in a plain text editor, or in TextMate.

kaleazy
  • 5,922
  • 2
  • 47
  • 51