18

I have seen multiple threads about what the best solution to auto detect the delimiter for an incoming CSV. Most of them are functions of length between 20 - 30 lines, multiple loops pre-determined list of delimiters, reading the first 5 lines and matching counts e.t.c e.t.c

Here is 1 example

I have just implemented this procedure, with a few modifications. Works brilliantly.

THEN I found the following code:

private function DetectDelimiter($fh)
{
    $data_1 = null;
    $data_2 = null;
    $delimiter = self::$delim_list['comma'];
    foreach(self::$delim_list as $key=>$value)
    {
        $data_1 = fgetcsv($fh, 4096, $value);
        $delimiter = sizeof($data_1) > sizeof($data_2) ? $key : $delimiter;
        $data_2 = $data_1;
    }

    $this->SetDelimiter($delimiter);
    return $delimiter;
}

This to me looks like it's achieving the SAME results, where $delim_list is an array of delimiters as follows:

static protected $delim_list = array('tab'=>"\t", 
                                     'semicolon'=>";", 
                                     'pipe'=>"|", 
                                     'comma'=>",");

Can anyone shed any light as to why I shouldn't do it this simpler way, and why everywhere I look the more convoluted solution seems to be the accepted answer?

Thanks!

Community
  • 1
  • 1
simon_www
  • 489
  • 2
  • 5
  • 13
  • I think this solution is more readable, and cleaner the the other example in link. – vaso123 Nov 03 '14 at 15:39
  • 1
    Am I the only one who thinks it's logically wrong to magically determine the delimiter? If the delimiter isn't a comma (as the name suggests - COMMA separated values), then look in the request for a specified delimiter. If none, abort the parsing and that's it until the valid info is sent. – N.B. Nov 03 '14 at 16:07
  • 1
    @N.B. you have a valid point. For this project I merely need it to guess before giving the user options. If using method 2 detailed above fails to give me any legit data then I will ask the user to specify their delimiters. What I like about method 1 however is that if it finds 2 or more matching delimiters, then I can inform the user and get them to choose from the options I think I've found, or suggest their own. – simon_www Nov 03 '14 at 16:58

7 Answers7

19

This function is elegant :)

/**
* @param string $csvFile Path to the CSV file
* @return string Delimiter
*/
public function detectDelimiter($csvFile)
{
    $delimiters = [";" => 0, "," => 0, "\t" => 0, "|" => 0];

    $handle = fopen($csvFile, "r");
    $firstLine = fgets($handle);
    fclose($handle); 
    foreach ($delimiters as $delimiter => &$count) {
        $count = count(str_getcsv($firstLine, $delimiter));
    }

    return array_search(max($delimiters), $delimiters);
}
Ahmed Bermawy
  • 2,290
  • 4
  • 35
  • 42
  • 1
    Nice! But it will return the wrong value of no delimiter is found (as it then just returns a random element from array) – Sliq Apr 21 '20 at 10:36
  • 2
    To detect if no delimiter is found one could add before the return statement: `if( array_sum( $delimiters ) <= count( $delimiters ) ) return false;` – Paul Naveda Dec 03 '20 at 22:19
  • 1
    You should be sure that the first line of the CSV file includes the labels of the cols, else it can fail gracefully. Scanning and comparing several lines is advised. – David Aug 27 '21 at 03:48
  • 1
    @PaulNaveda why the amount of defined delimiters should be helping to decide if the result is correct? In first place `max($delimiters)` just should be larger than 0. Comparing several lines can still give more clues which candidate is the correct one. – David Aug 27 '21 at 04:57
  • @David Could you please suggest a proper solution if you have one or a reference? thanks! – Braza Oct 14 '21 at 19:29
  • 2
    @Braza in a single row it might be really hard to detect it reliable if some extreme edge cases are tested. I created an Utility-Class which checks each row and returns the results for a whole file or at least several rows. I used some code from this page here. You can have a look here: https://gist.github.com/DavidBruchmann/1215dc4fb9b7bd339253de5b6e304909 – David Oct 14 '21 at 19:53
  • 1
    @David Amazing work there! Thanks man! This is totally worth being an Answer to this Question ;) – Braza Oct 15 '21 at 20:04
  • Another minor modification: `$_max = max($delimiters); $delimiters =array_diff(delimiters, array_diff_assoc(delimiters, array_unique(delimiters))); return $_max == 0 ? false : array_search($_max, $delimiters);`. So it will return false when the amount is same of the most counted results (also fixes the quantity 0 for all values) – Aldo Oct 26 '22 at 07:41
9

None of these worked for my use case. So I made some slight modifications.

   /**
    * @param string $filePath
    * @param int $checkLines
    * @return string
    */
   public function getCsvDelimiter(string $filePath, int $checkLines = 3): string
   {
      $delimiters =[",", ";", "\t"];

      $default =",";

       $fileObject = new \SplFileObject($filePath);
       $results = [];
       $counter = 0;
       while ($fileObject->valid() && $counter <= $checkLines) {
           $line = $fileObject->fgets();
           foreach ($delimiters as $delimiter) {
               $fields = explode($delimiter, $line);
               $totalFields = count($fields);
               if ($totalFields > 1) {
                   if (!empty($results[$delimiter])) {
                       $results[$delimiter] += $totalFields;
                   } else {
                       $results[$delimiter] = $totalFields;
                   }
               }
           }
           $counter++;
       }
       if (!empty($results)) {
           $results = array_keys($results, max($results));

           return $results[0];
       }
return $default;
}

Samuel James
  • 1,528
  • 16
  • 15
  • 1
    This works fine, I am using laravel and I really cant find the CSV delimeter checker... so I copied your code here.. thanks a lot ! – Jenuel Ganawed Dec 21 '21 at 00:16
8

Fixed version.

In your code, if a string has more than 1 delimiter you'll get a wrong result (example: val; string, with comma;val2;val3). Also if a file has 1 row (count of rows < count of delimiters).

Here is a fixed variant:

private function detectDelimiter($fh)
{
    $delimiters = ["\t", ";", "|", ","];
    $data_1 = null; $data_2 = null;
    $delimiter = $delimiters[0];
    foreach($delimiters as $d) {
        $data_1 = fgetcsv($fh, 4096, $d);
        if(sizeof($data_1) > sizeof($data_2)) {
            $delimiter = $d;
            $data_2 = $data_1;
        }
        rewind($fh);
    }

    return $delimiter;
}
participant
  • 2,923
  • 2
  • 23
  • 40
nzy
  • 161
  • 2
  • 3
  • Got countable error in php7.2. Changed $data_1 = null; $data_2 = null; to $data_1 = []; $data_2 = []; – Vit Mar 05 '21 at 10:44
2

In general, you cannot detect the delimiter for a text file. If there are additional hints, you need to implement them in your detection to be sure.

One particular problem with the suggested approach is that it will count the number of elements in different lines of the file. Suppose you had a file like this:

a;b;c;d
a   b;  c   d
this|that;here|there
It's not ready, yet.; We have to wait for peter, paul, and mary.; They will know what to do

Although this seems to be separated by a semicolon, your approach will return comma.

andy
  • 2,002
  • 1
  • 12
  • 21
  • 1
    even on common csv reader (i.e openoffice, excel) differents delimiters in same file will reach to an obscure error – bastien Jul 04 '19 at 09:22
  • 2
    True and exactly my point here: If you want to properly read csv, ask the user to specify the delimiter. – andy Jul 15 '20 at 14:43
1

Another one (built by combining lots of answers I found on the internet:

/**
 * Detects the delimiter of a CSV file (can be semicolon, comma or pipe) by trying every delimiter, then
 * counting how many potential columns could be found with this delimiter and removing the delimiter from array of
 * only one columns could be created (without a working limiter you'll always have "one" column: the entire row).
 * The delimiter that created the most columns is returned.
 *
 * @param string $pathToCSVFile path to the CSV file
 * @return string|null nullable delimiter
 * @throws \Exception
 */
public static function detectDelimiter(string $pathToCSVFile): ?string
{
    $delimiters = [
        ';' => 0,
        ',' => 0,
        "|" => 0,
    ];

    $handle = fopen($pathToCSVFile, 'r');
    $firstLine = fgets($handle);
    fclose($handle);

    foreach ($delimiters as $delimiterCharacter => $delimiterCount) {
        $foundColumnsWithThisDelimiter = count(str_getcsv($firstLine, $delimiterCharacter));
        if ($foundColumnsWithThisDelimiter > 1) {
            $delimiters[$delimiterCharacter] = $foundColumnsWithThisDelimiter;
        }else {
            unset($delimiters[$delimiterCharacter]);
        }
    }

    if (!empty($delimiters)) {
        return array_search(max($delimiters), $delimiters);
    } else {
        throw new \Exception('The CSV delimiter could not been found. Should be semicolon, comma or pipe!');
    }
}

And the according Unit Tests (you'll have to add custom test.csv files):

/**
 * Test the delimiter detector
 *
 * @test
 */
public function testDetectDelimiter()
{
    $this->assertEquals(',', Helper::detectDelimiter('test1.csv'));
    $this->assertEquals(';', Helper::detectDelimiter('test-csv-with-semicolon-delimiter.csv'));
    $this->assertEquals('|', Helper::detectDelimiter('test-csv-with-pipe-delimiter.csv'));

    $this->expectExceptionMessage('The CSV delimiter could not been found. Should be semicolon, comma or pipe!');
    Helper::detectDelimiter('test-csv-with-failing-delimiter.csv');
}
Sliq
  • 15,937
  • 27
  • 110
  • 143
0

Ok, this one parses a single line of your CSV (usually you take the first) and throws Exceptions if multiple delimeters are possible or no one matches. According to that, it also looks that the delimeter you want to test against is not in a quoted string or escaped.

    public function getDelimiter(string $content, $throwExceptionOnNonUnique = true, $expectSingleColumn = false): string
    {
        // Would be cleaner if you pass the delimiters from outside
        // as also the order matters in the special case you've got something like "a,b;c"
        // and you don't throw the exception - then the first match is preferred
        // But for StackOverflow I put them inside
        $delimiters = ["\t", ";", "|", ","];
        $result = ',';
        $maxCount = 0;

        foreach ($delimiters as $delimiter) {
            // Impress your code reviewer by some badass regex ;)
            $pattern = "/(?<!\\\)(?:\\\\\\\)*(?!\B\"[^\\\"]*)\\" . $delimiter . "(?![^\"]*\\\"\B)/";
            $amount = preg_match_all($pattern, $content);

            if ($maxCount > 0 && $amount > 0 && $throwExceptionOnNonUnique) {
                $msg = 'Identifier is not clear: "' . $result . '" and "' . $delimiter . '" are possible';
                throw new \Exception($msg);
            }

            if ($amount > $maxCount) {
                $maxCount = $amount;
                $result = $delimiter;
            }
        }

        // If nothing matches and you don't expect that just the CSV just
        // consists of one single column without a delimeter at the end
        if ($maxCount === 0 && !$expectSingleColumn) {
            throw new \Exception('Unknown delimiter');
        }

        return $result;
    }

P.S: Also unit tested - but I don't want to paste 100+ lines of the tests here ;)

bibamann
  • 2,469
  • 1
  • 11
  • 9
-1

Here's the shortest version to detect the CSV delimiter by using the SplFileObject class with the method getCsvControl and making use of the array dereference.

There's a catch though, the below function it will work only if the delimiter was set manually using the setCsvControl() function, otherwise, use one of the top rated answers.

// SplFileObject::getCsvControl — Get the delimiter, enclosure and escape character for CSV
function detectDelimiter($csvFile){
    if(!file_exists($csvFile) || !is_readable($csvFile)){
        return false;
    }   
    $file = new SplFileObject($csvFile);
    return $file->getCsvControl()[0]; 
}
nersi1
  • 109
  • 4
  • 1
    Thanks for taking the time to answer, but the question is asking if there's a better way to detect the delimiter in a file if you don't already know it. If you've used `setCSVControl()` or used the defaults, then you already know what it is. So I'm honestly curious what problem this answer is solvijng. – pbarney Jun 19 '20 at 19:42