14

In most cases, CSV files are text files with records delimited by commas. However, sometimes these files will come semicolon delimited. (Excel will use semicolon delimiters when saving CSVs if the regional settings has the decimal separator set as the comma -- this is common in Europe. Ref: http://en.wikipedia.org/wiki/Comma-separated_values#Application_support)

My question is, what is the best way to have a program guess whether to have it comma or semicolon separated?

e.g. a line like 1,1;1,1 may be ambiguous. It could be interpreted comma delimited as: 1 1;1 (a string) 1

or semicolon delimited as 1,1 1,1

My best guess so far is to try parsing the file both with , and ; delimiters, then choose the parse that has the most rows of the same length as the first row (usually a header row). If both have the same number of rows, choose the one with more columns. The main disadvantage of this is the extra overhead.

Thoughts?

Polemarch
  • 1,646
  • 14
  • 10
  • 3
    Count the number of delimiters in the file. It's a pretty sure bet that whichever delimiter is used the most is the actual delimiter. – Robert Harvey May 07 '10 at 15:32
  • 6
    One possible additional check is to see if splitting on a separator produces an equal amount of segments per row. – Alexander Torstling May 07 '10 at 15:40
  • Do you have any information as to the contents of the file? If so, you can probably use that to your advantage. If not, there's really no way to tell, and your best bet is to ask the user. – tloflin May 07 '10 at 15:43
  • 1
    I think your approach is a good one but, you can probably save time by limiting the comparison to the first 50 or so rows and if possible you should include some "failed to guess" threshold if for instance both delimiters seem plausible (eg. both delimiters parse as a rectangular table with more than 1 column) – Keith May 09 '10 at 15:19
  • @disown, that is exactly how I would handle it as well. – Mark Tomlin Aug 03 '10 at 03:19
  • 2
    Since you cannot guess that always correctly you should give the user a chance to override. – Henry Dec 13 '15 at 17:00
  • Note that for UTF16 encoded csv, Excel is defaulting to tab-separated values (at least for my language settings, you never know what others get) and is opening the file with all text in first column if I try comma or semicolon. – grek40 Feb 03 '17 at 08:40

5 Answers5

2

If every row should have the same number of columns, which I believe is the case with Excel, then, using both commas and semicolons, figure out the number of columns for lines N and N+1. Whichever method (commas or semicolons) produces a different answer is wrong (not the format of the file). You can start at the beginning and you only have to go until one of them is proven incorrect. You don't need header lines or anything. You don't have to read more of the file than is necessary, and it can't ever give you a wrong answer for the format of the file, it just might reach the end and not yet have come to a conclusion. All you need is for the every row has the same number of columns property to hold.

Jay Kominek
  • 8,674
  • 1
  • 34
  • 51
  • Hmm, thank you! I think the assumption that every row has the same number of columns may not be valid in some cases (I have painfully discovered that Excel sometimes doesn't follow this when the last column is blank), but your insight on not treating the header row as special is helpful (avoids assumptions there). Maybe the best thing to do is to try with both methods and pick the one that has the most rows that agree on the # of columns. – Polemarch Aug 07 '10 at 15:08
1

Depending on what you are working with, if you will guaranteeing have a header row, your approach of trying both, could be the best overall practice. Then once you determine what is going on, if you get to a row further down that doesn't have the required number of columns then you know that the format isn't correct.

Typically i would see this as a user specified option on upload, rather than a programmatic test.

Mitchel Sellers
  • 62,228
  • 14
  • 110
  • 173
1

You can read the first line

FileReader fileReader = new FileReader(filePath);
    BufferedReader bufferedReader = new BufferedReader(fileReader);
    String s = bufferedReader.readLine();
    String substring = s.substring(s.indexOf(firstColumnName) + 3, s.indexOf(firstColumnName) + 4);
    bufferedReader.close();
    fileReader.close();
    substring.charAt(0);

Then you capture this value

substring.charAt(0)

depending on whether the CSV is comma or semicolon can use the last value

eluish192
  • 145
  • 1
  • 10
0

This is my code (no validation on text)... perhaps it could help or make a base :-) !

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using MoreLinq; // http://stackoverflow.com/questions/15265588/how-to-find-item-with-max-value-using-linq

namespace HQ.Util.General.CSV
{
    public class CsvHelper
    {
        public static Dictionary<LineSeparator, Func<string, string[]>>  DictionaryOfLineSeparatorAndItsFunc = new Dictionary<LineSeparator, Func<string, string[]>>();

        static CsvHelper()
        {
            DictionaryOfLineSeparatorAndItsFunc[LineSeparator.Unknown] = ParseLineNotSeparated;
            DictionaryOfLineSeparatorAndItsFunc[LineSeparator.Tab] = ParseLineTabSeparated;
            DictionaryOfLineSeparatorAndItsFunc[LineSeparator.Semicolon] = ParseLineSemicolonSeparated;
            DictionaryOfLineSeparatorAndItsFunc[LineSeparator.Comma] = ParseLineCommaSeparated;
        }

        // ******************************************************************
        public enum LineSeparator
        {
            Unknown = 0,
            Tab,
            Semicolon,
            Comma
        }

        // ******************************************************************
        public static LineSeparator GuessCsvSeparator(string oneLine)
        {
            List<Tuple<LineSeparator, int>> listOfLineSeparatorAndThereFirstLineSeparatedValueCount = new List<Tuple<LineSeparator, int>>();

            listOfLineSeparatorAndThereFirstLineSeparatedValueCount.Add(new Tuple<LineSeparator, int>(LineSeparator.Tab, CsvHelper.ParseLineTabSeparated(oneLine).Count()));
            listOfLineSeparatorAndThereFirstLineSeparatedValueCount.Add(new Tuple<LineSeparator, int>(LineSeparator.Semicolon, CsvHelper.ParseLineSemicolonSeparated(oneLine).Count()));
            listOfLineSeparatorAndThereFirstLineSeparatedValueCount.Add(new Tuple<LineSeparator, int>(LineSeparator.Comma, CsvHelper.ParseLineCommaSeparated(oneLine).Count()));

            Tuple<LineSeparator, int> bestBet = listOfLineSeparatorAndThereFirstLineSeparatedValueCount.MaxBy((n)=>n.Item2);

            if (bestBet != null && bestBet.Item2 > 1)
            {
                return bestBet.Item1;
            }

            return LineSeparator.Unknown;
        }

        // ******************************************************************
        public static string[] ParseLineCommaSeparated(string line)
        {
            // CSV line parsing : From "jgr4" in http://www.kimgentes.com/worshiptech-web-tools-page/2008/10/14/regex-pattern-for-parsing-csv-files-with-embedded-commas-dou.html
            var matches = Regex.Matches(line, @"\s?((?<x>(?=[,]+))|""(?<x>([^""]|"""")+)""|""(?<x>)""|(?<x>[^,]+)),?",
                                        RegexOptions.ExplicitCapture);

            string[] values = (from Match m in matches
                               select m.Groups["x"].Value.Trim().Replace("\"\"", "\"")).ToArray();

            return values;
        }

        // ******************************************************************
        public static string[] ParseLineTabSeparated(string line)
        {
            var matchesTab = Regex.Matches(line, @"\s?((?<x>(?=[\t]+))|""(?<x>([^""]|"""")+)""|""(?<x>)""|(?<x>[^\t]+))\t?",
                            RegexOptions.ExplicitCapture);

            string[] values = (from Match m in matchesTab
                                select m.Groups["x"].Value.Trim().Replace("\"\"", "\"")).ToArray();

            return values;
        }

        // ******************************************************************
        public static string[] ParseLineSemicolonSeparated(string line)
        {
            // CSV line parsing : From "jgr4" in http://www.kimgentes.com/worshiptech-web-tools-page/2008/10/14/regex-pattern-for-parsing-csv-files-with-embedded-commas-dou.html
            var matches = Regex.Matches(line, @"\s?((?<x>(?=[;]+))|""(?<x>([^""]|"""")+)""|""(?<x>)""|(?<x>[^;]+));?",
                                        RegexOptions.ExplicitCapture);

            string[] values = (from Match m in matches
                               select m.Groups["x"].Value.Trim().Replace("\"\"", "\"")).ToArray();

            return values;
        }

        // ******************************************************************
        public static string[] ParseLineNotSeparated(string line)
        {
            string [] lineValues = new string[1];
            lineValues[0] = line;
            return lineValues;
        }

        // ******************************************************************
        public static List<string[]> ParseText(string text)
        {
            string[] lines = text.Split(new string[] { "\r\n" }, StringSplitOptions.None);
            return ParseString(lines);
        }

        // ******************************************************************
        public static List<string[]> ParseString(string[] lines)
        {
            List<string[]> result = new List<string[]>();

            LineSeparator lineSeparator = LineSeparator.Unknown;
            if (lines.Any())
            {
                lineSeparator = GuessCsvSeparator(lines[0]);
            }

            Func<string, string[]> funcParse = DictionaryOfLineSeparatorAndItsFunc[lineSeparator];

            foreach (string line in lines)
            {
                if (string.IsNullOrWhiteSpace(line))
                {
                    continue;
                }

                result.Add(funcParse(line));
            }

            return result;
        }

        // ******************************************************************
    }
}
Eric Ouellet
  • 10,996
  • 11
  • 84
  • 119
0

Let's say you have the following in your csv:

title,url,date,copyright,hdurl,explanation,media_type,service_version

then you can use python's in-built CSV module as follows:

import csv
data = "title,url,date,copyright,hdurl,explanation,media_type,service_version"
sn = csv.Sniffer()
delimiter = sn.sniff(data).delimiter

Printing the variable named delimiter will return ',' and this is the delimiter here. You can test by using some different delimiters.

dreygur
  • 300
  • 3
  • 12