If csv.Sniffer does not fit your needs, then following up on @twalberg's idea, here are two possible implementations for identifying the right delimiter: not just checking for the common delimiters like , ; and |, but universally identifying any kind of weird delimiter in CSV-like files.
The naive approach
The code below reads the CSV file's first 10 rows, grabs every non-alphanumeric character, and counts the frequencies.
It is based solely on the law of large numbers: the most frequent non-alphanumeric character should generally be the delimiter.
import re
from collections import Counter

def delimiter_detector(file_path):
    sample_data = ""
    with open(file_path, 'r') as file:
        # Not loading the whole CSV into memory, just the first 10 rows
        for _ in range(10):
            line = file.readline()
            if not line:  # readline() returns '' at EOF
                break
            sample_data += line.strip()
    # Collect every non-alphanumeric character and count the frequencies
    non_alnum_chars = re.findall(r'[^a-zA-Z0-9]', sample_data)
    delimiters_frequency = Counter(non_alnum_chars)
    if len(delimiters_frequency) == 0:
        return None
    # Find and return the most common non-alphanumeric character
    most_common_delimiter = delimiters_frequency.most_common(1)[0][0]
    return most_common_delimiter

print(delimiter_detector('test.csv'))
Of course this would break if, let's say, we have 5 columns delimited by | (4 occurrences per row), but also 5 or more . characters in each of the following 9 rows: {'|': 10*4, '.': 9*5}.
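To see the failure concretely (a minimal sketch with made-up numeric data; the column names and values are just placeholders), the character counts for such a file confirm that . outnumbers |:

from collections import Counter
import re

# Hypothetical 10-row sample: a header plus nine rows of decimal values
rows = ["c1|c2|c3|c4|c5"] + ["1.0|2.0|3.0|4.0|5.0"] * 9

counts = Counter(re.findall(r'[^a-zA-Z0-9]', "".join(rows)))
print(counts['|'], counts['.'])  # 40 vs. 45 -> the naive detector picks '.'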
Less naive approach
Hence, a better approach is to first count all the special characters in the header/first row, then do the same in the succeeding rows.
After identifying the special characters in the first row, check which of them has a frequency that matches the header's most often across the rest of the rows.
Continuing the above example: even in the worst case, if we have 4 | and 4 . in the header row (which would imply that either | or . could be the delimiter), checking the next n rows will generally show that '|': 4 is the most consistent match, while . and the other special characters vary.
import re
from collections import Counter

def frequency_counter(sample_data):
    # Count every non-alphanumeric character in a single row
    non_alnum_chars = re.findall(r'[^a-zA-Z0-9]', sample_data)
    return dict(Counter(non_alnum_chars))

def delimiter_detector(file_path):
    row_frequencies = []
    with open(file_path, 'r') as file:
        # Not loading the whole CSV into memory, just the first 10 rows
        for _ in range(10):
            line = file.readline()
            if not line:  # readline() returns '' at EOF
                break
            row_frequencies.append(frequency_counter(line.strip()))
    if len(row_frequencies) == 0:
        return None
    # Character frequencies of the header/first row
    header_row = row_frequencies[0]
    # A character is a potential delimiter if it appears in the header row
    # and its frequency in the data row equals its frequency in the header
    potential_delimiters = []
    for data_row in row_frequencies[1:]:
        for data_row_delim in data_row:
            if data_row_delim in header_row:
                if header_row[data_row_delim] == data_row[data_row_delim]:
                    potential_delimiters.append(data_row_delim)
    if not potential_delimiters:
        return None
    # Identify the character that matched the header most often
    most_common_delimiter = Counter(potential_delimiters).most_common()
    return most_common_delimiter[0][0]

print(delimiter_detector('test.csv'))
This approach works where the first, naive approach fails, for example on a file like the following (once there are enough data rows that the . characters outnumber the | delimiters in the sample):
c1|c2|c3|c4|c5
a.|b.|c.|d.|e.
a.|b.|c.|d.|e.
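A quick way to check this (a minimal sketch, reusing delimiter_detector from the second snippet above; 'demo.csv' is just an illustrative file name):

# Header plus nine data rows, as in the failure case above:
# 45 dots vs. 40 pipes, so the naive detector would return '.'
rows = ["c1|c2|c3|c4|c5"] + ["a.|b.|c.|d.|e."] * 9

with open('demo.csv', 'w') as f:
    f.write("\n".join(rows))

# The header-based detector still picks the real delimiter
print(delimiter_detector('demo.csv'))  # '|'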
The only case where this fails is when another special character is present in the header row and occurs in every other row exactly the same number of times as the actual delimiter:
c.1|c.2|c.3|c.4|c.5
a.|b.|c.|d.|e.
a.|b.|c.|d.|e.
In this case . is just as valid a delimiter as |. However, if there is even one row where the count is not exactly the same as in the header row, the latter approach will still identify the right delimiter. For that reason, it might be worth increasing the number of rows to check.
c.1|c.2|c.3|c.4|c.5
a.|b.|c.|d.|e.
a.|b.|c.|d.|e.
a.|b.|c.|d..|e.
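A quick sanity check (a minimal sketch, again reusing delimiter_detector from the second snippet; 'ambiguous.csv' is a made-up file name):

# The row containing 'd..' breaks the tie: '.' no longer matches the header
# frequency in every data row, while '|' still does, so '|' wins the Counter.
sample = (
    "c.1|c.2|c.3|c.4|c.5\n"
    "a.|b.|c.|d.|e.\n"
    "a.|b.|c.|d.|e.\n"
    "a.|b.|c.|d..|e.\n"
)

with open('ambiguous.csv', 'w') as f:
    f.write(sample)

print(delimiter_detector('ambiguous.csv'))  # '|'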
Some additional things to consider
The approaches above assume that the CSV file has a header row. If the file does not have a header, the method treats the first data row as the header, which can lead to mistakes in delimiter detection.
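If you are unsure whether a header is present at all, the standard library's csv.Sniffer offers a has_header() check you can run on a sample first (a minimal sketch; note that it may raise csv.Error if it cannot work out the dialect, e.g. with an exotic delimiter):

import csv

with open('test.csv', 'r') as f:
    sample = f.read(4096)

try:
    print(csv.Sniffer().has_header(sample))
except csv.Error:
    # Sniffer could not determine the dialect of the sample
    print("header detection failed")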
Another thing that frequently causes issues for me is the file encoding. Especially if you're working on Windows, Excel likes to save files with Windows-1250 encoding, so make sure you specify the right encoding when reading the file. To detect the encoding, you can use the chardet library.
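For example (a minimal sketch; chardet.detect works on raw bytes, and the 4096-byte sample size is arbitrary):

import chardet

# Read a raw byte sample and let chardet guess the encoding
with open('test.csv', 'rb') as f:
    raw_sample = f.read(4096)

detected = chardet.detect(raw_sample)       # e.g. {'encoding': 'Windows-1250', 'confidence': 0.87, ...}
encoding = detected['encoding'] or 'utf-8'  # fall back to UTF-8 if detection is inconclusive

# Pass the detected encoding to open(), e.g. inside delimiter_detector
with open('test.csv', 'r', encoding=encoding) as f:
    print(f.readline())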