If csv.Sniffer does not fit your needs, then following up on @twalberg's idea, here are two possible implementations for identifying the right delimiter: not just checking for the common delimiters like , ; and |, but universally identifying any kind of weird delimiter in CSV-like files.
The naive approach
The code below reads the CSV file's first 10 rows, grabs every non-alphanumeric character, and counts the frequencies.
It is based solely on the law of large numbers: the most frequent non-alphanumeric character should generally be the delimiter.
import re
from collections import Counter

def delimiter_detector(file_path):
    sample_data = ""
    with open(file_path, 'r') as file:
        # Not loading the whole CSV into memory, just the first 10 rows
        for _ in range(10):
            line = file.readline()
            if not line:  # readline() returns '' at EOF
                break
            sample_data += line.strip()
    # Collect every non-alphanumeric character and count the frequencies
    non_alnum_chars = re.findall(r'[^a-zA-Z0-9]', sample_data)
    delimiters_frequency = Counter(non_alnum_chars)
    if len(delimiters_frequency) == 0:
        return None
    # Find and return the most common non-alphanumeric character
    most_common_delimiter = delimiters_frequency.most_common(1)[0][0]
    return most_common_delimiter

print(delimiter_detector('test.csv'))
Of course this would break if, let's say, we have 5 columns delimited by | (4 occurrences per row), but also 5 or more . characters in each of the following 9 rows: {'|': 10*4, '.': 9*5}.
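To see the failure concretely (a minimal sketch with made-up numeric data; the column names and values are just placeholders), the character counts for such a file confirm that . outnumbers |:

from collections import Counter
import re

# Hypothetical 10-row sample: a header plus nine rows of decimal values
rows = ["c1|c2|c3|c4|c5"] + ["1.0|2.0|3.0|4.0|5.0"] * 9

counts = Counter(re.findall(r'[^a-zA-Z0-9]', "".join(rows)))
print(counts['|'], counts['.'])  # 40 vs. 45 -> the naive detector picks '.'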
Less naive approach
Hence, a better approach is to first count all the special characters in the header/first row, then do the same in the succeeding rows.
After identifying the special characters in the first row, check which of them has a frequency that matches the header's most often across the rest of the rows.
Continuing the above example: even in the worst case, if we have 4 | and 4 . in the header row (which would imply that either | or . could be the delimiter), checking the next n rows will generally show that '|': 4 is the most consistent match, while . and the other special characters vary.
import re
from collections import Counter

def frequency_counter(sample_data):
    # Count every non-alphanumeric character in a single row
    non_alnum_chars = re.findall(r'[^a-zA-Z0-9]', sample_data)
    return dict(Counter(non_alnum_chars))

def delimiter_detector(file_path):
    row_frequencies = []
    with open(file_path, 'r') as file:
        # Not loading the whole CSV into memory, just the first 10 rows
        for _ in range(10):
            line = file.readline()
            if not line:  # readline() returns '' at EOF
                break
            row_frequencies.append(frequency_counter(line.strip()))
    if len(row_frequencies) == 0:
        return None
    # Character frequencies of the header/first row
    header_row = row_frequencies[0]
    # A character is a potential delimiter if it appears in the header row
    # and its frequency in the data row equals its frequency in the header
    potential_delimiters = []
    for data_row in row_frequencies[1:]:
        for data_row_delim in data_row:
            if data_row_delim in header_row:
                if header_row[data_row_delim] == data_row[data_row_delim]:
                    potential_delimiters.append(data_row_delim)
    if not potential_delimiters:
        return None
    # Identify the character that matched the header most often
    most_common_delimiter = Counter(potential_delimiters).most_common()
    return most_common_delimiter[0][0]

print(delimiter_detector('test.csv'))
This approach works where the first, naive approach fails, for example on a file like the following (once there are enough data rows that the . characters outnumber the | delimiters in the sample):
c1|c2|c3|c4|c5
a.|b.|c.|d.|e.
a.|b.|c.|d.|e.
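A quick way to check this (a minimal sketch, reusing delimiter_detector from the second snippet above; 'demo.csv' is just an illustrative file name):

# Header plus nine data rows, as in the failure case above:
# 45 dots vs. 40 pipes, so the naive detector would return '.'
rows = ["c1|c2|c3|c4|c5"] + ["a.|b.|c.|d.|e."] * 9

with open('demo.csv', 'w') as f:
    f.write("\n".join(rows))

# The header-based detector still picks the real delimiter
print(delimiter_detector('demo.csv'))  # '|'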
The only case where this fails is when another special character is present in the header row and occurs in every other row exactly the same number of times as the actual delimiter:
c.1|c.2|c.3|c.4|c.5
a.|b.|c.|d.|e.
a.|b.|c.|d.|e.
In this case . is just as valid a delimiter as |. However, if there is even one row where the count is not exactly the same as in the header row, the latter approach will still identify the right delimiter. For that reason, it might be worth increasing the number of rows to check.
c.1|c.2|c.3|c.4|c.5
a.|b.|c.|d.|e.
a.|b.|c.|d.|e.
a.|b.|c.|d..|e.
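A quick sanity check (a minimal sketch, again reusing delimiter_detector from the second snippet; 'ambiguous.csv' is a made-up file name):

# The row containing 'd..' breaks the tie: '.' no longer matches the header
# frequency in every data row, while '|' still does, so '|' wins the Counter.
sample = (
    "c.1|c.2|c.3|c.4|c.5\n"
    "a.|b.|c.|d.|e.\n"
    "a.|b.|c.|d.|e.\n"
    "a.|b.|c.|d..|e.\n"
)

with open('ambiguous.csv', 'w') as f:
    f.write(sample)

print(delimiter_detector('ambiguous.csv'))  # '|'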
Some additional things to consider
The approaches above assume that the CSV file has a header row. If the file does not have a header, the method treats the first data row as the header, which can lead to mistakes in delimiter detection.
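If you are unsure whether a header is present at all, the standard library's csv.Sniffer offers a has_header() check you can run on a sample first (a minimal sketch; note that it may raise csv.Error if it cannot work out the dialect, e.g. with an exotic delimiter):

import csv

with open('test.csv', 'r') as f:
    sample = f.read(4096)

try:
    print(csv.Sniffer().has_header(sample))
except csv.Error:
    # Sniffer could not determine the dialect of the sample
    print("header detection failed")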
Another thing that frequently causes issues for me is the file encoding. Especially if you're working on Windows, Excel likes to save files with Windows-1250 encoding, so make sure you specify the right encoding when reading the file. To detect the encoding, you can use the chardet library.
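For example (a minimal sketch; chardet.detect works on raw bytes, and the 4096-byte sample size is arbitrary):

import chardet

# Read a raw byte sample and let chardet guess the encoding
with open('test.csv', 'rb') as f:
    raw_sample = f.read(4096)

detected = chardet.detect(raw_sample)       # e.g. {'encoding': 'Windows-1250', 'confidence': 0.87, ...}
encoding = detected['encoding'] or 'utf-8'  # fall back to UTF-8 if detection is inconclusive

# Pass the detected encoding to open(), e.g. inside delimiter_detector
with open('test.csv', 'r', encoding=encoding) as f:
    print(f.readline())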