1

I have csv files. Some are comma delimited, and some are tab delimited.

df = pd.read_csv(data_file, sep='\t')

Is there a way to specify either tab or comma as delimiter when using pd.read_csv()? Or, is there a way to automatically detect whether the file is tab or comma delimited? If I know that, I can use different sep='' paramters when reading the file.

marlon
  • 6,029
  • 8
  • 42
  • 76
  • CSV stands for "comma-separated values" – Erich Jan 26 '21 at 22:02
  • @Erich, I have tab separated files too with extension of '.csv' – marlon Jan 26 '21 at 22:03
  • I'm not sure what operating system you are using but file extensions are actually very ephemeral. I can label any file I want in any way I want, and I can make programs that label them how I like as well. https://tools.ietf.org/html/rfc4180#section-2 – Erich Jan 26 '21 at 22:06
  • Did you solve this ? I'm curious to know – Umar.H Feb 01 '21 at 11:59

4 Answers4

4

Recently I had a similair problem, I ended up using a different method but I explored using the Sniffer Class from the CSV standard library.

I haven't used this in production but only to help find what file types are for testing prototyping, use at your own risk!

from the documentation

"Sniffs" the format of a CSV file (i.e. delimiter, quotechar) Returns a Dialect object.

you can return the dialect object then pass dialect.delimiter to the sep arg in pd.read_csv

'text_a.csv'

cola|colb|col
A|B|C
E|F|G
A|B|C
E|F|G

'text_b.csv'

cola\tcolb\tcol
A\tB\tC
E\tF\tG
A\tB\tC
E\tF\tG
A\tB\tC

from csv import Sniffer

sniffer = Sniffer()

def detect_delim(file,num_rows,sniffer):

    with open(file,'r') as f:
        for row in range(num_rows):
            line = next(f).strip()
            delim = sniffer.sniff(line)
    
    print(delim.delimiter) # ideally you should return the dialect object - just being lazy.
    #return delim.dedelimiter

detect_delim(file='text_a.csv',num_rows=5,sniffer=sniffer)
'|'
detect_delim(file='text_b.csv',num_rows=5,sniffer=sniffer)
'\t'
Umar.H
  • 22,559
  • 7
  • 39
  • 74
0

I'd just read the first row and see which gives you more columns:

import pandas as pd
tab = pd.read_csv(data_file, nrows=1, sep='\t').shape[1]
com = pd.read_csv(data_file, nrows=1, sep=',').shape[1]
if tab > com:
    df = pd.read_csv(data_file, sep='\t')
else:
    df = pd.read_csv(data_file, sep=',')
itwasthekix
  • 585
  • 6
  • 11
  • what if the first string in one of the rows is correctly a string but has 50,000 tabs – Erich Jan 26 '21 at 22:06
  • If you know how many columns you are supposed to have you could change the if statement to that e.g. ```if tab == 8: elif com == 8: ```. Or you could set a max, e.g. ```if (tab > com) and (tab < 100): else: ```. – itwasthekix Jan 26 '21 at 22:09
  • That still misses the case where `tab == N` and `com == 8` are both true. It also misses the case where a bad file is given. This is not a tractable problem, nothing stops a user from putting the entirety of the bee movie into your program, how you handle it and what assumptions you make are what you control. You can attempt to split on both and look to see if the fields are valid, but that still leaves a few edge cases. Or you can ask the user to be explicit and reject inputs otherwise. – Erich Jan 26 '21 at 23:10
  • I mean this is definitely not production level code, but it will work just to get the ball rolling if you can make some mild assumptions about the data you have. – itwasthekix Jan 26 '21 at 23:18
0

Is this useful? You can used python regex parser with read_csv and specify different delimiters.

tboschi
  • 124
  • 11
0

Ask the user to specify how the file is formatted if you don't expect to be able to determine from the file contents itself.

E.g. a flag of some sort as --tab-delimited-file=true and then you flip the separator based on their input.

Erich
  • 1,902
  • 1
  • 17
  • 23
  • Or, you can kindly ask everyone to use the `*.tsv` extension when using tabs over commas. – Erich Jan 26 '21 at 23:28