92

Processing CSV files with csv.DictReader is great - but I have CSV files with comment lines (indicated by a hash at the start of a line), for example:

# step size=1.61853
val0,val1,val2,hybridisation,temp,smattr
0.206895,0.797923,0.202077,0.631199,0.368801,0.311052,0.688948,0.597237,0.402763
-169.32,1,1.61853,2.04069e-92,1,0.000906546,0.999093,0.241356,0.758644,0.202382
# adaptation finished

The csv module doesn't include any way to skip such lines.

I could easily do something hacky, but I imagine there's a nice way to wrap a csv.DictReader around some other iterator object, which preprocesses to discard the lines.

bad_coder
Dan Stowell

5 Answers

119

Actually this works nicely with filter:

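import csv

# Skip lines that start with '#'. In Python 3, filter() is lazy,
# so the file is still read line by line.
with open('samples.csv', newline='') as fp:
    rdr = csv.DictReader(filter(lambda row: row[0] != '#', fp))
    for row in rdr:
        print(row)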
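A variant of the same idea, as a minimal sketch: a generator expression that also skips comment lines with leading whitespace. The sample data and the helper name `read_rows` are made up for illustration, and `io.StringIO` stands in for the open file.

```python
import csv
import io

# Hypothetical sample data standing in for samples.csv
SAMPLE = """# step size=1.61853
val0,val1,val2
0.206895,0.797923,0.202077
  # adaptation finished
-169.32,1,1.61853
"""

def read_rows(lines):
    """Return rows as dicts, skipping lines whose first non-blank character is '#'."""
    data_lines = (line for line in lines if not line.lstrip().startswith('#'))
    return list(csv.DictReader(data_lines))

rows = read_rows(io.StringIO(SAMPLE))
for row in rows:
    print(row)
```

The generator expression is lazy, so lines are consumed one at a time as `csv.DictReader` asks for them.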
Gourneau
Dan Stowell
  • 22
    That will read the whole file into memory. If it isn't too large then no problem, otherwise you might want to use a generator expression or `itertools.ifilter()`. – Duncan Jan 04 '13 at 16:10
  • 51
    ...or a generator expression: `csv.DictReader(row for row in fp if not row.startswith('#'))` – Andy Mikhailenko Jan 13 '14 at 07:03
  • 10
    @Duncan no need for itertools in Python3.6, as `filter()` will return an iterator by default, therefore the file will not be loaded into memory. – The Aelfinn Mar 02 '18 at 19:03
  • pretty sure @Andy Mikhaylenko's generator expression worked really well but it doesn't any more. what up? (Python 3.7.5) – Ulf Gjerdingen Jan 19 '22 at 19:59
26

Good question. Python's CSV library lacks basic support for comments (not uncommon at the top of CSV files). While Dan Stowell's solution works for the specific case of the OP, it is limited in that # must appear as the first symbol. A more generic solution would be:

import csv

def decomment(csvfile):
    for row in csvfile:
        # Keep only the text before the first '#', minus surrounding whitespace
        raw = row.split('#')[0].strip()
        if raw:
            yield raw

with open('dummy.csv') as csvfile:
    reader = csv.reader(decomment(csvfile))
    for row in reader:
        print(row)

As an example, the following dummy.csv file:

# comment
 # comment
a,b,c # comment
1,2,3
10,20,30
# comment

returns

['a', 'b', 'c']
['1', '2', '3']
['10', '20', '30']

Of course, this works just as well with csv.DictReader().
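For instance, here is a small sketch of the `csv.DictReader` case. It repeats `decomment()` so that it runs on its own, and it uses `io.StringIO` with the dummy.csv contents from above in place of a real file (the names `DUMMY` and `records` are only for illustration):

```python
import csv
import io

def decomment(csvfile):
    for row in csvfile:
        # Keep only the text before the first '#', minus surrounding whitespace
        raw = row.split('#')[0].strip()
        if raw:
            yield raw

# Same contents as dummy.csv above
DUMMY = """# comment
 # comment
a,b,c # comment
1,2,3
10,20,30
# comment
"""

records = list(csv.DictReader(decomment(io.StringIO(DUMMY))))
for record in records:
    print(record)
```

The first non-comment line (`a,b,c`) becomes the header, so each record maps `a`, `b` and `c` to the values of one data line.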

sigvaldm
  • 2
    I believe you meant "yield row" not "yield raw" in the decomment() function. A CSV file can contain # characters in a string and it is perfectly valid. – Thibault Reuille Apr 01 '20 at 19:48
  • 1
    @ThibaultReuille: It is true that many CSV files can contain # in strings, although the CSV format is not well standardized. I meant `yield raw`. My suggestion would not deal with # in strings in any case. – sigvaldm Apr 02 '20 at 10:47
  • @ThibaultReuille: What you're pointing at is exactly why it is inadvisable to manually type a lot of code for something a library can do for you; you probably won't get all the details right the first time (for instance, you could also have newlines in strings), and it will take away time from the task you're actually solving. I consider my solution a quick fix for something that ought to have been in `csv`. If it would need considerable expansion to work for you, perhaps you should consider another csv library, for instance the one in pandas. Hope that helps. – sigvaldm Apr 02 '20 at 10:50
  • Nice, this suits my purposes as well, as it also strips out blank lines. +1! – Brian A. Henning Aug 31 '23 at 15:55
14

Another way to read a CSV file is to use pandas, which can skip comments through the `comment` parameter of `read_csv`.

Here's some sample code:

import pandas as pd

df = pd.read_csv('test.csv',
                 sep=',',       # field separator
                 comment='#',   # ignore the rest of a line after '#'
                 index_col=0,   # number or label of index column
                 skipinitialspace=True,
                 skip_blank_lines=True,
                 # pandas >= 1.3: replaces error_bad_lines=False, warn_bad_lines=True
                 on_bad_lines='warn'
                 ).sort_index()
print(df)
df.fillna('no value', inplace=True) # replace NaN with 'no value'
print(df)

For this csv file:

a,b,c,d,e
1,,16,,55#,,65##77
8,77,77,,16#86,18#
#This is a comment
13,19,25,28,82

we will get this output:

       b   c     d   e
a                     
1    NaN  16   NaN  55
8   77.0  77   NaN  16
13  19.0  25  28.0  82
           b   c         d   e
a                             
1   no value  16  no value  55
8         77  77  no value  16
13        19  25        28  82
Granny Aching
  • 2
    `pandas` is indeed a powerful library, yet it is a dependency that requires setup and learning to use. Moreover, the author had already stated in the question that he simply wanted to use the built-in `csv.DictReader` class, and relevant answers were provided years ago already. I don't understand why you add this solution as an alternative. – Lacek May 28 '19 at 13:45
  • 7
    The author of the question might not need pandas. But the purpose of this forum is more than just help each question's author with their specific problem. – Granny Aching May 28 '19 at 13:51
  • @GrannyAching What exactly does `.sort_index()` achieve here? :) – Micheal J. Roberts Sep 20 '20 at 10:54
1

Based on the answers by sigvaldm and Leonid:

import csv

def is_comment(line):
    return line.startswith('#')

def is_whitespace(line):
    return line.isspace()

def decomment(csvfile):
    # Pass through every line that is neither a comment nor blank
    for row in csvfile:
        if not is_comment(row) and not is_whitespace(row):
            yield row

with open('dummy.csv') as csvfile:
    reader = csv.reader(decomment(csvfile))
    for row in reader:
        print(row)
sailfish009
-1

Just posting the bugfix from @sigvaldm's solution.

import csv

def decomment(csvfile):
    for row in csvfile:
        raw = row.split('#')[0].strip()
        # Yield the untouched line, so '#' inside a field is preserved
        if raw:
            yield row

with open('dummy.csv') as csvfile:
    reader = csv.reader(decomment(csvfile))
    for row in reader:
        print(row)

A CSV line can contain "#" characters inside quoted strings and still be perfectly valid. The previous solution cut such lines off at the first '#' character.
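To make the difference concrete, here is a rough sketch comparing the two variants on a line with a quoted '#'. The function names `decomment_keep_row` and `decomment_truncate` and the sample data are made up for this illustration:

```python
import csv
import io

def decomment_keep_row(csvfile):
    """Variant from this answer: drop comment lines, yield other lines untouched."""
    for row in csvfile:
        raw = row.split('#')[0].strip()
        if raw:
            yield row

def decomment_truncate(csvfile):
    """Variant from the earlier answer: yield only the text before the first '#'."""
    for row in csvfile:
        raw = row.split('#')[0].strip()
        if raw:
            yield raw

# Hypothetical data: the last line has a '#' inside a quoted field
SAMPLE = '''# comment
name,tag
alice,"C# dev"
'''

kept = list(csv.reader(decomment_keep_row(io.StringIO(SAMPLE))))
cut = list(csv.reader(decomment_truncate(io.StringIO(SAMPLE))))

print(kept[1])  # the quoted field survives intact
print(cut[1])   # the field is cut off at the '#'
```

The trade-off is that yielding the original row also keeps trailing comments such as `a,b,c # comment` as part of the last field. A line whose first field starts with '#' is still dropped by both variants.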