34

I am looking for a a way to read just the header row of a large number of large CSV files.

Using Pandas, I have this method available, for each csv file:

>>> df = pd.read_csv(PATH_TO_CSV)
>>> df.columns

I could do this with just the csv module:

>>> reader = csv.DictReader(open(PATH_TO_CSV))
>>> reader.fieldnames

The problem with these is that each CSV file is 500MB+ in size, and it seems to be a gigantic waste to read in the entire file of each just to pull the header lines.

My end goal of all of this is to pull out unique column names. I can do that once I have a list of column headers that are in each of these files.

How can I extract only the header row of a CSV file, quickly?

Andy
  • 49,085
  • 60
  • 166
  • 233
  • 4
    Note that `DictReader` doesn't read the entire file... so you could just use that iteratively over the files required and build a set... I'm done something similar in an answer I've made... – Jon Clements Jul 25 '14 at 19:17

10 Answers10

37

Expanding on the answer given by Jeff It is now possbile to use pandas without actually reading any rows.

In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: pd.DataFrame(np.random.randn(10, 4), columns=list('abcd')).to_csv('test.csv', mode='w')

In [4]: pd.read_csv('test.csv', index_col=0, nrows=0).columns.tolist()
Out[4]: ['a', 'b', 'c', 'd']

pandas can have the advantage that it deals more gracefully with CSV encodings.

Jarno
  • 6,243
  • 3
  • 42
  • 57
  • 1
    Hi and great tip. I found replacing `index_col` with `header` got me an additional field name that I was missing. Otherwise, the rest works perfectly! – Mark Moretto Nov 17 '20 at 21:31
  • 1
    @MarkMoretto I think it depends upon whether you have an extra index column without a header in your CSV or not. If not it is probably clearest to set `index_col=False` as `header=0` is already kind of the default. – Jarno Nov 19 '20 at 06:49
27

I might be a little late to the party but here's one way to do it using just the Python standard library. When dealing with text data, I prefer to use Python 3 because unicode. So this is very close to your original suggestion except I'm only reading in one row rather than the whole file.

import csv    

with open(fpath, 'r') as infile:
    reader = csv.DictReader(infile)
    fieldnames = reader.fieldnames

Hopefully that helps!

Tyler
  • 1,050
  • 2
  • 14
  • 24
15

I've used iglob as an example to search for the .csv files, but one way is to use a set, then adjust as necessary, eg:

import csv
from glob import iglob

unique_headers = set()
for filename in iglob('*.csv'):
    with open(filename, 'rb') as fin:
        csvin = csv.reader(fin)
        unique_headers.update(next(csvin, []))
Jon Clements
  • 138,671
  • 33
  • 247
  • 280
  • I compared this to the answer Jeff provided. This one is running about 5 times faster than the pandas answer for a sample of my data set. I suspect it's because it doesn't read the extra data row (I appreciate the note about DictReader as well). Thanks – Andy Jul 25 '14 at 19:47
  • @Andy I suspect the real difference is not really the un-necessary reading of an additional row, but more the overhead of creating a DataFrame to do so... – Jon Clements Jul 25 '14 at 19:50
  • May I know what this sentence means ? "unique_headers.update(next(csvin, []))" @JonClements – Erik Johnsson Jun 14 '20 at 02:39
13

Here's one way. You get 1 row.

In [9]: DataFrame(np.random.randn(10,4),columns=list('abcd')).to_csv('test.csv',mode='w')

In [10]: read_csv('test.csv',index_col=0,nrows=1)
Out[10]: 
          a         b         c         d
0  0.365453  0.633631 -1.917368 -1.996505
Jeff
  • 125,376
  • 21
  • 220
  • 187
  • This does read an un-necessary row though for the sake of reading the header... but maybe I'm not entirely clear on what the OP wishes – Jon Clements Jul 25 '14 at 19:44
  • I appreciate the answer, Jeff. I compared your answer to the one provided by Jon. Both work, but this one runs about 5 times slower than the one he provided. – Andy Jul 25 '14 at 19:48
  • @Jon Clements OP need only headers but `read_csv()` doesn't run with `nrows=0` - `read_csv()` needs to read at least one row. – furas Jul 25 '14 at 19:48
  • @Andy if that matter to you then use the other soln. This is the pandas method. – Jeff Jul 25 '14 at 19:50
  • 2
    @Jeff & Jon Clements: I think you could add `header=None` to get headers as normal row - without first row of data. – furas Jul 25 '14 at 19:55
  • @furas indeed - but still creating a DataFrame without need – Jon Clements Jul 25 '14 at 19:55
  • @furas that's right. FYI you can do this much easier by just ``head -n 1 *.csv`` in linux – Jeff Jul 25 '14 at 19:57
  • @Jeff yup... lots more other ways to do this... op asked for Python though, so err... yeah... – Jon Clements Jul 25 '14 at 19:59
  • You are both right :) And I was thinking only about solution with pandas. – furas Jul 25 '14 at 20:00
12

What about:

pandas.read_csv(PATH_TO_CSV, nrows=1).columns

That'll read the first row only and return the columns found.

mdubez
  • 3,024
  • 1
  • 17
  • 10
7

you have missed nrows=1 param to read_csv

>>> df= pd.read_csv(PATH_TO_CSV, nrows=1)
>>> df.columns
Saurabh Chandra Patel
  • 12,712
  • 6
  • 88
  • 78
1

it depends on what the header will be used for, if you needed the headers for comparison purposes only (my case) this code will be simple and super fast, it will read the whole header as one string. you can transform all the collected strings together according to your needs:

for filename in glob.glob(files_path+"\*.csv"):
    with open(filename) as f:
        first_line = f.readline()
1

it is easy you can use this:

df = pd.read_csv("path.csv", skiprows=0, nrows=2)
df.columns.to_list()

In this case you can only read really few row for get your header

Sway Wu
  • 379
  • 3
  • 8
1

if you are only interested in the headers and would like to use pandas, the only extra thing you need to pass in apart from the csv file name is "nrows=0":

headers = pd.read_csv("test.csv", nrows=0)
blessedk
  • 71
  • 7
0
import pandas as pd

get_col = list(pd.read_csv("first_test_pipe.csv",sep="|",nrows=1).columns)
print(get_col)
Theo
  • 57,719
  • 8
  • 24
  • 41