12

I would like to read multiple CSV files (with a different number of columns) from a target directory into a single Python Pandas DataFrame to efficiently search and extract data.

Example file:

Events 
1,0.32,0.20,0.67
2,0.94,0.19,0.14,0.21,0.94
3,0.32,0.20,0.64,0.32
4,0.87,0.13,0.61,0.54,0.25,0.43 
5,0.62,0.21,0.77,0.44,0.16

Here is what I have so far:

# get a list of all csv files in target directory
my_dir = "C:\\Data\\"
filelist = []
os.chdir( my_dir )
for files in glob.glob( "*.csv" ) :
    filelist.append(files)

# read each csv file into single dataframe and add a filename reference column 
# (i.e. file1, file2, file 3) for each file read
df = pd.DataFrame()
columns = range(1,100)
for c, f in enumerate(filelist) :
    key = "file%i" % c
    frame = pd.read_csv( (my_dir + f), skiprows = 1, index_col=0, names=columns )
    frame['key'] = key
    df = df.append(frame,ignore_index=True)

(the indexing isn't working properly)

Essentially, the script below is exactly what I want (tried and tested) but needs to be looped through 10 or more csv files:

df1 = pd.DataFrame()
df2 = pd.DataFrame()
columns = range(1,100)
df1 = pd.read_csv("C:\\Data\\Currambene_001y09h00m_events.csv", 
                  skiprows = 1, index_col=0, names=columns)
df2 = pd.read_csv("C:\\Data\\Currambene_001y12h00m_events.csv", 
                  skiprows = 1, index_col=0, names=columns)
keys = [('file1'), ('file2')]
df = pd.concat([df1, df2], keys=keys, names=['fileno'])

I have found many related links, however I am still not able to get this to work:

Community
  • 1
  • 1
mellover
  • 163
  • 2
  • 2
  • 6
  • `pandas.concat` would allow you to use a list of any length containing DataFrames. Feed the first argument with a single list containing all your files and you won't have to loop the script anymore. – dmvianna Jan 15 '14 at 23:45

1 Answers1

15

You need to decide in what axis you want to append your files. Pandas will always try to do the right thing by:

  1. Assuming that each column from each file is different, and appending digits to columns with similar names across files if necessary, so that they don't get mixed;
  2. Items that belong to the same row index across files are placed side by side, under their respective columns.

The trick to appending efficiently is to tip the files sideways, so you get the desired behaviour to match what pandas.concat will be doing. This is my recipe:

from pandas import *
files = !ls *.csv # IPython magic
d = concat([read_csv(f, index_col=0, header=None, axis=1) for f in files], keys=files)

Notice that read_csv is transposed with axis=1, so it will be concatenated on the column axis, preserving its names. If you need, you can transpose the resulting DataFrame back with d.T.

EDIT:

For different number of columns in each source file, you'll need to supply a header. I understand you don't have a header in your source files, so let's create one with a simple function:

def reader(f):
    d = read_csv(f, index_col=0, header=None, axis=1)
    d.columns = range(d.shape[1])
    return d

df = concat([reader(f) for f in files], keys=files)
dmvianna
  • 15,088
  • 18
  • 77
  • 106
  • That worked perfectly when each row had the same number of columns. Any suggestion for different columns lengths??? `d = pd.concat([(read_csv((TP_dir + f), index_col=0, header=0).T) for f in filelist], keys=filelist)` – mellover Jan 16 '14 at 00:41
  • 1
    That's a nice ipython trick! ... Can't you use axis=1 to concat to avoid .T? – Andy Hayden Jan 16 '14 at 00:42
  • @mellover, just give it a header. Either from the file or as a range, as you did in your script. – dmvianna Jan 16 '14 at 00:55
  • @AndyHayden, YES! I must have been either using pandas 0.12 or doing something else wrong when I tried axis=1 with `concat` and failed. Not it works. :) – dmvianna Jan 16 '14 at 01:08
  • Worked perfectly after this adjustment. Thanks – mellover Jan 16 '14 at 01:10