22

I have several csv files in a single folder and I want to open them all in one dataframe and insert a new column with the associated filename. So far I've coded the following:

import pandas as pd
import glob, os
df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('path/*.csv'))))
df['filename']= os.path.basename(csv)
df

This gives me the dataframe I want but in the new column 'filename' it's only listing the last filename in the folder for every row. I'm looking for each row to be populated with it's associated csv file. Not just the last file in the folder.

Any assistance for this newbie is much appreciated.

AChampion
  • 29,683
  • 4
  • 59
  • 75
amwade2
  • 251
  • 1
  • 3
  • 6

2 Answers2

30

I think you need assign for add new column in loop, also parameter ignore_index=True was added to concat for remove duplicates in index:

Files for test are a.csv, b.csv, c.csv.

import pandas as pd
import glob, os


files = glob.glob('samples_for_so/*.csv')
print (files)
#['samples_for_so\\a.csv', 'samples_for_so\\b.csv', 'samples_for_so\\c.csv']


df = pd.concat([pd.read_csv(fp).assign(New=os.path.basename(fp)) for fp in files])
print (df)
   a  b  c  d    New
0  0  1  2  5  a.csv
1  1  5  8  3  a.csv
0  0  9  6  5  b.csv
1  1  6  4  2  b.csv
0  0  7  1  7  c.csv
1  1  3  2  6  c.csv

files = glob.glob('samples_for_so/*.csv')
df = pd.concat([pd.read_csv(fp).assign(New=os.path.basename(fp).split('.')[0]) 
       for fp in files])
print (df)
   a  b  c  d New
0  0  1  2  5   a
1  1  5  8  3   a
2  0  9  6  5   b
3  1  6  4  2   b
4  0  7  1  7   c
5  1  3  2  6   c
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
1

Firstly, you have no csv variable defined.

But anyway, this behaviour makes sense, because you are using the csv at the end so it'll be set to the last file. Ideally, you can use glob again to get all filenames, then set that as a new column.

#this is a Python list containing filenames
csvs = glob.glob(os.path.join('path/*.csv'))

#now set the csv into a pd series
csv_paths = pd.Series(csvs)

df['file_name'] = csv_paths.values
Abid Hasan
  • 648
  • 4
  • 10
  • 7
    I get `ValueError: Length of values does not match length of index`, because each file have more as one row of data. – jezrael Mar 13 '17 at 07:22