
Right now I'm creating a program which combines CSV files into one, without duplicating the columns they share. Each file's new columns need to be added next to the existing ones.

As of right now I'm able to get the files, but I can't figure out how to read each CSV into a data frame, merge all of those data frames together, and write out a single CSV file. Right now I'm testing this with three CSV files that share a common ID column. What I have right now is as follows:

    os.chdir(filedname)
    data = pd.merge([pd.DataFrame.from_csv(f)
                     for f in glob.glob("*.csv")], on='ID')
    data.to_csv('merged.csv')

The files look like this:

(File 1)  (File 2)
ID  BLA   ID  X   
1   2     1   55
2   3     2   2
3   4     3   12
4   5     4   52

Every column besides the ID column, across each CSV file in the directory, should be merged into one CSV file like this:

ID BLA X
1  2   55
2  3   2
3  4   12
4  5   52 

Any advice would be great in helping me solve this problem.

waughsh
  • do you have some tiny example? For example, if I have 2 csv's with column 'B', but the data in these columns is different, what do you want in column 'B' of the resulting DataFrame? – Roman Pekar Nov 13 '13 at 07:25
  • I added some extra info and an example! Sorry about that. – waughsh Nov 13 '13 at 20:36
  • ID's are different in two files, so do you want to join on ID or just add new columns without looking at ID column at all? – Roman Pekar Nov 14 '13 at 14:11
  • Oy vey, sorry again, I edited to make the ID the same. – waughsh Nov 14 '13 at 14:42

1 Answer


simple example:

# Demo DataFrames
df1 = pd.DataFrame([[1,2,3],[2,3,4],[3,1,3]], columns=['ID','BLA','X'])
df2 = pd.DataFrame([[1,2,3],[2,5,4],[3,10,100]], columns=['ID','X','BLA'])
df3 = pd.DataFrame([[1,2,3],[2,8,7],[3,0,0]], columns=['ID','BLA','D'])

# Demo DataFrames sequence
dfs = [df1,df2,df3]

# Merge DataFrames, starting from the first and keeping
# the first occurrence of every column
df = dfs[0]
for d in dfs[1:]:
    # keep ID plus only the columns not already merged in
    cols = [x for x in d.columns if x not in df.columns or x == 'ID']
    df = pd.merge(df, d[cols], on='ID', how='outer')

# result
   ID  BLA  X  D
0   1    2  3  3
1   2    3  4  7
2   3    1  3  0
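To see why the cols filter matters: df1 and df2 both carry BLA and X, so on the second pass only ID survives the filter and df2 contributes nothing new. A small check (the frames are reproduced so the snippet stands alone):

```python
import pandas as pd

df1 = pd.DataFrame([[1,2,3],[2,3,4],[3,1,3]], columns=['ID','BLA','X'])
df2 = pd.DataFrame([[1,2,3],[2,5,4],[3,10,100]], columns=['ID','X','BLA'])

# df2's X and BLA already exist in df1, so only ID survives the filter
cols = [x for x in df2.columns if x not in df1.columns or x == 'ID']
print(cols)  # ['ID']
```

That is why the first file that contributes a column "wins" and later duplicates are skipped.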

in your case it could be something like:

data = [pd.read_csv(f) for f in glob.glob("*.csv")]
df = data[0]
for d in data[1:]:
    cols = [x for x in d.columns if x not in df.columns or x == 'ID']
    df = pd.merge(df, d[cols], on='ID', how='outer')
df.to_csv('merged.csv', index=False)
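On current pandas versions, the same fold can also be written with functools.reduce instead of an explicit loop. A self-contained sketch (the demo file names here are made up so the snippet runs on its own; the ID column and the glob pattern follow the question):

```python
import glob
from functools import reduce

import pandas as pd

# write two demo files mirroring File 1 and File 2 from the question
pd.DataFrame({'ID': [1, 2, 3, 4], 'BLA': [2, 3, 4, 5]}).to_csv('demo1.csv', index=False)
pd.DataFrame({'ID': [1, 2, 3, 4], 'X': [55, 2, 12, 52]}).to_csv('demo2.csv', index=False)

def merge_keep_first(left, right):
    # keep ID plus only the columns not already merged in
    cols = [c for c in right.columns if c == 'ID' or c not in left.columns]
    return pd.merge(left, right[cols], on='ID', how='outer')

frames = [pd.read_csv(f) for f in sorted(glob.glob('demo*.csv'))]
merged = reduce(merge_keep_first, frames)
merged.to_csv('merged.csv', index=False)
print(merged)
```

sorted() is only there to make the file order, and hence the column order, deterministic.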
Roman Pekar
  • I wanted to create a loop so it would add the columns from each CSV file located in a given directory to one main "merged" CSV. Does that help? – waughsh Nov 13 '13 at 21:12
  • My output is the first csv file (File 1) ID BLA 1 2 2 3 3 4 4 5 – waughsh Nov 14 '13 at 17:48
  • I also get this error AssertionError: cannot create BlockManager._ref_locs because block [FloatBlock: [ID, NCELLS, AREA, MEAN, STD, NCELLS_, AREA_, MEAN_, STD_, AREA_, MEAN_, STD_], 12 x 756, dtype float64] with duplicate items [Index([u'ID', u'NCELLS', u'AREA', u'MEAN', u'STD', u'NCELLS_', u'AREA_', u'MEAN_', u'STD_', u'NCELLS_', u'AREA_', u'MEAN_', u'STD_'], dtype=object)] does not have _ref_locs set – waughsh Nov 14 '13 at 17:55
  • @user2986038 changed again – Roman Pekar Nov 14 '13 at 19:15