2

I have two pandas dataframes: one:

import pandas as pd
df1 = pd.read_csv('filename1.csv')
df1
     A     B
0    1     22
1    2     15
2    5     99
3    6     1
....

and two

df2 = pd.read_csv('filename1.csv')
df2
     A     B
0    1     6
1    3     52
2    4     15
3    5     62
...    

I would like to merge these dataframes into a single dataframe, with column A as the index for this new dataframe.

The columns are filenames, the rows are the values for 'A'.

If values do not exist for these index, NaN then exists. The column names should be the filenames from the *csv above.

    filename1    filename2   
1   22           6
2   15           NaN
3   NaN          52
4   NaN          15
5   99           62
6   1            NaN

How does one do this? For two files, one could use pandas.merge(), but what is dozens of the orginal dataframes exists?

ShanZhengYang
  • 16,511
  • 49
  • 132
  • 234

1 Answers1

2
files = ['file1', 'file2']

def read(f):
    f = f + '.csv'
    df = pd.read_csv(f, usecols=['A', 'B'])
    return df.drop_duplicates(subset=['A']).set_index('A').B

pd.concat([read(f) for f in files], axis=1, keys=files)

enter image description here

piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • your solution is much more precise, i wasn't careful enough, when reading OP's question... – MaxU - stand with Ukraine Aug 02 '16 at 22:06
  • @piRSquared Thanks for this. When I try this for multiple files, I simply get multiple dataframes into one dataframe, with both column A and B include per each file. For multiple files, how do I explicitly index on column A? – ShanZhengYang Aug 02 '16 at 22:36
  • There are some files where A is not the first column. Let's say I merge 20. Would it be easier to simply pre-process the files first and drop these (unnecessary) files such that A is the first? – ShanZhengYang Aug 02 '16 at 22:53
  • @piRSquared Yes, the other columns could be dropped. This becomes clumsy for many files. I suppose you should define a function. – ShanZhengYang Aug 02 '16 at 23:00
  • I get the error: "InvalidIndexError: Reindexing only valid with uniquely valued Index objects" – ShanZhengYang Aug 02 '16 at 23:11
  • @ShanZhengYang you must have non-unique column A. Try updated post. – piRSquared Aug 02 '16 at 23:13