2

I have a list with different dataframes.I need to merge the dataframes with the same column names into one and write it to a CSV output or single dataframe.

I have managed to get the column names,but not able to think of good logic to perform the above operation.

My case:

There is MERGED_LIST with 9 dataframe elements.The column names are as below

array(['A','B','C']) - 10 rows
array(['A','B','C']) - 15 rows
array(['W','X','Y','Z']) - 10 rows
array(['W','X','Y','Z']) - 20 rows
array(['W','X','Y','Z']) - 45 rows
array(['W','X','Y','Z']) - 30 rows
array(['W','X','Y','Z']) - 5 rows
array(['H']) - 50 rows

Final outpout needed:

CSV1 or DF1:

A B C 

with 25 rows

CSV2 or DF2:

W X Y Z

with 110 rows

CSV3 or DF3:

H 

with 50 rows

marupav
  • 345
  • 1
  • 3
  • 15
  • please provide [minimal complete example](http://stackoverflow.com/help/mcve) – Sergey Antopolskiy Jan 19 '16 at 09:26
  • Did you try to use `pd.concat` ? – Anton Protopopov Jan 19 '16 at 09:26
  • @AntonProtopopov...I have considered pd.concat for individual dataframes and got it earlier.Here,I am looking for some help in column matching (as they are arrays) logic .I have an idea at high level,but as I am new to python not able to frame it properly.My logic is as below: 1. Add first dataframe separately. 2.Read columns of second dataframe and compare to first one 3.If columns match then append to earlier dataframe. 4 Loop again 5.If columns dont match create separate dataframe and loop until achieved.Please help – marupav Jan 19 '16 at 09:32
  • @SergeyAntopolskiy I am working on it and will post – marupav Jan 19 '16 at 09:33
  • @marupav So let me get this straight. You need an automated way to see which of the original dataframes match columns, and output separate combined dataframes for each of those clusters? Another question: is there ever a partial match (e.g. on column is matched, which the others are not)? What do you want to do in this case? – Sergey Antopolskiy Jan 19 '16 at 09:38
  • @SergeyAntopolskiy ..You got it right.I am trying to automate the same.There are no partial matches to large extent (there may be about 5% or so which is not a problem).Columns match on complete set whether there are 2 , 5 or single column (repetitive pattern is observed). – marupav Jan 19 '16 at 09:41
  • In above case, A B C/ W X Y Z etc are column names – marupav Jan 19 '16 at 09:48
  • @marupav in this case I think the most elegant way would be to start with pd.concat, which will pull all of your original dataframes into a big one, merging them on column names, and then work on splitting it into dataframes without NaNs. There must be an efficient way to do it in pandas, but I cannot think of one right now. – Sergey Antopolskiy Jan 19 '16 at 09:51
  • Will try it.But the challenge I see that number of columns are varying in each case and combing into one has to be checked. – marupav Jan 19 '16 at 09:59
  • @marupav you can definitely trust pandas on combining with pd.concat(,axis=0). It will do its job. Now, the varying number of columns presents a challenge, but it shouldn't matter if you find a good algorithm. I would start by exploring this direction: select a column, run method notnull(), then check result against the same for the other columns -- this will give you which columns group together. then splitting is a matter technique. – Sergey Antopolskiy Jan 19 '16 at 10:24
  • dfx = merged_list[0] glist = [] for i in range(1,len(merged_list)): if np.array_equal(merged_list[i-1].columns.values,merged_list[i].columns.values): dfx = pd.concat([dfx,merged_list[i]], axis=0) else: dfx=pd.DataFrame() dfx = merged_list[i] glist.append(dfx) – marupav Jan 19 '16 at 10:28
  • Tried the above code,but it is not giving the desired results after 2nd dataframe match.Guess something is wrong with logic. – marupav Jan 19 '16 at 10:29

1 Answers1

0

Not sure if this is what you are looking for. Here's my response. Let me know if this helps. If not, please provide more details so we can help you.

I know this is an old post but wanted to see if this is what you were looking for.

I would also recommend that you review the best post I found on Pandas Merging 101 and Pandas .concat responses.

import pandas as pd

df1 = pd.DataFrame({'A':['A1']*10,
                   'B':['B1']*10,
                   'C':['C1']*10})

df2 = pd.DataFrame({'A':['A2']*15,
                   'B':['B2']*15,
                   'C':['C2']*15})

df3 = pd.DataFrame({'W':['W3']*10,
                   'X':['X3']*10,
                   'Y':['Y3']*10,
                   'Z':['Z3']*10})

df4 = pd.DataFrame({'W':['W4']*20,
                   'X':['X4']*20,
                   'Y':['Y4']*20,
                   'Z':['Z4']*20})

df5 = pd.DataFrame({'W':['W5']*45,
                   'X':['X5']*45,
                   'Y':['Y5']*45,
                   'Z':['Z5']*45})

df6 = pd.DataFrame({'W':['W6']*30,
                   'X':['X6']*30,
                   'Y':['Y6']*30,
                   'Z':['Z6']*30})

df7 = pd.DataFrame({'W':['W7']*5,
                   'X':['X7']*5,
                   'Y':['Y7']*5,
                   'Z':['Z7']*5})

dfH = pd.DataFrame({'H':['H8']*50})

dfABC  = df1.merge(df2, on=['A','B','C'],how='outer')
dfWXYZ = pd.concat([df3,df4,df5,df6,df7], axis=0,sort=False)

print(dfABC)
print(dfWXYZ)
print(dfH)

The output will be as follows:

dfABC results in 25 rows (10 + 15)

dfWXYZ results in 110 rows (10 + 20 + 45 + 30 + 5)

dfH results in 50 rows
Joe Ferndz
  • 8,417
  • 2
  • 13
  • 33