
I'm trying to merge 3 dataframes by index, so far unsuccessfully.

Here is the code:

import pandas as pd
from functools import reduce

#identifying csvs

x='/home/'
csvpaths = ("Data1.csv", "Data2.csv", "Data3.csv")
dfs = list()   # an empty list

#creating dataframes based on number of csvs

for i in range(len(csvpaths)):
    dfs.append(pd.read_csv(x + csvpaths[i], index_col=0))
print(dfs[1])

#creating suffix for each dataframe's columns
S=[]

for y in csvpaths:
    s=str(y).split('.csv')[0]
    S.append(s)
print(S)



#merging attempt
dfx = lambda a,b: pd.merge(a,b,on='SHIP_ID',suffixes=(S)), dfs

print(dfx)
print(dfx.columns)

If I try to export it as a CSV I get the following error (and a similar error when I try to print dfx.columns): 'tuple' object has no attribute 'to_csv'
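I suspect the trailing `, dfs` just makes dfx a (lambda, list) tuple rather than actually calling functools.reduce, which would explain the message; a tiny sketch with two made-up frames reproduces the same error:

import pandas as pd

a = pd.DataFrame({"SHIP_ID": [1, 2], "v": [10, 20]})
b = pd.DataFrame({"SHIP_ID": [1, 2], "w": [30, 40]})

dfx = lambda p, q: pd.merge(p, q, on="SHIP_ID"), [a, b]   # the comma builds a tuple
print(type(dfx))          # <class 'tuple'> - the lambda is never applied
# dfx.to_csv("out.csv")   # AttributeError: 'tuple' object has no attribute 'to_csv'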

The output I want is a merge of the 3 dataframes as follows (with the respective suffixes). Please help.

[Note: the table below is very simplified; the original table consists of dozens of columns and thousands of rows, hence the need for a practical merging method.]

[screenshot of the desired merged table]

The Oracle

1 Answer


Try:

for s,el in zip(suffixes, dfs):
    el.columns=[str(col)+s for col in el.columns]

dfx=pd.concat(dfs, ignore_index=True, sort=False, axis=1)

For the test case I used:

import pandas as pd

dfs = [
    pd.DataFrame({"x": [1, 2, 7], "y": list("ghi")}),
    pd.DataFrame({"x": [5, 6], "z": [4, 4]}),
    pd.DataFrame({"x": list("acgjksd")}),
]

suffixes=["_1", "_2", "_3"]

for s,el in zip(suffixes, dfs):
    el.columns=[str(col)+s for col in el.columns]

>>> pd.concat(dfs, ignore_index=True, sort=False, axis=1)

   x_1  y_1  x_2  z_2 x_3
0  1.0    g  5.0  4.0   a
1  2.0    h  6.0  4.0   c
2  7.0    i  NaN  NaN   g
3  NaN  NaN  NaN  NaN   j
4  NaN  NaN  NaN  NaN   k
5  NaN  NaN  NaN  NaN   s
6  NaN  NaN  NaN  NaN   d


Edit:

for s, el in zip(suffixes, dfs):
    el.set_index('ID', inplace=True)                   # key each frame on ID first, so the key itself is not suffixed
    el.columns = [str(col) + s for col in el.columns]  # then suffix the remaining columns

dfx = pd.concat(dfs, ignore_index=False, sort=False, axis=1).reset_index()
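If you would rather stick with merge/functools.reduce, here is a minimal sketch of the same full outer join, assuming ID is kept as a regular column in every frame (rather than being set as the index, as in the loop above) and the non-key columns already carry their suffixes:

from functools import reduce
import pandas as pd

# chain pairwise outer merges across the whole list, with ID as the key
dfx = reduce(lambda a, b: pd.merge(a, b, on='ID', how='outer'), dfs)

Because the other columns are already suffixed, ID is the only shared column name, so no suffixes= argument is needed.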
Grzegorz Skibinski
  • The suffix bit worked. However, the merging of the dataframes didn't work because my index is a serial number and is not identical across the dataframes. Is there a way around it using merge or functools/reduce methods? – The Oracle May 05 '20 at 07:43
  • Yes, try now - just `ignore_index=True` and `sort=False` – Grzegorz Skibinski May 05 '20 at 07:46
  • Hi, it didn't work. Instead I created an initial merged dataframe from the first 2 dfs, and then a for loop for the remaining dfs (in the example above I have 3 dfs, but I'm writing code that has to deal with 3 or more), as follows: `df=pd.merge(dfs[0], dfs[1], on='ID', how='outer')`, then `L=0`, `for i in range(len(dfs)-2): L=L+1; df=pd.merge(df, dfs[L+1], on='ID', how='outer')` – The Oracle May 05 '20 at 19:26
  • Hmm, it's a bit overkill; so you essentially want to do a full outer join for all dataframes within an array, right, with `ID` being the key? I think my **Edit** should do the trick then – Grzegorz Skibinski May 05 '20 at 20:03
  • Yes; however, the data I am working with does not have an equal number of rows in each dataframe, hence that might be the reason for the error – The Oracle May 05 '20 at 20:09
  • Hm, that shouldn't really matter; what error are you getting? Also - have you tried the edit part (at the end of my answer)? – Grzegorz Skibinski May 05 '20 at 20:54
  • This is the error: `ValueError: Shape of passed values is (5331, 277), indices imply (5327, 277)` – The Oracle May 05 '20 at 21:24
  • Doesn't sound like an error from `pd.concat(...)` - isn't this something with `pd.read_csv(...)`? (Not super familiar with using it, so I can only speculate, but maybe there are some unhandled empty lines.) I would try to debug it first, e.g. by printing every df before you append it; see the sketch after this thread – Grzegorz Skibinski May 05 '20 at 21:33
  • Ya, I have tried pd.concat on a more simplified list of dataframes and it's working fine. Not sure why I'm getting the error on the larger/more complex dfs; it might be due to the reason you suggested. I'll look into it and see – The Oracle May 05 '20 at 21:48
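Here is a minimal sketch of the debugging step suggested above, reusing the x and csvpaths variables from the question; the empty-row and duplicate-ID checks are only guesses at what could make the shapes disagree:

import pandas as pd

x = '/home/'
csvpaths = ("Data1.csv", "Data2.csv", "Data3.csv")

for path in csvpaths:
    df = pd.read_csv(x + path, index_col=0)
    print(path, df.shape)                                   # row/column counts per file
    print("all-NaN rows:", df.isna().all(axis=1).sum())     # possible unhandled empty lines
    print("duplicated IDs:", df.index.duplicated().sum())   # duplicate index values can also confuse an axis=1 concat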