
I have three dataframes, each with more than 71K rows. Below are samples.

df_1 = pd.DataFrame({'Device_ID':[1001,1034,1223,1001],'Col_A':[45,56,78,33]})
df_2 = pd.DataFrame({'Device_ID':[1001,1034,1223,1001,1887],'Col_B':[35,46,78,33,66]})
df_3 = pd.DataFrame({'Device_ID':[1001,1034,1223,1001,1887,1223],'Col_C':[5,14,8,13,16,8]})

Edit: As suggested, below is my desired output:

df_final
Device_ID      Col_A    Col_B     Col_C
1001           45       35         5
1034           56       46         14
1223           78       78         8
1001           33       33         13
1887           NaN      66         16
1223           NaN      NaN        8

Using `pd.merge()`, or `df_1.set_index('Device_ID').join([df_2.set_index('Device_ID'), df_3.set_index('Device_ID')], on='Device_ID')`, takes a very long time. One reason is the repeated values of `Device_ID`.
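To illustrate the blow-up (a minimal sketch using the sample frames above; my real data is much larger):

```python
import pandas as pd

df_1 = pd.DataFrame({'Device_ID': [1001, 1034, 1223, 1001], 'Col_A': [45, 56, 78, 33]})
df_3 = pd.DataFrame({'Device_ID': [1001, 1034, 1223, 1001, 1887, 1223], 'Col_C': [5, 14, 8, 13, 16, 8]})

# merge pairs every row with a given key against every matching row on
# the other side: the two 1001 rows in df_1 each pair with the two 1001
# rows in df_3, giving 2 * 2 = 4 output rows for that key alone.
merged = df_1.merge(df_3, on='Device_ID')
print(len(merged))  # 7 rows from a 4-row and a 6-row input
```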

I am aware of the `reduce` approach, but I suspect it would run into the same problem.

Is there any better and efficient way?

pythondumb
  • How duplicated are the keys? You could be generating a massive DataFrame with a triple merge with overlapping keys, possibly generating trillions of rows, if not more. – ALollz Mar 09 '19 at 02:19
  • so what do you suggest? I should use `groupby` and then `merge`? – pythondumb Mar 09 '19 at 02:26
  • Please provide an example of output you are trying to achieve. What would you like to do about those duplicate values? – AlexK Mar 09 '19 at 02:41
  • Possible duplicate of https://stackoverflow.com/q/40575486/9987623 – AlexK Mar 09 '19 at 02:42

1 Answer


To get your desired outcome, you can use this:

result = (pd.concat([df_1.drop('Device_ID', axis=1),
                     df_2.drop('Device_ID', axis=1),
                     df_3], axis=1)
            .set_index('Device_ID'))

If you don't want Device_ID as the index, you can remove the set_index part. Also note that because of the NaNs in some columns (Col_A and Col_B) of the final dataframe, pandas will cast the non-missing values to floats, since NaN can't be stored in an integer array (unless you are on pandas 0.24 or later, which introduced a nullable integer dtype, `Int64`).
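A quick sketch of that cast in action, plus the opt-in conversion to the nullable `Int64` dtype on pandas 0.24+:

```python
import pandas as pd

df_1 = pd.DataFrame({'Device_ID': [1001, 1034, 1223, 1001], 'Col_A': [45, 56, 78, 33]})
df_2 = pd.DataFrame({'Device_ID': [1001, 1034, 1223, 1001, 1887], 'Col_B': [35, 46, 78, 33, 66]})
df_3 = pd.DataFrame({'Device_ID': [1001, 1034, 1223, 1001, 1887, 1223], 'Col_C': [5, 14, 8, 13, 16, 8]})

# concat on axis=1 aligns the frames on their (0..n) row index, padding
# the shorter frames with NaN, so no key matching is done at all.
result = (pd.concat([df_1.drop('Device_ID', axis=1),
                     df_2.drop('Device_ID', axis=1),
                     df_3], axis=1)
            .set_index('Device_ID'))

# Col_A and Col_B were upcast to float64 because of the NaNs:
print(result.dtypes)

# With pandas >= 0.24, convert to the nullable integer dtype instead:
result_int = result.astype('Int64')
print(result_int.dtypes)
```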

AlexK