1

I would like to join two dataframes (df1,df2), but i can't figure it.

import pandas as pd  
data = {'Step_ID': ["Step1", "Step1", "Step1", "Step2", "Step2", "Step3", "Step3"],    
        'value_01': [2, 2.3, 2.2, 0, 0, 5, 5.2]}  
df1 = pd.DataFrame(data) 

data = {'Step_ID': ["Step1", "Step1", "Step1", "Step1", "Step2", "Step2", "Step2", "Step3", "Step3", "Step3"],    
        'value_02': [2.3, 2.5, 2.1, 2.5, 0, 0, 0, 5.1, 5.6, 5.8]}  
df2 = pd.DataFrame(data) 

I would like to merge the on the column "Step_ID" as follows:

enter image description here

I tried several merges and its settings, but without any sucess.

pd.merge(df1, df2, left_on = ['Step_ID'], right_on = ['Step_ID'], how = 'outer')

The closest solution i have done with the following code, but it is not as required:

df1.combine_first(df2)

Is there any possibility to join those two dataframe in the required way? See the picture above.

Parsyk
  • 321
  • 1
  • 3
  • 11

1 Answers1

1

We can try with cumcount + merge

new_df = df1.assign(index_count=df1.groupby('Step_ID').cumcount())\
            .merge(df2.assign(index_count=df2.groupby('Step_ID').cumcount()),
                   on=['Step_ID', 'index_count'], how='outer')\
            .sort_values(['Step_ID', 'index_count'])\
            .drop('index_count', axis=1)

print(new_df)

  Step_ID  value_01  value_02
0   Step1       2.0       2.3
1   Step1       2.3       2.5
2   Step1       2.2       2.1
7   Step1       NaN       2.5
3   Step2       0.0       0.0
4   Step2       0.0       0.0
8   Step2       NaN       0.0
5   Step3       5.0       5.1
6   Step3       5.2       5.6
9   Step3       NaN       5.8
ansev
  • 30,322
  • 5
  • 17
  • 31
  • Thank you, it works perfect. I would like to import multiple csv files into pandas and merge into one DataFrame as above - not only df1 and df2. Hier is the solution: https://stackoverflow.com/questions/20906474/import-multiple-csv-files-into-pandas-and-concatenate-into-one-dataframe, but i am not sure, how to manage it. Could you please help me out? – Parsyk May 12 '21 at 21:38
  • you just have to read each one with `pd.read_csv` df1 = pd.read_csv('path d1') ... – ansev May 12 '21 at 21:42
  • Sorry, i did not get it: `import glob path = r'C:\Users\xxx' all_files = glob.glob(path + "/df*.csv") li = [] for filename in all_files: df1 = pd.read_csv(filename) li.append(df1) frame = df1.assign(index_count=df1.groupby('Step_ID').cumcount())\ .merge(df1.assign(index_count=df1.groupby('Step_ID').cumcount()), on=['Step_ID', 'index_count'], how='outer')\ .sort_values(['Step_ID', 'index_count'])\ .drop('index_count', axis=1)` How to transform df1 and df2 from your code to df1 only? – Parsyk May 12 '21 at 22:15