2

In Pandas merge function you can merge two data frames, but I need to merge N, similar to an SQL statement where you combine N tables in a full outer join. For example, I need to merge the three data frames below by 'type_1', 'subject_id_1', 'type_2', 'subject_id_2' and 'type_3', 'subject_id_3'. Is this possible?

import pandas as pd

raw_data = {
        'type_1': [1, 1, 0, 0, 1],
        'subject_id_1': ['1', '2', '3', '4', '5'],
        'first_name_1': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung']}
df_a = pd.DataFrame(raw_data, columns = ['type_1', 'subject_id_1', 'first_name_1'])

raw_datab = {
        'type_2': [1, 1, 0, 0, 0],
        'subject_id_2': ['4', '5', '6', '7', '8'],
        'first_name_2': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty']}
df_b = pd.DataFrame(raw_datab, columns = ['type_2', 'subject_id_2', 'first_name_2'])

raw_datac = {
        'type_3': [1, 1],
        'subject_id_3': ['4', '5'],
        'first_name_3': ['Joe', 'Paul']}
df_c = pd.DataFrame(raw_datac, columns = ['type_3', 'subject_id_3', 'first_name_3'])

### need to include here the third data frame
merged = pd.merge(df_a, df_b, left_on=['type_1','subject_id_1'], 
                 right_on = ['type_2','subject_id_2'], how='outer')  
print(merged)

Note: The names of the fields to join are different in each data frame.

ps0604
  • 1,227
  • 23
  • 133
  • 330

1 Answers1

2

I believe need join by indices created by set_index with concat:

dfs = [df_a.set_index(['type_1','subject_id_1']),
       df_b.set_index(['type_2','subject_id_2']),
       df_c.set_index(['type_3','subject_id_3'])]

df = pd.concat(dfs, axis=1)
print (df)
    first_name_1 first_name_2 first_name_3
0 3        Allen          NaN          NaN
  4        Alice          NaN          NaN
  6          NaN         Bran          NaN
  7          NaN        Bryce          NaN
  8          NaN        Betty          NaN
1 1         Alex          NaN          NaN
  2          Amy          NaN          NaN
  4          NaN        Billy          Joe
  5       Ayoung        Brian         Paul

df = pd.concat(dfs, axis=1).rename_axis(('type','subject_id')).reset_index()
print (df)
   type subject_id first_name_1 first_name_2 first_name_3
0     0          3        Allen          NaN          NaN
1     0          4        Alice          NaN          NaN
2     0          6          NaN         Bran          NaN
3     0          7          NaN        Bryce          NaN
4     0          8          NaN        Betty          NaN
5     1          1         Alex          NaN          NaN
6     1          2          Amy          NaN          NaN
7     1          4          NaN        Billy          Joe
8     1          5       Ayoung        Brian         Paul
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • When I run your `df = pd.concat(dfs, axis=1)` example, `first_name_3` is always `NaN`, did you run your example with the data? – ps0604 Apr 10 '18 at 12:36
  • @ps0604 - I use your data, only in last I use `df_c = pd.DataFrame(raw_datac, columns = ['type_3', 'subject_id_3', 'first_name_3'])` instead `df_c = pd.DataFrame(raw_datab, columns = ['type_3', 'subject_id_3', 'first_name_3'])` – jezrael Apr 10 '18 at 12:39
  • there is typo `raw_datac` – jezrael Apr 10 '18 at 12:39
  • 1
    Thanks, I fixed the typo – ps0604 Apr 10 '18 at 12:47
  • I need to get the concat result like an SQL response where all the columns are present. Pandas removes from each input data frame the columns that are in the index. Is there a function in Pandas that retrieves all the columns, including the index columns? – ps0604 Apr 11 '18 at 02:49
  • @ps0604 - Do you need finally ouput like `merged` only added `df_c` ? I think it is not possible in pandas, because native dont supported. Also `merge` by multiple dataframes is possible only in `reduce` hack in same columns names, also not native supported. So sorry. :( – jezrael Apr 11 '18 at 06:37