
I am merging several data frames into one data frame in a for loop, something like this:

import pandas as pd
import numpy as np

df1 = pd.DataFrame(data=np.random.randint(0, 100, (2, 5)), columns=list('ABCDE'))
df2 = pd.DataFrame(data=np.random.randint(0, 100, (2, 5)), columns=list('GHABC'))
df1 = df1.merge(df2, how='outer', left_index=True, right_index=True,
                suffixes=('', '_second'))

Several columns in these data frames have the same names, so each time there is such a column I add a suffix. It is becoming a mess. In addition, I would like to be able to quickly access all the columns from the first table, the second one, etc. Is there a way to merge these data frames but keep the columns grouped by source, so that I don't need to rename columns and can access all the columns of each data set more easily?

Laleh

2 Answers


Here is a way to merge the dataframes horizontally with the help of a MultiIndex, which has some advantages compared to a vertical merge. For example, you won't have lots of NaN fields, and the dtypes won't change from int to float as they do in the case of a vertical merge.

import numpy as np
import pandas as pd


df1 = pd.DataFrame(data=np.random.randint(0, 100, (2, 5)),
                   columns=list('ABCDE'))
df2 = pd.DataFrame(data=np.random.randint(0, 100, (2, 5)),
                   columns=list('GHABC'))
dfs = [df1, df2]
result = pd.concat(dfs, axis=1, keys=range(len(dfs)))
print(result)

This will give:

    0                   1                
    A   B   C   D   E   G   H   A   B   C
0  41  49  13  36  57  28  12  82  18  67
1  72  91  34  17  12   6  67  98  36  25

You can access each group in a loop:

for source_index, df in result.groupby(axis=1, level=0):
    print(df)
    0                
    A   B   C   D   E
0  41  49  13  36  57
1  72  91  34  17  12
    1                
    G   H   A   B   C
0  28  12  82  18  67
1   6  67  98  36  25

or individually:

gb = result.groupby(axis=1, level=0)
first_group = gb.get_group(0)
print(first_group)
    0                
    A   B   C   D   E
0  41  49  13  36  57
1  72  91  34  17  12
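As a side note (not part of the answer above): with MultiIndex columns you don't even need groupby to pick out one source. Indexing the result by a top-level key returns that source's sub-frame with the key level dropped. A minimal sketch, assuming the same df1/df2 setup as above:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(data=np.random.randint(0, 100, (2, 5)),
                   columns=list('ABCDE'))
df2 = pd.DataFrame(data=np.random.randint(0, 100, (2, 5)),
                   columns=list('GHABC'))
result = pd.concat([df1, df2], axis=1, keys=[0, 1])

# Selecting the top-level key returns that source's columns,
# with the key level dropped from the column index:
first = result[0]
print(list(first.columns))  # ['A', 'B', 'C', 'D', 'E']
```

Because the horizontal concat preserves dtypes, `result[0]` is identical to the original `df1`.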


Georgy

Here is an illustration of what @QuangHoang proposed in their comment:

import numpy as np
import pandas as pd


df1 = pd.DataFrame(data=np.random.randint(0, 100, (2, 5)),
                   columns=list('ABCDE'))
df2 = pd.DataFrame(data=np.random.randint(0, 100, (2, 5)),
                   columns=list('GHABC'))
dfs = [df1, df2]
for source_index, df in enumerate(dfs):
    df['data_source'] = source_index
result = pd.concat(dfs)
print(result)

This will give all your dataframes joined vertically:

    A   B   C     D     E     G     H  data_source
0  66  52  16  73.0  59.0   NaN   NaN            0
1  73  64  59  31.0  13.0   NaN   NaN            0
0  72  79  45   NaN   NaN  30.0   0.0            1
1  45  52  40   NaN   NaN   2.0  80.0            1

To access each group in a loop you can do:

for source_index, df in result.groupby('data_source'):
    print(df.dropna(axis=1))
    A   B   C     D     E  data_source
0  66  52  16  73.0  59.0            0
1  73  64  59  31.0  13.0            0
    A   B   C     G     H  data_source
0  72  79  45  30.0   0.0            1
1  45  52  40   2.0  80.0            1

or by an index of the source:

gb = result.groupby('data_source')
source_index = 0
first_group = gb.get_group(source_index).dropna(axis=1)
print(first_group)
    A   B   C     D     E  data_source
0  66  52  16  73.0  59.0            0
1  73  64  59  31.0  13.0            0
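If you only need one group, a plain boolean mask works as well as groupby (a small sketch, same setup as above; `dropna(axis=1)` removes the columns that are all-NaN for that source):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(data=np.random.randint(0, 100, (2, 5)),
                   columns=list('ABCDE'))
df2 = pd.DataFrame(data=np.random.randint(0, 100, (2, 5)),
                   columns=list('GHABC'))
dfs = [df1, df2]
for source_index, df in enumerate(dfs):
    df['data_source'] = source_index
result = pd.concat(dfs)

# Boolean mask instead of groupby().get_group():
first_group = result[result['data_source'] == 0].dropna(axis=1)
print(sorted(first_group.columns))  # ['A', 'B', 'C', 'D', 'E', 'data_source']
```

Note that 'D' and 'E' still come back as float here: the vertical concat introduced NaN in those columns, which forces the dtype change mentioned in the other answer.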
Georgy
  • Thanks for the detailed explanation. My data sets are large and the number of columns they have in common is not a lot (compared to the total number of columns). I am not sure if I should create a large data frame with lots of NaN. – Laleh Apr 29 '19 at 14:17
  • In addition, the columns I am reading are all going to be used for training a model, so it still makes sense to keep them joined horizontally. – Laleh Apr 29 '19 at 14:18