
I am merging several data frames into one data frame in a for loop, something like this:

import pandas as pd
import numpy as np

df1 = pd.DataFrame(data=np.random.randint(0, 100, (2, 5)), columns=list('ABCDE'))
df2 = pd.DataFrame(data=np.random.randint(0, 100, (2, 5)), columns=list('GHABC'))
df1 = df1.merge(df2, how='outer', left_index=True, right_index=True,
                suffixes=('', '_second'))

Several columns in these data frames have the same names, so each time there is such a column I add a suffix. It is becoming a mess. In addition, I would like to be able to quickly access all the columns from the first table, the second one, etc. Is there a way to merge these data frames but keep the columns grouped by source, so that I don't need to rename columns and can access all the columns of each data set more easily?

Laleh

2 Answers


Here is a way to merge the dataframes horizontally with the help of a MultiIndex, which has some advantages compared to a vertical merge. For example, you won't have lots of NaN fields, and the dtypes won't change from int to float as they do in the case of a vertical merge.

import numpy as np
import pandas as pd


df1 = pd.DataFrame(data=np.random.randint(0, 100, (2, 5)),
                   columns=list('ABCDE'))
df2 = pd.DataFrame(data=np.random.randint(0, 100, (2, 5)),
                   columns=list('GHABC'))
dfs = [df1, df2]
result = pd.concat(dfs, axis=1, keys=range(len(dfs)))
print(result)

This will give:

    0                   1                
    A   B   C   D   E   G   H   A   B   C
0  41  49  13  36  57  28  12  82  18  67
1  72  91  34  17  12   6  67  98  36  25

You can access each group in a loop:

for source_index, df in result.groupby(axis=1, level=0):
    print(df)
    0                
    A   B   C   D   E
0  41  49  13  36  57
1  72  91  34  17  12
    1                
    G   H   A   B   C
0  28  12  82  18  67
1   6  67  98  36  25

or individually:

gb = result.groupby(axis=1, level=0)
first_group = gb.get_group(0)
print(first_group)
    0                
    A   B   C   D   E
0  41  49  13  36  57
1  72  91  34  17  12
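As a side note (not part of the answer above): with MultiIndex columns you don't even need groupby to pick out one source. Indexing the result by a top-level key returns that source's sub-frame with the key level dropped. A minimal sketch, assuming the same df1/df2 setup as above:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(data=np.random.randint(0, 100, (2, 5)),
                   columns=list('ABCDE'))
df2 = pd.DataFrame(data=np.random.randint(0, 100, (2, 5)),
                   columns=list('GHABC'))
result = pd.concat([df1, df2], axis=1, keys=[0, 1])

# Selecting the top-level key returns that source's columns,
# with the key level dropped from the column index:
first = result[0]
print(list(first.columns))  # ['A', 'B', 'C', 'D', 'E']
```

Because the horizontal concat preserves dtypes, `result[0]` is identical to the original `df1`.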


Georgy

Here is an illustration of what @QuangHoang proposed in their comment:

import numpy as np
import pandas as pd


df1 = pd.DataFrame(data=np.random.randint(0, 100, (2, 5)),
                   columns=list('ABCDE'))
df2 = pd.DataFrame(data=np.random.randint(0, 100, (2, 5)),
                   columns=list('GHABC'))
dfs = [df1, df2]
for source_index, df in enumerate(dfs):
    df['data_source'] = source_index
result = pd.concat(dfs)
print(result)

This will give all your dataframes joined vertically:

    A   B   C     D     E     G     H  data_source
0  66  52  16  73.0  59.0   NaN   NaN            0
1  73  64  59  31.0  13.0   NaN   NaN            0
0  72  79  45   NaN   NaN  30.0   0.0            1
1  45  52  40   NaN   NaN   2.0  80.0            1

To access each group in a loop you can do:

for source_index, df in result.groupby('data_source'):
    print(df.dropna(axis=1))
    A   B   C     D     E  data_source
0  66  52  16  73.0  59.0            0
1  73  64  59  31.0  13.0            0
    A   B   C     G     H  data_source
0  72  79  45  30.0   0.0            1
1  45  52  40   2.0  80.0            1

or by an index of the source:

gb = result.groupby('data_source')
source_index = 0
first_group = gb.get_group(source_index).dropna(axis=1)
print(first_group)
    A   B   C     D     E  data_source
0  66  52  16  73.0  59.0            0
1  73  64  59  31.0  13.0            0
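If you only need one group, a plain boolean mask works as well as groupby (a small sketch, same setup as above; `dropna(axis=1)` removes the columns that are all-NaN for that source):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(data=np.random.randint(0, 100, (2, 5)),
                   columns=list('ABCDE'))
df2 = pd.DataFrame(data=np.random.randint(0, 100, (2, 5)),
                   columns=list('GHABC'))
dfs = [df1, df2]
for source_index, df in enumerate(dfs):
    df['data_source'] = source_index
result = pd.concat(dfs)

# Boolean mask instead of groupby().get_group():
first_group = result[result['data_source'] == 0].dropna(axis=1)
print(sorted(first_group.columns))  # ['A', 'B', 'C', 'D', 'E', 'data_source']
```

Note that 'D' and 'E' still come back as float here: the vertical concat introduced NaN in those columns, which forces the dtype change mentioned in the other answer.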
Georgy
  • Thanks for the detailed explanation. My data sets are large and the number of columns they have in common is not a lot (compared to the total number of columns). I am not sure if I should create a large data frame with lots of NaN. – Laleh Apr 29 '19 at 14:17
  • In addition, the columns I am reading are all going to be used for training a model, so it still makes sense to keep them joined horizontally. – Laleh Apr 29 '19 at 14:18