How to pd.merge(..., on="column", ...) when processing data in chunks?

Question

I want to merge two dataframes, however, the right dataframe has to be processed in chunks due to size. From the second iteration (i.e. when merging chunk2 into df) merge creates additional columns (see MWE), however, I want to merge into the old columns.

Please note that the (date) integer in column A is not a unique index in df.

import pandas as pd

df = pd.DataFrame({'A': [20170801, 20170801, 20170802, 20170901],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                'C': ['C0', 'C1', 'C2', 'C3'],
                'D': ['D0', 'D1', 'D2', 'D3']},
                index=[0, 1, 2, 3])

chunk1 = pd.DataFrame({'A': [20170801, 20170802, 4, 4],
                'E': ['B4', 'B5', 'B6', 'B7'],
                'F': ['C4', 'C5', 'C6', 'C7'],
                'G': ['D4', 'D5', 'D6', 'D7']},
                 index=[0, 1, 2, 3])

chunk2 = pd.DataFrame({'A': [20170901, 67, 68, 69],
                'E': ['B4', 'B5', 'B6', 'B7'],
                'F': ['C4', 'C5', 'C6', 'C7'],
                'G': ['D4', 'D5', 'D6', 'D7']},
                 index=[0, 1, 2, 3])

df = df.merge(chunk1, on='A', how='left')
print(df)

      A   B   C   D    E    F    G
0  20170801  B0  C0  D0   B4   C4   D4
1  20170801  B1  C1  D1   B4   C4   D4
2  20170802  B2  C2  D2   B5   C5   D5
3  20170901  B3  C3  D3  NaN  NaN  NaN

df = df.merge(chunk2, on='A', how='left')
print(df)

          A   B   C   D  E_x  F_x  G_x  E_y  F_y  G_y
0  20170801  B0  C0  D0   B4   C4   D4  NaN  NaN  NaN
1  20170801  B1  C1  D1   B4   C4   D4  NaN  NaN  NaN
2  20170802  B2  C2  D2   B5   C5   D5  NaN  NaN  NaN
3  20170901  B3  C3  D3  NaN  NaN  NaN   B4   C4   D4

The output should look like this:

      A   B   C   D    E    F    G
0  20170801  B0  C0  D0   B4   C4   D4
1  20170801  B1  C1  D1   B4   C4   D4
2  20170802  B2  C2  D2   B5   C5   D5
3  20170901  B3  C3  D3   B4   C4   D4

@jezrael, please consider reopening the question - your link did not provide a solution (or I was not able to find it there.). — CFW, Jan 14 '19 at 15:06

score 0 · Answer 1 · answered Jan 14 '19 at 17:03

merge allows to change the default suffix of the overlapping columns. Once this is done, you just override the NaN values and drop the now useless columns. It takes some extra steps but is simple.

So it comes to:

df = df.merge(chunk1, on='A', how='left')    # merge chunk1
df = df.merge(chunk2, on='A', how='left', suffixes = ('', '_x'))   # for chunk2, tweak column names

print(df)      # control

          A   B   C   D    E    F    G  E_x  F_x  G_x
0  20170801  B0  C0  D0   B4   C4   D4  NaN  NaN  NaN
1  20170801  B1  C1  D1   B4   C4   D4  NaN  NaN  NaN
2  20170802  B2  C2  D2   B5   C5   D5  NaN  NaN  NaN
3  20170901  B3  C3  D3  NaN  NaN  NaN   B4   C4   D4

# override NaN in columns E, F, G
for col in (list('EFG')):
    col2 = col+'_x'
    df[col] = df.apply(lambda x: x[col2] if x[col] is numpy.NaN else x[col],
                         axis=1)


print(df)   # control

          A   B   C   D   E   F   G  E_x  F_x  G_x
0  20170801  B0  C0  D0  B4  C4  D4  NaN  NaN  NaN
1  20170801  B1  C1  D1  B4  C4  D4  NaN  NaN  NaN
2  20170802  B2  C2  D2  B5  C5  D5  NaN  NaN  NaN
3  20170901  B3  C3  D3  B4  C4  D4   B4   C4   D4

df.drop(columns=['E_x', 'F_x', 'G_x'], inplace = True)    # drop now useless columns

print(df)    # as expected

          A   B   C   D   E   F   G
0  20170801  B0  C0  D0  B4  C4  D4
1  20170801  B1  C1  D1  B4  C4  D4
2  20170802  B2  C2  D2  B5  C5  D5
3  20170901  B3  C3  D3  B4  C4  D4

No need for `apply` loop. Vectorized `numpy.where` or `pandas.Series.where` can work or even `combine_first`. OP's needs are similar to SQL's coalesce. See: https://stackoverflow.com/questions/38152389/coalesce-values-from-2-columns-into-a-single-column-in-a-pandas-dataframe. — Parfait, Jan 14 '19 at 17:28

How to pd.merge(..., on="column", ...) when processing data in chunks?

1 Answers1