I want to merge two dataframes, however, the right dataframe has to be processed in chunks due to size. From the second iteration (i.e. when merging chunk2 into df) merge creates additional columns (see MWE), however, I want to merge into the old columns.
Please note that the (date) integer in column A is not a unique index in df.
import pandas as pd
df = pd.DataFrame({'A': [20170801, 20170801, 20170802, 20170901],
'B': ['B0', 'B1', 'B2', 'B3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']},
index=[0, 1, 2, 3])
chunk1 = pd.DataFrame({'A': [20170801, 20170802, 4, 4],
'E': ['B4', 'B5', 'B6', 'B7'],
'F': ['C4', 'C5', 'C6', 'C7'],
'G': ['D4', 'D5', 'D6', 'D7']},
index=[0, 1, 2, 3])
chunk2 = pd.DataFrame({'A': [20170901, 67, 68, 69],
'E': ['B4', 'B5', 'B6', 'B7'],
'F': ['C4', 'C5', 'C6', 'C7'],
'G': ['D4', 'D5', 'D6', 'D7']},
index=[0, 1, 2, 3])
df = df.merge(chunk1, on='A', how='left')
print(df)
A B C D E F G
0 20170801 B0 C0 D0 B4 C4 D4
1 20170801 B1 C1 D1 B4 C4 D4
2 20170802 B2 C2 D2 B5 C5 D5
3 20170901 B3 C3 D3 NaN NaN NaN
df = df.merge(chunk2, on='A', how='left')
print(df)
A B C D E_x F_x G_x E_y F_y G_y
0 20170801 B0 C0 D0 B4 C4 D4 NaN NaN NaN
1 20170801 B1 C1 D1 B4 C4 D4 NaN NaN NaN
2 20170802 B2 C2 D2 B5 C5 D5 NaN NaN NaN
3 20170901 B3 C3 D3 NaN NaN NaN B4 C4 D4
The output should look like this:
A B C D E F G
0 20170801 B0 C0 D0 B4 C4 D4
1 20170801 B1 C1 D1 B4 C4 D4
2 20170802 B2 C2 D2 B5 C5 D5
3 20170901 B3 C3 D3 B4 C4 D4