Compare elements in two dataframes and the Fill missing ones with nan in larger dataframe

Question

I have two data frames of different sizes with repeating values. I want to implement the following in Python:

This is first dataframe:

this is second dataframe:

   col1  col2
0     1     2
1     2    23
2     4    45
3     2    67
4     3    12
5     5    91
6     1    21
7     2    23
8     5    43

This is the output which I wish to get

dupe - https://stackoverflow.com/questions/66096921/merge-columns-with-different-number-of-rows-based-on-two-first-columns-in-pandas/66097041#66097041 — jezrael, Aug 30 '23 at 13:09
Please focus your question on the specific problem you encountered when trying yourself. — Yunnosch, Aug 30 '23 at 13:09

mozway · Answer 1 · 2023-08-30T13:16:52.153

You can perform a merge with de-duplication, using diff and cumsum to de-duplicate:

out = df1.merge(df2,
                left_on=['col1', df1['col1'].diff().le(0).cumsum()],
                right_on=['col1', df2['col1'].diff().le(0).cumsum()],
                how='left').drop(columns=['key_1'])

Output:

    col1  col2
0      1   2.0
1      2  23.0
2      3   NaN
3      4  45.0
4      5   NaN
5      1   NaN
6      2  67.0
7      3  12.0
8      4   NaN
9      5  91.0
10     1  21.0
11     2  23.0
12     3   NaN
13     4   NaN
14     5  43.0

Intermediates:

# df1 de-duplication
    col1  diff  le(0)  cumsum
0      1   NaN  False       0
1      2   1.0  False       0
2      3   1.0  False       0
3      4   1.0  False       0
4      5   1.0  False       0
5      1  -4.0   True       1
6      2   1.0  False       1
7      3   1.0  False       1
8      4   1.0  False       1
9      5   1.0  False       1
10     1  -4.0   True       2
11     2   1.0  False       2
12     3   1.0  False       2
13     4   1.0  False       2
14     5   1.0  False       2

# df2 de-duplication
   col1  col2  diff  le(0)  cumsum
0     1     2   NaN  False       0
1     2    23   1.0  False       0
2     4    45   2.0  False       0
3     2    67  -2.0   True       1
4     3    12   1.0  False       1
5     5    91   2.0  False       1
6     1    21  -4.0   True       2
7     2    23   1.0  False       2
8     5    43   3.0  False       2

# output
    col1  key_1  col2
0      1      0   2.0
1      2      0  23.0
2      3      0   NaN
3      4      0  45.0
4      5      0   NaN
5      1      1   NaN
6      2      1  67.0
7      3      1  12.0
8      4      1   NaN
9      5      1  91.0
10     1      2  21.0
11     2      2  23.0
12     3      2   NaN
13     4      2   NaN
14     5      2  43.0

Why reopened dupe? https://stackoverflow.com/questions/66096921/merge-columns-with-different-number-of-rows-based-on-two-first-columns-in-pandas/66097041#66097041 — jezrael, Aug 30 '23 at 13:07
Because it's not the same logic, and was fully different from the [initial linked Q/A](https://stackoverflow.com/q/53010406/16343464) — mozway, Aug 30 '23 at 13:08

Compare elements in two dataframes and the Fill missing ones with nan in larger dataframe

1 Answers1