-1

I have two data frames of different sizes with repeating values. I want to implement the following in Python:

This is first dataframe:

    col1
0      1
1      2
2      3
3      4
4      5
5      1
6      2
7      3
8      4
9      5
10     1
11     2
12     3
13     4
14     5

this is second dataframe:

   col1  col2
0     1     2
1     2    23
2     4    45
3     2    67
4     3    12
5     5    91
6     1    21
7     2    23
8     5    43

This is the output which I wish to get

mozway
  • 194,879
  • 13
  • 39
  • 75
  • dupe - https://stackoverflow.com/questions/66096921/merge-columns-with-different-number-of-rows-based-on-two-first-columns-in-pandas/66097041#66097041 – jezrael Aug 30 '23 at 13:09
  • Please focus your question on the specific problem you encountered when trying yourself. – Yunnosch Aug 30 '23 at 13:09

1 Answers1

2

You can perform a merge with de-duplication, using diff and cumsum to de-duplicate:

out = df1.merge(df2,
                left_on=['col1', df1['col1'].diff().le(0).cumsum()],
                right_on=['col1', df2['col1'].diff().le(0).cumsum()],
                how='left').drop(columns=['key_1'])

Output:

    col1  col2
0      1   2.0
1      2  23.0
2      3   NaN
3      4  45.0
4      5   NaN
5      1   NaN
6      2  67.0
7      3  12.0
8      4   NaN
9      5  91.0
10     1  21.0
11     2  23.0
12     3   NaN
13     4   NaN
14     5  43.0

Intermediates:

# df1 de-duplication
    col1  diff  le(0)  cumsum
0      1   NaN  False       0
1      2   1.0  False       0
2      3   1.0  False       0
3      4   1.0  False       0
4      5   1.0  False       0
5      1  -4.0   True       1
6      2   1.0  False       1
7      3   1.0  False       1
8      4   1.0  False       1
9      5   1.0  False       1
10     1  -4.0   True       2
11     2   1.0  False       2
12     3   1.0  False       2
13     4   1.0  False       2
14     5   1.0  False       2

# df2 de-duplication
   col1  col2  diff  le(0)  cumsum
0     1     2   NaN  False       0
1     2    23   1.0  False       0
2     4    45   2.0  False       0
3     2    67  -2.0   True       1
4     3    12   1.0  False       1
5     5    91   2.0  False       1
6     1    21  -4.0   True       2
7     2    23   1.0  False       2
8     5    43   3.0  False       2

# output
    col1  key_1  col2
0      1      0   2.0
1      2      0  23.0
2      3      0   NaN
3      4      0  45.0
4      5      0   NaN
5      1      1   NaN
6      2      1  67.0
7      3      1  12.0
8      4      1   NaN
9      5      1  91.0
10     1      2  21.0
11     2      2  23.0
12     3      2   NaN
13     4      2   NaN
14     5      2  43.0
mozway
  • 194,879
  • 13
  • 39
  • 75
  • Why reopened dupe? https://stackoverflow.com/questions/66096921/merge-columns-with-different-number-of-rows-based-on-two-first-columns-in-pandas/66097041#66097041 – jezrael Aug 30 '23 at 13:07
  • Because it's not the same logic, and was fully different from the [initial linked Q/A](https://stackoverflow.com/q/53010406/16343464) – mozway Aug 30 '23 at 13:08
  • This situation need comment, not reopen. – jezrael Aug 30 '23 at 13:09