
I have two dataframes with different sizes (df1 is larger than df2):

df1:
column 1 (contains duplicates)  column 2 (unique)
A                                B
A                                C
G                                H

df2
column 3 (contains duplicates)  column 4
A                               1
A                               1
G                               0

I want to create a new dataframe where column 3 is equal to column 1 (including duplicate matches), and I want to grab the values from column 2 and column 4. I want my resulting df to look like this:

column 2   column 4
B            1
C            1
H            0

what is the fastest way to implement this on millions of rows?

Sina

1 Answer


You can simply merge them on these key columns and deduplicate the result:

df1.merge(df2, right_on='column 3', left_on='column 1')[['column 2', 'column 4']].drop_duplicates()

Result will look like:

column 2 column 4
0 B 1
2 C 1
4 H 0
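For reference, a minimal runnable sketch reproducing the result above, using the column names from the question:

```python
import pandas as pd

df1 = pd.DataFrame({'column 1': ['A', 'A', 'G'],
                    'column 2': ['B', 'C', 'H']})
df2 = pd.DataFrame({'column 3': ['A', 'A', 'G'],
                    'column 4': [1, 1, 0]})

# Inner merge on the key columns, then keep only the wanted columns
# and drop the duplicate rows produced by the many-to-many join.
out = (df1.merge(df2, left_on='column 1', right_on='column 3')
          [['column 2', 'column 4']]
          .drop_duplicates())
print(out)
```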
Hamza
  • This is not efficient when dealing with millions of rows. The method is sound though... – Sina May 26 '22 at 15:10
  • Merging IS the most efficient pandas way for these types of use cases. If your dataset has millions of rows then you might need to look for alternatives like `dask` or apply some optimizations as in here: https://stackoverflow.com/a/64976482/7212929 – Hamza May 26 '22 at 15:57
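One common optimization for large merges on repeated string keys (a sketch, not necessarily what the linked answer does; the column names and data here are made up): casting the join keys to `category` dtype before merging, which stores each distinct key once and can cut memory use substantially.

```python
import pandas as pd
import numpy as np

# Synthetic data for illustration: many rows, few distinct keys.
n = 100_000
rng = np.random.default_rng(0)
keys = rng.choice(list('ABCDEFGH'), size=n)
df1 = pd.DataFrame({'column 1': keys, 'column 2': np.arange(n)})
df2 = pd.DataFrame({'column 3': list('ABCDEFGH'), 'column 4': range(8)})

# Cast both key columns to the SAME categorical dtype so pandas can
# merge on integer codes instead of comparing strings.
key_dtype = pd.CategoricalDtype(list('ABCDEFGH'))
df1['column 1'] = df1['column 1'].astype(key_dtype)
df2['column 3'] = df2['column 3'].astype(key_dtype)

out = df1.merge(df2, left_on='column 1', right_on='column 3')
```

Every key in df1 exists in df2, so the merged frame keeps all n rows while the key columns occupy a fraction of their original memory.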