
I have two dataframes with different sizes (df1 is larger than df2):

df1:
column 1 (contains duplicates)  column 2 (unique)
A                                B
A                                C
G                                H

df2
column 3 (contains duplicates)  column 4
A                               1
A                               1
G                               0

I want to create a new dataframe where column 3 is equal to column 1 (including duplicate matches), and I want to grab the values from column 2 and column 4. I want my resulting df to look like this:

column 2   column 4
B            1
C            1
H            0

what is the fastest way to implement this on millions of rows?

Sina

1 Answer


You can simply merge them on these key columns and deduplicate the result:

df1.merge(df2, right_on='column 3', left_on='column 1')[['column 2', 'column 4']].drop_duplicates()

Result will look like:

column 2 column 4
0 B 1
2 C 1
4 H 0
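For reference, a minimal runnable sketch reproducing the result above, using the column names from the question:

```python
import pandas as pd

df1 = pd.DataFrame({'column 1': ['A', 'A', 'G'],
                    'column 2': ['B', 'C', 'H']})
df2 = pd.DataFrame({'column 3': ['A', 'A', 'G'],
                    'column 4': [1, 1, 0]})

# Inner merge on the key columns, then keep only the wanted columns
# and drop the duplicate rows produced by the many-to-many join.
out = (df1.merge(df2, left_on='column 1', right_on='column 3')
          [['column 2', 'column 4']]
          .drop_duplicates())
print(out)
```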
Hamza
  • This is not efficient when dealing with millions of rows. The method is sound though... – Sina May 26 '22 at 15:10
  • Merging IS the most efficient pandas way for these types of use cases. If your dataset has millions of rows then you might need to look for alternatives like `dask` or apply some optimizations as in here: https://stackoverflow.com/a/64976482/7212929 – Hamza May 26 '22 at 15:57
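One common optimization for large merges on repeated string keys (a sketch, not necessarily what the linked answer does; the column names and data here are made up): casting the join keys to `category` dtype before merging, which stores each distinct key once and can cut memory use substantially.

```python
import pandas as pd
import numpy as np

# Synthetic data for illustration: many rows, few distinct keys.
n = 100_000
rng = np.random.default_rng(0)
keys = rng.choice(list('ABCDEFGH'), size=n)
df1 = pd.DataFrame({'column 1': keys, 'column 2': np.arange(n)})
df2 = pd.DataFrame({'column 3': list('ABCDEFGH'), 'column 4': range(8)})

# Cast both key columns to the SAME categorical dtype so pandas can
# merge on integer codes instead of comparing strings.
key_dtype = pd.CategoricalDtype(list('ABCDEFGH'))
df1['column 1'] = df1['column 1'].astype(key_dtype)
df2['column 3'] = df2['column 3'].astype(key_dtype)

out = df1.merge(df2, left_on='column 1', right_on='column 3')
```

Every key in df1 exists in df2, so the merged frame keeps all n rows while the key columns occupy a fraction of their original memory.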