
I have this join in a PySpark script.

    d = d.join(p, [
        d.p_hash == p.hash,
        d.dy >= p.mindy,
        d.dy <= p.maxdy,
    ], "left") \
    .drop(p.hash) \
    .drop(p.mindy) \
    .drop(p.maxdy)

The variables `d` and `p` are Spark DataFrames. Is there any way I could do this in pandas?

samkart

1 Answer


Yes, you can do the merge, filter the DataFrame with your range conditions, and then drop the unwanted columns.

    d = d.merge(p, left_on=['p_hash'], right_on=['hash'], how='left')
    d = d[(d['dy'] >= d['mindy']) & (d['dy'] <= d['maxdy'])]
    d = d.drop(['hash', 'mindy', 'maxdy'], axis=1)

Note that merge in pandas isn't quite like join in PySpark: it doesn't support conditional (non-equi) joins, so the range conditions have to be applied as a separate filter after the merge.
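One caveat with the filter-then-drop approach: after a left merge, rows of `d` with no matching hash get NaN in `mindy`/`maxdy`, and the comparison filter then removes them, so the result behaves like an inner join rather than the left join in your original PySpark code. If that matters, a rough workaround (a minimal sketch with made-up data, assuming each `p_hash` matches at most one row of `p`) is to blank out the columns that came from `p` instead of dropping the rows:

    import numpy as np
    import pandas as pd

    # Toy frames using the column names from the question (values are made up).
    d = pd.DataFrame({"p_hash": ["a", "b", "c"], "dy": [5, 20, 7]})
    p = pd.DataFrame({"hash": ["a", "b"], "mindy": [1, 30], "maxdy": [10, 40]})

    # Equi-join on the hash columns; a left merge keeps every row of d.
    m = d.merge(p, left_on="p_hash", right_on="hash", how="left")

    # Apply the range condition afterwards. Rows that matched on hash but fail
    # the range check should look like non-matches, so blank out p's columns
    # instead of dropping the rows -- this preserves the left-join behaviour.
    ok = (m["dy"] >= m["mindy"]) & (m["dy"] <= m["maxdy"])
    m.loc[~ok, list(p.columns)] = np.nan

    # Drop the helper columns, as in the original PySpark snippet.
    result = m.drop(columns=["hash", "mindy", "maxdy"])

If a hash can match several rows of `p`, you would also need to deduplicate the non-matching rows afterwards; the linked answers below cover fuller workarounds.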

You can also review answers from here: How to do/workaround a conditional join in python Pandas?

Emmanuel Murairi