
I have this join in a PySpark script.

    d = d.join(p, [
        d.p_hash == p.hash,
        d.dy >= p.mindy,
        d.dy <= p.maxdy,
    ], "left") \
    .drop(p.hash) \
    .drop(p.mindy) \
    .drop(p.maxdy)

The variables `d` and `p` are Spark DataFrames. Is there any way I could do this in pandas?

samkart

1 Answer


Yes, you can do the merge, filter the DataFrame with your range conditions, and then drop the unwanted columns.

    d = d.merge(p, left_on=['p_hash'], right_on=['hash'], how='left')
    d = d[(d['dy'] >= d['mindy']) & (d['dy'] <= d['maxdy'])]
    d = d.drop(['hash', 'mindy', 'maxdy'], axis=1)

Note that merge in pandas isn't quite like join in PySpark: it doesn't support conditional (non-equi) joins, so the range conditions have to be applied as a separate filter after the merge.
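One caveat with the filter-then-drop approach: after a left merge, rows of `d` with no matching hash get NaN in `mindy`/`maxdy`, and the comparison filter then removes them, so the result behaves like an inner join rather than the left join in your original PySpark code. If that matters, a rough workaround (a minimal sketch with made-up data, assuming each `p_hash` matches at most one row of `p`) is to blank out the columns that came from `p` instead of dropping the rows:

    import numpy as np
    import pandas as pd

    # Toy frames using the column names from the question (values are made up).
    d = pd.DataFrame({"p_hash": ["a", "b", "c"], "dy": [5, 20, 7]})
    p = pd.DataFrame({"hash": ["a", "b"], "mindy": [1, 30], "maxdy": [10, 40]})

    # Equi-join on the hash columns; a left merge keeps every row of d.
    m = d.merge(p, left_on="p_hash", right_on="hash", how="left")

    # Apply the range condition afterwards. Rows that matched on hash but fail
    # the range check should look like non-matches, so blank out p's columns
    # instead of dropping the rows -- this preserves the left-join behaviour.
    ok = (m["dy"] >= m["mindy"]) & (m["dy"] <= m["maxdy"])
    m.loc[~ok, list(p.columns)] = np.nan

    # Drop the helper columns, as in the original PySpark snippet.
    result = m.drop(columns=["hash", "mindy", "maxdy"])

If a hash can match several rows of `p`, you would also need to deduplicate the non-matching rows afterwards; the linked answers below cover fuller workarounds.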

You can also review answers from here: How to do/workaround a conditional join in python Pandas?

Emmanuel Murairi