Pyspark filter with a column from a different dataframe

Question

I would like to keep only the rows of price whose Id exists in the events DataFrame. My code is below, but it does not work in PySpark. How can I fix this?

events = spark.createDataFrame([(657,'Conferences'),
                          (765, 'Seminars '),
                          (776, 'Meetings'),
                          (879, 'Conferences'),
                          (765, 'Meetings'),
                          (879, 'Seminars'),
                          (985, 'Meetings'),
                          (879, 'Meetings'),
                          (657, 'Seminars'),
                          (657,'Conferences')]
                         ,['Id', 'event_name'])
events.show()
price = spark.createDataFrame([(657,10),
                          (879,45),
                          (776,54),
                          (879,45),
                          (765, 65)]
                         ,['Id','Price'])


# Attempted filter (not working): isin() is given a column that belongs to a different DataFrame
price[price.Id.isin(events.Id)].show()

Possible duplicate of [What are the various join types in Spark?](https://stackoverflow.com/questions/45990633/what-are-the-various-join-types-in-spark) — pault, Jun 26 '19 at 18:45

score 0 · Answer 1 · answered Jun 26 '19 at 17:23

A simple inner join on Id returns only the prices for the Ids present in the events table; distinct() then drops the duplicate rows caused by repeated Ids:

events.join(price, "Id").select("Id", "Price").distinct().show()

Vincent
