
How can I join two RDDs on the item_id column?

    RDD1 = spark.createDataFrame([('45QNN', 867),
                                  ('45QNN', 867),
                                  ('45QNN', 900)],
                                 ['id', 'item_id'])
    RDD1 = RDD1.rdd

    RDD2 = spark.createDataFrame([('867', 229000, 'house', 90),
                                  ('900', 350000, 'apartment', 120)],
                                 ['item_id', 'amount', 'parent', 'size'])
    RDD2 = RDD2.rdd

As suggested in [How do you perform basic joins of two RDD tables in Spark using Python?](https://stackoverflow.com/questions/31257077/how-do-you-perform-basic-joins-of-two-rdd-tables-in-spark-using-python), I tried the following, but I get an empty data set:

    innerJoinedRdd = RDD1.join(RDD2)

or

    RDD2.join(RDD1, RDD1("item_id") == RDD2("item_id")).take(5)

I need all the columns except `parent`. Please help.

melik
  • Possible duplicate of [How do you perform basic joins of two RDD tables in Spark using Python?](https://stackoverflow.com/questions/31257077/how-do-you-perform-basic-joins-of-two-rdd-tables-in-spark-using-python) – pault Sep 05 '19 at 19:12
  • I did try the solution, but it did not work: `rdd1.join(rdd2)` – melik Sep 05 '19 at 19:14
  • `join` assumes the first "column" is the join key. You have to map your first `rdd` to change the order of elements. Try `rdd1.map(lambda x: (x[1], x[0])).join(rdd2)`. Also why are you using `rdd`? Just use the DataFrame: `df1.join(df2, on="item_id")` – pault Sep 05 '19 at 19:17
  • Thanks, but it still did not work; `[]` is what I got. I have a big data set, so I have to use RDDs – melik Sep 05 '19 at 19:20
  • In one rdd you have a string and in the other it's an int. Try: `rdd1.map(lambda x: (str(x[1]), x[0])).join(rdd2)`. *I have a big data set that I have to use RDD* – that makes no sense – [spark dataframes are more efficient than `rdd` for big data.](https://stackoverflow.com/questions/31508083/difference-between-dataframe-dataset-and-rdd-in-spark) – pault Sep 05 '19 at 19:38
  • Thanks man, I really appreciate that – melik Sep 05 '19 at 19:52
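
Putting the comments together, a minimal sketch of a working join might look like the following (the names `keyed1`, `keyed2`, `joined`, `df1`, and `df2` are illustrative, not from the question; it assumes the `RDD1` and `RDD2` defined above):

    # RDD.join treats each element as a (key, value) pair and joins on the key,
    # so re-key both RDDs by item_id and make the key types match
    # (item_id is an int in RDD1 but a string in RDD2).
    keyed1 = RDD1.map(lambda row: (str(row['item_id']), row['id']))
    # keep amount and size, drop parent
    keyed2 = RDD2.map(lambda row: (row['item_id'], (row['amount'], row['size'])))
    joined = keyed1.join(keyed2)
    # e.g. [('867', ('45QNN', (229000, 90))), ('900', ('45QNN', (350000, 120))), ...]

If the data is still available as DataFrames (before the `.rdd` conversion), the DataFrame join suggested in the comments is simpler; `df1`/`df2` below stand in for those original DataFrames:

    # cast item_id to a common type, join on it, and drop the parent column
    result = df1.join(df2.withColumn('item_id', df2['item_id'].cast('int')),
                      on='item_id').drop('parent')
    result.show()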
