-3

I have two tables, A and B, each with hundreds of columns. I am trying to apply a left outer join on the two tables, but their keys have different column names. I created a new column in B with the same name as the key in A and was then able to apply the left outer join. However, how do I join the two tables if I am unable to make the column names consistent?

This is what I have tried:

from pyspark.sql.functions import col

a = spark.table('a').rdd

b = spark.table('b')
b = b.withColumn("acct_id", col("id"))
b = b.rdd

a.leftOuterJoin(b).collect()
user1584253
  • Possible duplicate of [Join two DataFrames where the join key is different and only select some columns](https://stackoverflow.com/questions/49685474/join-two-dataframes-where-the-join-key-is-different-and-only-select-some-columns) – pault Apr 09 '19 at 15:24
  • I want to join using RDD not spark dataframes – user1584253 Apr 09 '19 at 15:25
  • Each record in an `rdd` is a tuple where the first entry is the key. When you call join, it does so on the keys. So if you want to join on a specific column, you need to map your records so the join column is first. It's hard to explain in more detail without a [reproducible example](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-examples). – pault Apr 09 '19 at 15:29
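
Following that last comment, here is a minimal sketch of an RDD-based left outer join, assuming (as in the question's code) that A's key column is acct_id and B's is id; the column names never need to match, because the join operates on the first element of each pair:

a = spark.table('a')
b = spark.table('b')

# Map each row to a (key, value) pair so the join column is the first element;
# RDD joins always match on that first element, not on column names.
a_kv = a.rdd.map(lambda row: (row['acct_id'], row))
b_kv = b.rdd.map(lambda row: (row['id'], row))

# Each result is (key, (row_from_a, row_from_b_or_None))
joined = a_kv.leftOuterJoin(b_kv)
joined.take(5)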

2 Answers

1

If you already have DataFrames, why are you creating RDDs from them? Is there any specific need?

Try the command below on the DataFrames -

a.join(b,  a.column_name==b.column_name, 'left').show()

Here are a few commands you can use to investigate your DataFrame:

## Get the column names of the DataFrame
a.columns

## Get the column names along with their data types
a.dtypes

## Check the type of an object (e.g. DataFrame, RDD, etc.)
type(a)
Shantanu Sharma
  • I think RDD is much faster than dataframe. Here is the reference: https://community.hortonworks.com/articles/42027/rdd-vs-dataframe-vs-sparksql.html – user1584253 Apr 09 '19 at 15:22
  • @user1584253 that article is from 3 years ago on spark 1.6 – pault Apr 09 '19 at 15:25
  • DataFrames carry additional metadata, which allows Spark to run certain optimizations at execution time. Given the way Spark has evolved in recent versions, it's recommended to use DataFrames unless you have a very specific requirement and you think your RDD-based code is more optimized than the execution plan Spark creates for your DataFrame-based code. – Shantanu Sharma Apr 09 '19 at 15:32
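
To see the optimizations that comment refers to, you can print the plan Spark builds for a DataFrame join; a small sketch, assuming a and b are the Spark DataFrames from the question with keys acct_id and id:

# Spark's Catalyst optimizer plans the DataFrame join; explain() prints that plan,
# which is what hand-written RDD code would have to beat.
a.join(b, a['acct_id'] == b['id'], 'left').explain()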
0

DataFrames are faster than RDDs, and you already have DataFrames, so I suggest:

from pyspark.sql.functions import col

a = spark.table('a')
b = spark.table('b').withColumn("acct_id", col("id"))

result = a.join(b, on='acct_id', how='left').rdd
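
If you cannot rename the column at all (the situation the question asks about), the join condition can reference the two differently named keys directly; a sketch assuming the keys are acct_id in a and id in b:

a = spark.table('a')
b = spark.table('b')

# No renaming needed: express the key equality across the two column names
result = a.join(b, a['acct_id'] == b['id'], 'left')
result.show()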