-3

I have two tables, A and B, each with hundreds of columns. I am trying to apply a left outer join on the two tables, but their keys have different column names. I created a new column in B with the same name as the key in A and was then able to apply the left outer join. However, how do I join the two tables if I am unable to make the column names consistent?

This is what I have tried:

from pyspark.sql.functions import col

a = spark.table('a').rdd

b = spark.table('b')
b = b.withColumn("acct_id", col("id"))
b = b.rdd

a.leftOuterJoin(b).collect()
user1584253
  • Possible duplicate of [Join two DataFrames where the join key is different and only select some columns](https://stackoverflow.com/questions/49685474/join-two-dataframes-where-the-join-key-is-different-and-only-select-some-columns) – pault Apr 09 '19 at 15:24
  • I want to join using RDD not spark dataframes – user1584253 Apr 09 '19 at 15:25
  • Each record in an `rdd` is a tuple where the first entry is the key. When you call join, it does so on the keys. So if you want to join on a specific column, you need to map your records so the join column is first. It's hard to explain in more detail without a [reproducible example](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-examples). – pault Apr 09 '19 at 15:29
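
Following that last comment, here is a minimal sketch of an RDD-based left outer join, assuming (as in the question's code) that A's key column is acct_id and B's is id; the column names never need to match, because the join operates on the first element of each pair:

a = spark.table('a')
b = spark.table('b')

# Map each row to a (key, value) pair so the join column is the first element;
# RDD joins always match on that first element, not on column names.
a_kv = a.rdd.map(lambda row: (row['acct_id'], row))
b_kv = b.rdd.map(lambda row: (row['id'], row))

# Each result is (key, (row_from_a, row_from_b_or_None))
joined = a_kv.leftOuterJoin(b_kv)
joined.take(5)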

2 Answers

1

If you already have DataFrames, why are you creating RDDs from them? Is there any specific need?

Try the command below on the DataFrames -

a.join(b,  a.column_name==b.column_name, 'left').show()

Here are a few commands you can use to investigate your DataFrame:

## Get the column names of the DataFrame
a.columns

## Get the column names along with their data types
a.dtypes

## Check the type of an object (e.g. DataFrame, RDD, etc.)
type(a)
Shantanu Sharma
  • I think RDD is much faster than dataframe. Here is the reference: https://community.hortonworks.com/articles/42027/rdd-vs-dataframe-vs-sparksql.html – user1584253 Apr 09 '19 at 15:22
  • @user1584253 that article is from 3 years ago on spark 1.6 – pault Apr 09 '19 at 15:25
  • DataFrames carry additional metadata, which allows Spark to run certain optimizations at execution time. Given the way Spark has evolved in recent versions, it's recommended to use DataFrames unless you have a very specific requirement and you think your RDD-based code is more optimized than the execution plan Spark creates for your DataFrame-based code. – Shantanu Sharma Apr 09 '19 at 15:32
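
To see the optimizations that comment refers to, you can print the plan Spark builds for a DataFrame join; a small sketch, assuming a and b are the Spark DataFrames from the question with keys acct_id and id:

# Spark's Catalyst optimizer plans the DataFrame join; explain() prints that plan,
# which is what hand-written RDD code would have to beat.
a.join(b, a['acct_id'] == b['id'], 'left').explain()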
0

DataFrames are faster than RDDs, and you already have DataFrames, so I suggest:

from pyspark.sql.functions import col

a = spark.table('a')
b = spark.table('b').withColumn("acct_id", col("id"))

result = a.join(b, on='acct_id', how='left').rdd
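
If you cannot rename the column at all (the situation the question asks about), the join condition can reference the two differently named keys directly; a sketch assuming the keys are acct_id in a and id in b:

a = spark.table('a')
b = spark.table('b')

# No renaming needed: express the key equality across the two column names
result = a.join(b, a['acct_id'] == b['id'], 'left')
result.show()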