
Assume the following two DataFrames in PySpark with an equal number of rows:
df1:
 |_ Column1a
 |_ Column1b

df2:
 |_ Column2a
 |_ Column2b

I wish to create a new DataFrame "df" which has Column1a and Column2a only. What would be the best way to do this?

ansh.gandhi
  • Possible duplicate of [How do I add a new column to a Spark DataFrame (using PySpark)?](http://stackoverflow.com/questions/33681487/how-do-i-add-a-new-column-to-a-spark-dataframe-using-pyspark) –  Nov 16 '16 at 07:00
  • That solution looks at transforming existing columns in a DataFrame or creating a new column, whereas I want to pick Column1a and Column2a to form a new DataFrame. – ansh.gandhi Nov 16 '16 at 08:30
  • Is the context of the join based on the position? For example, would using the `rownumber()` approach in this answer work? http://stackoverflow.com/a/40626348/1100699 – Denny Lee Nov 16 '16 at 18:28
  • I need to give this a try, that might just do it. I'll try it over the weekend. Thanks for the help. I'll get back about how it goes. – ansh.gandhi Nov 18 '16 at 00:56

1 Answer


Denny Lee's suggestion is the way to go. It involves adding a Unique_Row_ID column to both DataFrames, one ID per row, then joining the two DataFrames on Unique_Row_ID, and finally dropping Unique_Row_ID if it is no longer needed.

ansh.gandhi