
Assume the following two DataFrames in PySpark with an equal number of rows:
df1:
 |_ Column1a
 |_ Column1b

df2:
 |_ Column2a
 |_ Column2b

I wish to create a new DataFrame "df" which has Column1a and Column2a only. What would be the best way to do this?

ansh.gandhi
  • Possible duplicate of [How do I add a new column to a Spark DataFrame (using PySpark)?](http://stackoverflow.com/questions/33681487/how-do-i-add-a-new-column-to-a-spark-dataframe-using-pyspark) –  Nov 16 '16 at 07:00
  • That solution looks at transforming existing columns in a DataFrame or creating a new column, whereas I want to pick Column1a and Column2a to form a new DataFrame. – ansh.gandhi Nov 16 '16 at 08:30
  • Is the context of the join based on the position? For example, would using the `rownumber()` approach in this answer work? http://stackoverflow.com/a/40626348/1100699 – Denny Lee Nov 16 '16 at 18:28
  • I need to give this a try, that might just do it. I'll try it over the weekend. Thanks for the help. I'll get back about how it goes. – ansh.gandhi Nov 18 '16 at 00:56

1 Answer


Denny Lee's suggestion is the way to go. It involves adding a Unique_Row_ID column to both DataFrames, one ID per row, then joining the two DataFrames on Unique_Row_ID, and finally dropping Unique_Row_ID if it is no longer needed.

ansh.gandhi