I'm using PySpark (Spark 1.4.1) on a cluster. I have two DataFrames that contain the same key values but different data in their other fields.
I partitioned each DataFrame separately by that key and wrote each one to HDFS as a Parquet file. I then read the Parquet files back into memory as new DataFrames. If I join the two DataFrames, will the processing for the join happen on the same workers, i.e., without the rows for each key being shuffled across the cluster?
For example:
dfA contains {userid, firstname, lastname}, partitioned by userid
dfB contains {userid, activity, job, hobby}, partitioned by userid

dfC = dfA.join(dfB, dfA.userid == dfB.userid)
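In code, the round trip looks roughly like this (a sketch, not my exact job: the HDFS paths and sample rows are made up, and I'm assuming DataFrameWriter.partitionBy is the partitioning mechanism in question):

```python
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="copartitioned-join")
sqlContext = SQLContext(sc)

# Illustrative stand-ins for the real DataFrames.
dfA = sqlContext.createDataFrame(
    [(1, "Ada", "Lovelace"), (2, "Alan", "Turing")],
    ["userid", "firstname", "lastname"])
dfB = sqlContext.createDataFrame(
    [(1, "chess", "mathematician", "music"), (2, "running", "logician", "cycling")],
    ["userid", "activity", "job", "hobby"])

# Write each DataFrame to HDFS, partitioned on the join key.
dfA.write.partitionBy("userid").parquet("hdfs:///tmp/dfA.parquet")
dfB.write.partitionBy("userid").parquet("hdfs:///tmp/dfB.parquet")

# Read the Parquet files back as new DataFrames and join on the key.
dfA2 = sqlContext.read.parquet("hdfs:///tmp/dfA.parquet")
dfB2 = sqlContext.read.parquet("hdfs:///tmp/dfB.parquet")
dfC = dfA2.join(dfB2, dfA2.userid == dfB2.userid)
```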
Is dfC already partitioned by userid?
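One way I've been trying to check this myself is to print the physical plan and look for a shuffle step. My assumption (which may be wrong) is that if Spark needs to repartition the data for the join, an Exchange operator shows up in the plan before the join:

```python
# Print the physical plan for the join. If an Exchange (shuffle)
# appears above the join inputs, the rows are being repartitioned
# across workers rather than joined in place.
dfC.explain()

# The number of partitions in the result can also be inspected
# through the underlying RDD.
print(dfC.rdd.getNumPartitions())
```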