I have two data frames in Apache Spark, each with a column named "JoinValue". JoinValue is numeric and has the same meaning in both data frames.
I need the combination of both data frames as input (training and test set) for a Machine Learning algorithm. Is it correct that I first need to combine both DataFrames into a single DataFrame before using it in an ML Pipeline?
Example:
> df1.show()
+---------+---------+
|        a|JoinValue|
+---------+---------+
|A value 0|        0|
|A value 1|        5|
|A value 2|       10|
|A value 3|       15|
|A value 4|       20|
|A value 5|       25|
|A value 6|       30|
+---------+---------+
and
> df2.show()
+---------+---------+
|        b|JoinValue|
+---------+---------+
|B value 0|        0|
|B value 1|        7|
|B value 2|       14|
|B value 3|       21|
|B value 4|       28|
+---------+---------+
An outer join followed by an orderBy yields the following result:
> df1.join(df2, 'JoinValue', 'outer').orderBy('JoinValue').show()
+---------+---------+---------+
|JoinValue|        a|        b|
+---------+---------+---------+
|        0|A value 0|B value 0|
|        5|A value 1|     null|
|        7|     null|B value 1|
|       10|A value 2|     null|
|       14|     null|B value 2|
|       15|A value 3|     null|
|       20|A value 4|     null|
|       21|     null|B value 3|
|       25|A value 5|     null|
|       28|     null|B value 4|
|       30|A value 6|     null|
+---------+---------+---------+
What I actually want is this, without nulls:
+---------+---------+---------+
|JoinValue|        a|        b|
+---------+---------+---------+
|        0|A value 0|B value 0|
|        5|A value 1|B value 0|
|        7|A value 1|B value 1|
|       10|A value 2|B value 1|
|       14|A value 2|B value 2|
|       15|A value 3|B value 2|
|       20|A value 4|B value 2|
|       21|A value 4|B value 3|
|       25|A value 5|B value 3|
|       28|A value 5|B value 4|
|       30|A value 6|B value 4|
+---------+---------+---------+
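To make the intended rule explicit: for every JoinValue that occurs in either data frame, each row should carry the most recent a and the most recent b whose JoinValue is less than or equal to it (a forward fill). Below is a minimal plain-Python sketch of that rule, using only the rows shown above. It is not Spark code, and the helper name latest_at_or_before is made up for illustration; it only pins down the semantics I am after.

```python
from bisect import bisect_right

# Rows copied from df1 and df2 above as (JoinValue, value) pairs,
# already sorted by JoinValue.
df1_rows = [(0, "A value 0"), (5, "A value 1"), (10, "A value 2"),
            (15, "A value 3"), (20, "A value 4"), (25, "A value 5"),
            (30, "A value 6")]
df2_rows = [(0, "B value 0"), (7, "B value 1"), (14, "B value 2"),
            (21, "B value 3"), (28, "B value 4")]


def latest_at_or_before(rows, key):
    """Hypothetical helper: value of the last row with JoinValue <= key, else None."""
    keys = [k for k, _ in rows]
    i = bisect_right(keys, key)  # number of rows with JoinValue <= key
    return rows[i - 1][1] if i > 0 else None


# Every JoinValue that occurs in either frame, in ascending order.
all_keys = sorted({k for k, _ in df1_rows} | {k for k, _ in df2_rows})

# One output row per key, with a and b forward-filled from their own frame.
combined = [
    (k, latest_at_or_before(df1_rows, k), latest_at_or_before(df2_rows, k))
    for k in all_keys
]

for row in combined:
    print(row)
```

With the sample data this reproduces the eleven rows of the desired table. A key smaller than every JoinValue in one of the frames would still give None for that column, which does not happen here because both frames start at 0.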
What is the best way to use the JoinValue, a and b, coming from multiple data frames as features and labels in a machine learning algorithm?