I am new to spark (with python) and have searched all through for solutions to what I'm trying to do but haven't found anything that relates to this.
I Have two data frames, One called quantity and another called price
Quantity
+----+-----+-----+----+
|ID| Price_perf | Size|Sourceid|
+---- +----- +----- +----+
| 1 | NULL | 3 | 223|
| 1 | NULL | 3 | 223|
| 1 | NULL | 3 | 220|
| 2 | NULL | 6 | 290|
| 2 | NULL | 6 | 270|
+----+-----+-----+----+
Price
+----+-----+-----+----+
|ID| Price| Size|Date|Sourceid|
+---- +----- +----- +----+
| 1 | 7.5 | 3 |2017-01-03| 223|
| 1 | 39 | 3 |2012-01-06| 223|
| 1 | 12 | 3 |2009-04-01| 223|
| 1 | 28 | 3 |2011-11-08| 223|
| 1 | 9 | 3 |2012-09-12| 223|
| 1 | 15 | 3 |2017-07-03| 220|
| 1 | 10 | 3 |2017-05-03 | 220|
| 1 | 33 | 3 |2012-03-08 | 220|
+----+-----+-----+----+
Firstly, I am trying to join the above two dataframes and return a data frame that contains only values that have the same ID and SourceID
I have tried to do that by doing the following:
c= quantity.join(price,price.id==quantity.id, price.souceid==quantity.sourceid "left")
c.show()
This is the result I want to get but I'm not getting:
+----+-----+-----+----+
|ID| Price_perf|Price|Date| Size|Sourceid|
+---- +----- +----- +----+
| 1 | NULL |7.5 |2017-01-03 |3 | 223|
| 1 | NULL | 9 |2012-01-06 |3 | 223|
| 1 | NULL | 12 |2009-04-01|3 | 223|
| 1 | NULL | 28 |2011-11-08| 3 | 223|
| 1 | NULL | 9 |2012-09-12| 3 | 223|
| 1 | NULL | 15 |2017-07-03 |3 | 220|
| 1 | NULL | 10 |2017-05-03 |3 | 220|
| 1 | NULL |33 | 2012-03-08 |3 | 220|
+----+-----+-----+----+
Secondly, after doing the join, I'm trying to get the difference in price between the min and max dates in the joined data frame and input it as the Price_perf
This is what I've tried:
def modify_values(c):
for x in c:
if quantity.sourceid == price.sourceid:
return price.price(min(Date)) - price.price(max(Date))
else:
return "Not found"
ol_val = udf(modify_values, StringType())
ol_val.show()
So the final output should look something like this:
+----+-----+-----+----+
|ID| Price_perf|Price|Date| Size|Sourceid|
+---- +----- +----- +----+
| 1 | 4.5 |7.5 |2017-01-03 |3 | 223|
| 1 | 4.5 | 9 |2012-01-06 |3 | 223|
| 1 | 4.5 | 12 |2009-04-01|3 | 223|
| 1 | 4.5 | 28 |2011-11-08| 3 | 223|
| 1 | 4.5 | 9 |2012-09-12| 3 | 223|
| 1 | 18 | 15 |2017-07-03 |3 | 220|
| 1 | 18 | 10 |2017-05-03 |3 | 220|
| 1 | 18 |33 | 2012-03-08 |3 | 220|
+----+-----+-----+----+