
I am new to Spark (with Python) and have searched all over for a solution to what I'm trying to do, but I haven't found anything that relates to this.

I have two data frames, one called quantity and another called price.

Quantity
+---+----------+----+--------+
| ID|Price_perf|Size|Sourceid|
+---+----------+----+--------+
|  1|      NULL|   3|     223|
|  1|      NULL|   3|     223|
|  1|      NULL|   3|     220|
|  2|      NULL|   6|     290|
|  2|      NULL|   6|     270|
+---+----------+----+--------+

Price
+---+-----+----+----------+--------+
| ID|Price|Size|      Date|Sourceid|
+---+-----+----+----------+--------+
|  1|  7.5|   3|2017-01-03|     223|
|  1|   39|   3|2012-01-06|     223|
|  1|   12|   3|2009-04-01|     223|
|  1|   28|   3|2011-11-08|     223|
|  1|    9|   3|2012-09-12|     223|
|  1|   15|   3|2017-07-03|     220|
|  1|   10|   3|2017-05-03|     220|
|  1|   33|   3|2012-03-08|     220|
+---+-----+----+----------+--------+

Firstly, I am trying to join the two dataframes above and return a data frame that contains only the rows that have the same ID and Sourceid.

I have tried to do that by doing the following:

c = quantity.join(price, (price.id == quantity.id) & (price.sourceid == quantity.sourceid), "left")
c.show()

This is the result I want to get, but I'm not getting it:

+---+----------+-----+----------+----+--------+
| ID|Price_perf|Price|      Date|Size|Sourceid|
+---+----------+-----+----------+----+--------+
|  1|      NULL|  7.5|2017-01-03|   3|     223|
|  1|      NULL|    9|2012-01-06|   3|     223|
|  1|      NULL|   12|2009-04-01|   3|     223|
|  1|      NULL|   28|2011-11-08|   3|     223|
|  1|      NULL|    9|2012-09-12|   3|     223|
|  1|      NULL|   15|2017-07-03|   3|     220|
|  1|      NULL|   10|2017-05-03|   3|     220|
|  1|      NULL|   33|2012-03-08|   3|     220|
+---+----------+-----+----------+----+--------+

Secondly, after doing the join, I'm trying to get the difference in price between the min and max dates in the joined data frame, and fill that in as Price_perf.

This is what I've tried:

def modify_values(c):
    for x in c:
        if quantity.sourceid == price.sourceid:
            return price.price(min(Date)) - price.price(max(Date))
        else:
            return "Not found"

ol_val = udf(modify_values, StringType())
ol_val.show()

So the final output should look something like this:

+---+----------+-----+----------+----+--------+
| ID|Price_perf|Price|      Date|Size|Sourceid|
+---+----------+-----+----------+----+--------+
|  1|       4.5|  7.5|2017-01-03|   3|     223|
|  1|       4.5|    9|2012-01-06|   3|     223|
|  1|       4.5|   12|2009-04-01|   3|     223|
|  1|       4.5|   28|2011-11-08|   3|     223|
|  1|       4.5|    9|2012-09-12|   3|     223|
|  1|        18|   15|2017-07-03|   3|     220|
|  1|        18|   10|2017-05-03|   3|     220|
|  1|        18|   33|2012-03-08|   3|     220|
+---+----------+-----+----------+----+--------+

k19Uch

1 Answer


If you only want matches, then you actually want an inner join, which is the default type. And since your column names are the same, you can just list them out, so that the resulting join has only one column for each instead of two. (Normally, for multiple predicates, you need to combine parenthesized conditions with & rather than separating them with a comma.)

c = quantity.join(price,['id','sourceid'])
c.show()
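
For comparison, if you did want to spell out the join condition explicitly, a sketch of the &-combined form (using the column names from the question; note it keeps a separate id and sourceid column from each side) would look something like this:

c = quantity.join(price, (quantity.id == price.id) & (quantity.sourceid == price.sourceid))
c.show()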

As far as your price_perf goes, I'm not sure what you really want. The min and max are going to be constant within the same data, so your example doesn't make a lot of sense as it stands.
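
If, as your example output suggests, price_perf should be the price at the earliest date minus the price at the latest date within each (id, sourceid) group, a window-based sketch (assuming lowercase column names id, sourceid, price, and date in the joined frame) could look like this:

from pyspark.sql import functions as F
from pyspark.sql import Window

# one frame spanning the whole (id, sourceid) group, ordered by date
w = (Window.partitionBy("id", "sourceid")
     .orderBy("date")
     .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

# price at the earliest date minus price at the latest date, per group
c = c.withColumn("price_perf", F.first("price").over(w) - F.last("price").over(w))
c.show()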

Justin Pihony
  • Hi @justin, thanks for your comments. Regarding price_perf: the real data has more rows than I showed in the df; I added rows to the example to make it easier to understand. What I want is the difference in price between the latest and earliest dates (there are about 1000 dates and prices in the real data). – k19Uch Mar 22 '17 at 00:00
  • Then that part is not cut and dried. You're facing a SQL issue here: https://stackoverflow.com/questions/7745609/sql-select-only-rows-with-max-value-on-a-column – Justin Pihony Mar 22 '17 at 01:45