how to show only certain ID's in pyspark with aggregated values?

Question

I have two tables with different IDs. I want to join the tables with a join. How can I only display the three special IDs?

Table 1

    sp_date  | id  | value  | 
-------------+-----+--------+
 2021-05-07  |  15 |   1    | 
 2021-05-07  |  13 |   3    | 
 2021-05-07  |  15 |   4    | 
 2021-04-08  |  13 |   2    | 
 2021-04-08  |  10 |   8    |

Table 2

    sp_date  | id  | value  | 
-------------+-----+--------+
 2021-05-07  |  17 |   2    | 
 2021-04-08  |  12 |   7    | 
 2021-03-01  |  17 |   3    |

only the IDs (15 and 13) from table 1 and the ID (17) from table 2 should be displayed with their values. Id_1 represents the id 15. Id_2 represents the id 13 and Id_3 represents the id 17. the value of the ID's, which have the same date the values of them should be aggregate.

Output should be:

     date    | id_1  | id_2 | id_3  | 
-------------+-------+------+-------+
 2021-05-07  |  5    |   3  |   2   | 
 2021-04-08  |       |   2  |       | 
 2021-03-01  |       |      |   3   |

Pyspark Code

p_df = table 1 e_df = table 2

p_df = p_df.join(e_df, on=[
       p_df['sp_date'] == e_df['sk_date']])

p_df = p_df.groupBy(
        'sp_date', 'sk_date'
    ).agg(
        F.sum(F.when(F.col('id') == 15, F.col('value'))).alias('id_1'),
        F.sum(F.when(F.col('id') == 13, F.col('value'))).alias('id_2'),
        F.sum(F.when(F.col('id') == 17, F.col('value'))).alias('id_3'),
    ).orderBy(
        F.desc('sp_date'), F.desc('sk_date')
    )

with this code I get values that are too high for the respective columns. what am I doing wrong?

score 0 · Answer 1 · answered May 18 '21 at 19:05

0

If you want to select three special ID's from the tables. your approach/steps should be as follows:

Filtering before join:

p_df1 = select * from p_df where id in (13, 15,17)

e_df1 = select * from e_df where id in (13, 15,17)

Join on id column:

p_df = p_df1.join(e_df1, on=[p_df1['sp_date'] == e_df1['sk_date']])

aggregate using group by date, pivot by id, aggreate function = sum(val). (refer the answer here)

answered May 18 '21 at 19:05

user2846168

21
2

`p_df1 = select * from p_df where id in (13, 15,17)` this expression doesn't work for me. I put the selection before the join. but still get very high results. – Maco-ic May 19 '21 at 09:53

how to show only certain ID's in pyspark with aggregated values?

1 Answers1