I have two tables with different IDs. I want to join the tables with a join. How can I only display the three special IDs?
Table 1
sp_date | id | value |
-------------+-----+--------+
2021-05-07 | 15 | 1 |
2021-05-07 | 13 | 3 |
2021-05-07 | 15 | 4 |
2021-04-08 | 13 | 2 |
2021-04-08 | 10 | 8 |
Table 2
sp_date | id | value |
-------------+-----+--------+
2021-05-07 | 17 | 2 |
2021-04-08 | 12 | 7 |
2021-03-01 | 17 | 3 |
only the IDs (15 and 13) from table 1 and the ID (17) from table 2 should be displayed with their values. Id_1 represents the id 15. Id_2 represents the id 13 and Id_3 represents the id 17. the value of the ID's, which have the same date the values of them should be aggregate.
Output should be:
date | id_1 | id_2 | id_3 |
-------------+-------+------+-------+
2021-05-07 | 5 | 3 | 2 |
2021-04-08 | | 2 | |
2021-03-01 | | | 3 |
Pyspark Code
p_df = table 1 e_df = table 2
p_df = p_df.join(e_df, on=[
p_df['sp_date'] == e_df['sk_date']])
p_df = p_df.groupBy(
'sp_date', 'sk_date'
).agg(
F.sum(F.when(F.col('id') == 15, F.col('value'))).alias('id_1'),
F.sum(F.when(F.col('id') == 13, F.col('value'))).alias('id_2'),
F.sum(F.when(F.col('id') == 17, F.col('value'))).alias('id_3'),
).orderBy(
F.desc('sp_date'), F.desc('sk_date')
)
with this code I get values that are too high for the respective columns. what am I doing wrong?