I have delivery data. I want to sum
total ordered net_wt based on cust_id. In most of cases its easy as I can sum everything by grouping cust_id. However, I need to keep delivery date as well. But I got in to trouble as some orders have multiple deliveries date and then I am not sure how can I keep at least latest order date combined with sum of all orders irrespective of the delivery date.
cust_id date_delivery log_type net_wt
4776210 3/4/2021 Bulk 3880
4776210 3/4/2021 Bulk 6160
4787563 3/20/2021 Bulk 10360
4787563 3/20/2021 Bulk 3800
4787563 3/20/2021 Bulk 5020
4787563 3/20/2021 Bulk 2120
4787563 3/25/2021 Bulk 2100
4787563 3/25/2021 Bulk 2140
4792002 3/27/2021 Bulk 9042
4790494 3/25/2021 Bulk 3718
4790494 3/25/2021 Bulk 8102
required output
cust_id date_delivery log_type total_order
4776210 3/4/2021 Bulk 10040
4787563 ???????? Bulk 25540
4790494 3/25/2021 Bulk 11820
I have tried
df.createOrReplaceTempView('df')
df_test = spark.sql("""
SELECT cust_id, date_delivery,
SUM(net_wt) AS `total_order`
FROM df
GROUP BY 1
""")
display(df_test)
but its not working. I want one cust_id with atleast 1 corresponding date(could be most recent ndelivery_date).
any help would be apprciated
Thanks in advance