
I have two dataframes, one with an estimated daily value, and another with the closed value for the month.

I need to show the estimated daily value ONLY when the closed value for the month does not exist.

Example:

df1:

DATA        ID   VALUE  DSC
2022-01-31  123  10     CLOSED MONTH
2022-02-28  123  20     CLOSED MONTH
2022-03-31  999  30     CLOSED MONTH
2022-04-30  999  40     CLOSED MONTH

df2:

DATA        ID   VALUE  DSC
2022-01-31  123  50     ESTIMATED DAY
2022-02-28  123  60     ESTIMATED DAY
2022-03-31  123  70     ESTIMATED DAY
2022-04-30  123  80     ESTIMATED DAY
2022-03-20  123  90     ESTIMATED DAY
2022-03-25  123  100    ESTIMATED DAY
2022-04-30  999  120    ESTIMATED DAY
2022-05-02  999  150    ESTIMATED DAY
2022-05-03  999  200    ESTIMATED DAY

EXPECTED OUTPUT:

DATA        ID   VALUE  DSC
2022-01-31  123  10     CLOSED MONTH
2022-02-28  123  20     CLOSED MONTH
2022-03-31  999  30     CLOSED MONTH
2022-04-30  999  40     CLOSED MONTH
2022-03-20  123  90     ESTIMATED DAY   - kept because the closed record for month 3 has a different ID (999)
2022-03-25  123  100    ESTIMATED DAY   - kept because the closed record for month 3 has a different ID (999)
2022-05-02  999  150    ESTIMATED DAY   - kept because there is no closed record for month 5
2022-05-03  999  200    ESTIMATED DAY   - kept because there is no closed record for month 5

Does anyone know a solution?

I tried using the window functions row_number, rank, and dense_rank, but it didn't work.

2 Answers


Create another column that contains the closing (month-end) date in both df1 and df2.

You can then use the isin function to filter out the df2 rows whose (closing_date, ID) pair already appears in df1.

Finally, simply concatenate your two tables.
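
A minimal sketch of that idea in PySpark (assuming DATA is already a date column; the CLOSING and KEY column names and the df1_m/df2_m variables are just made up for the example):

import pyspark.sql.functions as F

# Add the month-end "closing date" to both frames
df1_m = df1.withColumn("CLOSING", F.last_day("DATA"))
df2_m = df2.withColumn("CLOSING", F.last_day("DATA"))

# Collect the (closing date, ID) pairs that already have a closed value
closed_keys = [
    f"{row['CLOSING']}_{row['ID']}"
    for row in df1_m.select("CLOSING", "ID").distinct().collect()
]

# Keep only the estimated rows whose month/ID is not closed yet, then concatenate
result = (
    df2_m.withColumn("KEY", F.concat_ws("_", "CLOSING", "ID"))
         .filter(~F.col("KEY").isin(closed_keys))
         .drop("KEY", "CLOSING")
         .unionByName(df1)
)

A left_anti join on CLOSING and ID would do the same filtering without collecting the keys to the driver, but the isin version stays closest to the steps described above.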


Try using a join instead of a union

import pyspark.sql.functions as F

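# Full outer join on DATA and ID; coalesce keeps the closed value from df1 when both rows exist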
result = df1.join(df2, ["DATA", "ID"], "outer").select(
    "DATA",
    "ID",
    F.coalesce(df1.VALUE, df2.VALUE).alias("VALUE"),
    F.coalesce(df1.DSC, df2.DSC).alias("DSC"),
)

Does that produce the result that you expect?

pschale