
I have two dataframes, one with an estimated daily value, and another with the closed value for the month.

I need to show the estimated daily value ONLY when the closed value for the month does not exist.

Example:

df1:

DATA        ID   VALUE  DSC
2022-01-31  123  10     CLOSED MONTH
2022-02-28  123  20     CLOSED MONTH
2022-03-31  999  30     CLOSED MONTH
2022-04-30  999  40     CLOSED MONTH

df2:

DATA        ID   VALUE  DSC
2022-01-31  123  50     ESTIMATED DAY
2022-02-28  123  60     ESTIMATED DAY
2022-03-31  123  70     ESTIMATED DAY
2022-04-30  123  80     ESTIMATED DAY
2022-03-20  123  90     ESTIMATED DAY
2022-03-25  123  100    ESTIMATED DAY
2022-04-30  999  120    ESTIMATED DAY
2022-05-02  999  150    ESTIMATED DAY
2022-05-03  999  200    ESTIMATED DAY

EXPECTED OUTPUT:

DATA        ID   VALUE  DSC
2022-01-31  123  10     CLOSED MONTH
2022-02-28  123  20     CLOSED MONTH
2022-03-31  999  30     CLOSED MONTH
2022-04-30  999  40     CLOSED MONTH
2022-03-20  123  90     ESTIMATED DAY   - kept because the closed record for month 3 has a different ID (999)
2022-03-25  123  100    ESTIMATED DAY   - kept because the closed record for month 3 has a different ID (999)
2022-05-02  999  150    ESTIMATED DAY   - kept because there is no closed record for month 5
2022-05-03  999  200    ESTIMATED DAY   - kept because there is no closed record for month 5

Does anyone know a solution?

I tried using the window functions row_number, rank, and dense_rank, but it didn't work.

2 Answers


Create another column that contains the closing (month-end) date in both df1 and df2.

You can then use the isin function to filter out the df2 rows whose (closing_date, ID) pair already appears in df1.

Finally, simply concatenate your two tables.
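
A minimal sketch of that idea in PySpark (assuming DATA is already a date column; the CLOSING and KEY column names and the df1_m/df2_m variables are just made up for the example):

import pyspark.sql.functions as F

# Add the month-end "closing date" to both frames
df1_m = df1.withColumn("CLOSING", F.last_day("DATA"))
df2_m = df2.withColumn("CLOSING", F.last_day("DATA"))

# Collect the (closing date, ID) pairs that already have a closed value
closed_keys = [
    f"{row['CLOSING']}_{row['ID']}"
    for row in df1_m.select("CLOSING", "ID").distinct().collect()
]

# Keep only the estimated rows whose month/ID is not closed yet, then concatenate
result = (
    df2_m.withColumn("KEY", F.concat_ws("_", "CLOSING", "ID"))
         .filter(~F.col("KEY").isin(closed_keys))
         .drop("KEY", "CLOSING")
         .unionByName(df1)
)

A left_anti join on CLOSING and ID would do the same filtering without collecting the keys to the driver, but the isin version stays closest to the steps described above.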


Try using a join instead of a union

import pyspark.sql.functions as F

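# Full outer join on DATA and ID; coalesce keeps the closed value from df1 when both rows exist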
result = df1.join(df2, ["DATA", "ID"], "outer").select(
    "DATA",
    "ID",
    F.coalesce(df1.VALUE, df2.VALUE).alias("VALUE"),
    F.coalesce(df1.DSC, df2.DSC).alias("DSC"),
)

Does that produce the result that you expect?

pschale