I have two datasets, a source file and a lookup table:
Given the interchange_rate value in the source file, I need to find the closest matching rebate_rate in the lookup table. In the image you can see that transaction_id 2 has the highest interchange_rate and should therefore get the highest rebate_rate, while transaction_id 1 has the lowest interchange_rate and should get the lowest rebate_rate.
The red column is one I filled in manually in Excel just to illustrate the expected output.
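In case the image does not render, here is a hypothetical reconstruction of the two DataFrames. The column names come from my description and code below, but the values are invented for illustration, and I am assuming the lookup table also carries an interchange_rate column to compare against (adjust if yours is named differently):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Invented values, only to illustrate the shape of the source file
df_final = spark.createDataFrame(
    [(1, "ACME", 0.0150), (2, "ACME", 0.0310), (3, "ACME", 0.0240)],
    ["transaction_id", "buyer_name", "interchange_rate"],
)

# Invented values; the interchange_rate column here is my assumption
# about what the source rate is compared against
df_lookup = spark.createDataFrame(
    [("ACME", 0.0160, 0.0010), ("ACME", 0.0250, 0.0020), ("ACME", 0.0300, 0.0030)],
    ["buyer_name", "interchange_rate", "rebate_rate"],
)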
My initial idea was to loop through the rows in the source file and, for each row, search for the closest match in the lookup table. But I'm not a very experienced PySpark developer, so I'm looking for help writing code to accomplish this.
My first try was to use the foreach() method on the source DataFrame, but I get PicklingError: Could not serialize object: TypeError: cannot pickle '_thread.RLock' object. I suspect this is because the function references df_lookup, and a DataFrame (which holds a reference to the SparkContext) cannot be pickled and shipped to the executors:
def get_rebate(row):
    buyer_name = row["buyer_name"]
    # Filter the lookup table down to this row's buyer
    df_buyer = df_lookup.where(f"buyer_name == '{buyer_name}'")
    # Take the first matching rebate_rate (even without the pickling
    # error, Row objects are immutable, so this assignment would fail)
    row["rebate_rate"] = df_buyer.select("rebate_rate").first()
    return row

# df_final is the source file after a few cosmetic transformations.
# I need to add a new column "rebate_rate" to it.
df_final.foreach(lambda x: get_rebate(x))
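One direction I have been exploring instead of foreach() (not sure it is idiomatic) is to join the source to the lookup table on buyer_name, score each pair by how far apart the two rates are, and keep only the closest lookup row per transaction. This is a minimal sketch based on the hypothetical schemas above; the lookup_rate column is my assumption and should be renamed to whatever the real lookup table uses:

from pyspark.sql import Window
from pyspark.sql import functions as F

# Rename the lookup table's rate column to avoid a name collision
# with the source file's interchange_rate after the join
lookup = df_lookup.select(
    "buyer_name",
    F.col("interchange_rate").alias("lookup_rate"),
    "rebate_rate",
)

# Pair each source row with every lookup row for the same buyer,
# then measure how far apart the two rates are
joined = (
    df_final.join(lookup, on="buyer_name", how="left")
    .withColumn("rate_diff", F.abs(F.col("interchange_rate") - F.col("lookup_rate")))
)

# Keep only the closest lookup row per transaction
w = Window.partitionBy("transaction_id").orderBy("rate_diff")
result = (
    joined
    .withColumn("rn", F.row_number().over(w))
    .where("rn = 1")
    .drop("rn", "rate_diff", "lookup_rate")
)

Would something like this be the right way to do it in PySpark, or is there a better approach?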