I have a Spark dataframe of ~70 million rows with 3 columns ['id', 'date', 'val'], and a nested dictionary of the form
mydict = {
    'A': {
        '2018-09-31': val1,
        '2018-10-01': val2
    }
}
'A' is a value from the id column, and the inner keys come from the date column. I am trying to update val (in another column) based on this nested dictionary, accessed as mydict['A']['2018-09-31'] for example. Also, the update should only happen if the id is contained in a list, indexList.
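For concreteness, a minimal toy version of the setup (all values made up) would be:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for the real ~70M-row dataframe (values are made up)
df = spark.createDataFrame(
    [('A', '2018-09-31', 1.0),
     ('A', '2018-10-01', 2.0),
     ('B', '2018-10-01', 3.0)],
    ['id', 'date', 'val'])

mydict = {'A': {'2018-09-31': 10.0, '2018-10-01': 20.0}}
indexList = ['A']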
I have looked at and tried the approaches from the questions below:
- Updating a dataframe column in spark
- replace values of one column in a spark df by dictionary key-values (pyspark)
- Pyspark: Replacing value in a column by searching a dictionary
Something like the following doesn't work:
import pyspark.sql.functions as F

update_func = (F.when(F.col('id').isin(indexList), mydict[F.col('id')][F.col('date')])
                .otherwise(F.col('val')))
df = df.withColumn('new_val', update_func)
The error message I get is TypeError: unhashable type: 'Column', since mydict[F.col('id')] tries to use a Column object as a plain Python dictionary key.
Update: I avoided the problem by creating a new string key column that combines the two columns used as keys.
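For reference, a minimal sketch of that workaround, assuming the dictionary is small enough to inline as a literal map expression (the underscore separator and the flat/mapping names are my own choices):

import itertools
import pyspark.sql.functions as F

# Flatten the nested dict into a single-level {'<id>_<date>': val} dict
flat = {f'{i}_{d}': v for i, dates in mydict.items() for d, v in dates.items()}

# Build a literal map column: create_map(k1, v1, k2, v2, ...)
mapping = F.create_map(*[F.lit(x) for x in itertools.chain(*flat.items())])

df = (
    df.withColumn('key', F.concat_ws('_', 'id', 'date'))
      .withColumn(
          'new_val',
          F.when(
              F.col('id').isin(indexList) & mapping[F.col('key')].isNotNull(),
              mapping[F.col('key')],
          ).otherwise(F.col('val')),
      )
      .drop('key')
)

If the dictionary is too large to ship as a literal expression, the same flattened dict can instead be turned into a small lookup dataframe and (broadcast-)joined on the key column.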