I have a Spark DataFrame of ~70 million rows with three columns, ['id', 'date', 'val'], and a nested dictionary of the form

mydict = {
    'A': {
        '2018-09-31': val1,
        '2018-10-01': val2
    }
}

The outer keys ('A') are values in the id column and the inner keys are values in the date column. I am trying to update val based on this nested dictionary, accessed as mydict['A']['2018-09-31'] for example. Also, the update should only happen if the id is contained in a list, indexList.
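For reference, a toy version of the setup (the numeric values here are placeholders; only the shape matters):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Small stand-in for the ~70M-row DataFrame with columns ['id', 'date', 'val']
df = spark.createDataFrame(
    [('A', '2018-09-31', 1.0),
     ('A', '2018-10-01', 2.0),
     ('B', '2018-10-01', 3.0)],
    ['id', 'date', 'val'])

# Nested lookup: mydict[id][date] -> replacement value
mydict = {'A': {'2018-09-31': 10.0, '2018-10-01': 20.0}}

# Only rows whose id appears here should be updated
indexList = ['A']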

I have looked at and tried the methods from the questions below:

  1. Updating a dataframe column in spark
  2. replace values of one column in a spark df by dictionary key-values (pyspark)
  3. Pyspark: Replacing value in a column by searching a dictionary

Something like the following doesn't work:

import pyspark.sql.functions as F

update_func = (F.when(F.col('id').isin(indexList),
                      mydict[F.col('id')][F.col('date')])
                .otherwise(F.col('val')))
df = df.withColumn('new_val', update_func)

The error message I get is TypeError: unhashable type: 'Column', since mydict is a plain Python dictionary and F.col('id') is a Column object, which can't be used as a dictionary key.

Update: I avoided the problem by creating a new string key column combining the two columns used as keys, roughly as sketched below.
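A rough sketch of that approach: flatten the nested dictionary into a single-level dict keyed on the combined string, turn it into a literal map column with F.create_map, and look rows up by the same combined key (the '|' separator and the coalesce fallback are illustrative choices, not necessarily what I ended up with):

from itertools import chain

import pyspark.sql.functions as F

# Flatten {'A': {'2018-09-31': v, ...}} into {'A|2018-09-31': v, ...}
flat = {'{}|{}'.format(k, d): v
        for k, dates in mydict.items()
        for d, v in dates.items()}

# Literal map column built from alternating key, value literals
mapping = F.create_map(*[F.lit(x) for x in chain(*flat.items())])

# The same combined key, built from the two columns
key = F.concat_ws('|', F.col('id'), F.col('date'))

df = df.withColumn(
    'new_val',
    F.when(F.col('id').isin(indexList),
           # keep the old val when the combined key is not in the map
           F.coalesce(mapping.getItem(key), F.col('val')))
     .otherwise(F.col('val')))

If the dictionary is very large, converting it to a small DataFrame and doing a broadcast join on the two key columns would be another option.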
