
I've been trying to update a nested field value in a PySpark DataFrame. I followed the answer given at How to update a value in the nested column of struct using pyspark, but it only reaches the first level of nesting, not the level I need.

JSON data:
{
  "documentKey": {
    "_id": "1234567"
  },
  "fullDocument": {
    "did": "1fcee68a43c500e0",
    "sg": {
      "media_ended_timestamp": 1626940125,
      "media_id": 56010
    },
    "ts": "ts"
  }
}
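
For anyone reproducing this, a minimal sketch of loading the document into a DataFrame (the file path is a placeholder; multiLine is needed because the JSON is pretty-printed across multiple lines):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder path; multiLine lets Spark parse a multi-line JSON document.
df = spark.read.option('multiLine', 'true').json('/path/to/sample.json')
df.printSchema()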

Now, let's say I want to update the field fullDocument.sg.media_id from 56010 to 11111. What would be a possible way to do this?

Note: With the answer mentioned in the link I pasted above, I was able to update fullDocument.did successfully.
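
For context, that one-level update looked roughly like this (a sketch based on the linked answer; the replacement value is a placeholder, and the field names come from the sample document):

from pyspark.sql import functions as F

# Rebuild the fullDocument struct in place, replacing only 'did';
# sg and ts are carried over unchanged.
df = df.withColumn(
    'fullDocument',
    F.struct(
        F.lit('new_did').alias('did'),
        F.col('fullDocument.sg').alias('sg'),
        F.col('fullDocument.ts').alias('ts'),
    )
)

This works for fields one level down, but to change sg.media_id the inner sg struct would have to be rebuilt as well, which is the part I'm stuck on.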

Spark: 3.1.1, Python: 3.9


1 Answer


I was able to do it with the piece of code below:

from pyspark.sql import functions as F

# Flatten fullDocument and sg to the top level, overwrite media_id,
# then rebuild both structs from the flattened columns and drop them.
df = df.select('*', 'fullDocument.*') \
    .select('*', 'sg.*') \
    .withColumn('media_id', F.lit(11111)) \
    .withColumn('sg', F.struct(*[F.col(c) for c in df.select('fullDocument.sg.*').columns])) \
    .withColumn('fullDocument', F.struct(*[F.col(c) for c in df.select('fullDocument.*').columns])) \
    .drop(*df.select('fullDocument.*').columns) \
    .drop(*df.select('fullDocument.sg.*').columns)
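
As a side note, since you are on Spark 3.1, Column.withField (added in 3.1.0) can replace a nested struct field directly via a dotted path, which avoids the flatten-and-rebuild steps. A minimal sketch, assuming the same schema as in the question:

from pyspark.sql import functions as F

# withField replaces a field inside a struct by name; a dotted path
# ('sg.media_id') reaches into the nested struct without rebuilding it.
df = df.withColumn(
    'fullDocument',
    F.col('fullDocument').withField('sg.media_id', F.lit(11111))
)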