
Say I have the following data:

{"id":1, "payload":[{"foo":1, "lol":2},{"foo":2, "lol":2}]}

I would like to explode the payload and add a column to it, like this:

df = df.select('id', F.explode('payload').alias('data'))
df = df.withColumn('data.bar', F.col('data.foo') * 2)

However this results in a dataframe with three columns:

  • id
  • data
  • data.bar

I expected the data.bar to be part of the data struct...

How can I add a column to the exploded struct, instead of adding a top-level column?

– surj

    You'll have to rebuild the schema, use a `select`, or use a `udf` to modify the data - just about all these options are covered here: https://stackoverflow.com/questions/31615657/how-to-add-a-new-struct-column-to-a-dataframe – Justin Pihony Sep 13 '17 at 19:25
  • Possible duplicate of [How to add a new Struct column to a DataFrame](https://stackoverflow.com/questions/31615657/how-to-add-a-new-struct-column-to-a-dataframe) – T. Gawęda Sep 13 '17 at 19:42

1 Answer

df = df.withColumn('data', F.struct(
    df['data']['foo'].alias('foo'),
    (df['data']['foo'] * 2).alias('bar')
))

This will result in:

root
 |-- id: long (nullable = true)
 |-- data: struct (nullable = false)
 |    |-- foo: long (nullable = true)
 |    |-- bar: long (nullable = true)

UPDATE:

from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, LongType

def func(x):
    tmp = x.asDict()
    tmp['foo'] = tmp.get('foo', 0) * 100
    # Rebuild a Row with the same field names and the updated values
    names, values = zip(*tmp.items())
    return Row(*names)(*values)

df = df.withColumn('data', F.udf(func, StructType([
    StructField('foo', LongType()),
    StructField('lol', LongType()),
]))(df['data']))

P.S.

Spark columns are immutable, so there are no real in-place operations.

Whenever you want to modify a column "in place", you actually have to replace it.

– Zhang Tong
  • This is definitely going in the right direction! Is there a way to do this without knowing about the contents of `data` (except `data.foo` of course)? I edited my question to add an additional `data.lol` column to make this more clear. – surj Sep 14 '17 at 20:45