
This solution, in theory, works perfectly for what I need, which is to create a new copied version of a dataframe while excluding certain nested struct fields. Here is a minimal reproducible example of my issue:

>>> df.printSchema()
root
 |-- big: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- keep: string (nullable = true)
 |    |    |-- delete: string (nullable = true)

which you can instantiate like so:

from pyspark.sql.types import ArrayType, StringType, StructField, StructType

schema = StructType([StructField("big", ArrayType(StructType([
    StructField("keep", StringType()),
    StructField("delete", StringType())
])))])
df = spark.createDataFrame(spark.sparkContext.emptyRDD(), schema)

My goal is to convert the dataframe (along with the values in the columns I want to keep) to one whose schema excludes certain nested struct fields, delete for example:

root
 |-- big: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- keep: string (nullable = true)

According to the solution I linked, which leverages pyspark.sql's to_json and from_json functions, it should be achievable with something like this:

from pyspark.sql.functions import col, from_json, to_json

new_schema = StructType([StructField("big", ArrayType(StructType([
    StructField("keep", StringType())
])))])

test_df = df.withColumn("big", to_json(col("big"))).withColumn("big", from_json(col("big"), new_schema))

>>> test_df.printSchema()
root
 |-- big: struct (nullable = true)
 |    |-- big: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- keep: string (nullable = true)

>>> test_df.show()
+----+
| big|
+----+
|null|
+----+

So either I'm not following the directions correctly, or it doesn't work. How do you do this without a UDF?

PySpark to_json documentation
PySpark from_json documentation

notacorn

1 Answer


It should be working; you just need to adjust new_schema so that it describes the data type of the column big only, not a schema for the whole dataframe. With your original new_schema, from_json expects a JSON object with a field named big, but to_json(col("big")) produces a top-level JSON array, so the parse fails and returns null:

from pyspark.sql.functions import from_json, to_json

new_schema = ArrayType(StructType([StructField("keep", StringType())]))

test_df = df.withColumn("big", from_json(to_json("big"), new_schema))
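
As a sanity check, here is a minimal sketch running the fix end to end (the sample row is made up for illustration; any data matching the question's schema would do):

from pyspark.sql.functions import from_json, to_json
from pyspark.sql.types import ArrayType, StringType, StructField, StructType

schema = StructType([StructField("big", ArrayType(StructType([
    StructField("keep", StringType()),
    StructField("delete", StringType())
])))])
# One row with two structs, so the result is visible in show()
df = spark.createDataFrame([([("k1", "d1"), ("k2", "d2")],)], schema)

new_schema = ArrayType(StructType([StructField("keep", StringType())]))
test_df = df.withColumn("big", from_json(to_json("big"), new_schema))

test_df.printSchema()          # big: array<struct<keep:string>>
test_df.show(truncate=False)   # [{k1}, {k2}] -- the delete field is gone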
jxc
  • Good catch, I never thought to try that because the pyspark documentation made it seem like I needed to pass in a StructType: "schema – a StructType to use when parsing the json column" – notacorn Oct 04 '19 at 22:36
  • 1
    @ark0n, it's basically the datatype of a particular column, so can be any complex data type in pyspark. you can also use `df.select('big').schema.simpleString()` to retrieve and modify this information. (just make sure use back-tick to enclose field names if some of them contain special chars like SPACES, dot ) – jxc Oct 04 '19 at 22:49
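
Building on jxc's comment, here is a sketch (assuming the schema from the question) that derives the pruned type from the existing dataframe instead of re-typing it by hand:

from pyspark.sql.functions import from_json, to_json
from pyspark.sql.types import ArrayType, StructType

# Retrieve the column's type as a DDL-style string, useful for eyeballing
print(df.select("big").schema.simpleString())
# struct<big:array<struct<keep:string,delete:string>>>

# Or build the pruned type programmatically by filtering out the unwanted field
big_type = df.schema["big"].dataType          # ArrayType(StructType([...]))
pruned = ArrayType(StructType(
    [f for f in big_type.elementType.fields if f.name != "delete"]
))
test_df = df.withColumn("big", from_json(to_json("big"), pruned))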