
I have some JSON files with very long, multi-level schemas, like

{ "a": {"b": {"c": [1, 2], "d": 3, "e": {...}}}}

but with hundreds of fields nested inside each other. I do have a description of the schema, but my problem is that datetime fields are stored like

"x": {"__encode__": "datetime", "value": "12314342.123"}

Since the datetime fields are scattered all over the schema, is there an easier way to convert them all into TimestampType than iterating through them and running a UDF on each one?

Basically, I would like to describe my schema with a UDF built in, so that the field is converted at read time.

my_schema = StructType([
    StructField("a", StructType([...])),
    ...
    StructField("x", StructType([   # <-- I would like to pass a transformation function somehow
       StructField("__encode__", StringType()),
       StructField("value", DoubleType()),
    ])
])
df = spark.read.schema(my_schema).json("foo.json", mode="FAILFAST")

Edit: I found issues with the iterative approach of editing fields:

df.withColumn('foo', to_timestamp(from_unixtime('foo.value')))

This only works with top-level columns; to edit inner struct columns I have to recreate the whole struct, which is complicated with several levels of nesting, and I also have arrays in the middle.
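For example (a sketch only, assuming Spark 3.1+ and the toy a.b.x path from above), Column.withField can replace one nested field without listing its siblings, but it still has to be chained once per nesting level and it cannot reach inside arrays:

from pyspark.sql.functions import col, to_timestamp, from_unixtime

# Sketch: replace the encoded struct at a.b.x with a plain timestamp.
# One withField call is needed per level of nesting.
df2 = df.withColumn(
    "a",
    col("a").withField(
        "b",
        col("a.b").withField(
            "x",
            to_timestamp(from_unixtime(col("a.b.x.value"))),
        ),
    ),
)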

I found: this and this

But they only handle a single level, not an arbitrarily nested struct. So it seems that I need to reproduce the whole schema with those fields replaced?

Isn't there any easier way?


Sample: I want to convert from this schema to that schema, especially the nested fields TimeA, TimeB, TimeC and TimeD.

I am extracting the fields using the metadata attribute with this code: https://pastebin.com/E7H7GV9A
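Roughly, the idea is to walk the schema recursively and collect the dotted path of every field tagged as a datetime. A sketch of that idea only (the metadata key "encode" is an assumption for illustration, this is not the exact pastebin code):

from pyspark.sql.types import StructType, ArrayType

# Sketch: collect dotted paths of every field whose StructField.metadata
# marks it as an encoded datetime. The "encode" key is assumed.
def find_datetime_paths(dtype, prefix=""):
    paths = []
    if isinstance(dtype, StructType):
        for field in dtype.fields:
            path = f"{prefix}.{field.name}" if prefix else field.name
            if field.metadata.get("encode") == "datetime":
                paths.append(path)
            else:
                paths.extend(find_datetime_paths(field.dataType, path))
    elif isinstance(dtype, ArrayType):
        # array elements keep the same logical path
        paths.extend(find_datetime_paths(dtype.elementType, prefix))
    return paths

datetime_paths = find_datetime_paths(my_schema)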

  • AFAIK, there is no simple way to do this. You either have to explode all the nested arrays, update the timestamp fields, then group by and recreate the original structure, or use some higher-order functions on arrays, but again you'll have to list all the struct fields to recreate them when updating a field. You can see [this answer](https://stackoverflow.com/a/59239011/1386551) or [this one](https://stackoverflow.com/a/69911792/1386551) for example. – blackbishop Dec 22 '21 at 17:36
  • @blackbishop Thank you for the reply. Since these structs are arbitrarily nested with more structs and arrays, I find this approach too complicated to apply, and it could generate unforeseen bugs IMO. I wanted something in the realm of `UserDefinedType`, which was removed in Spark 2.0, or anything that lets me tell Spark that I want to decode some fields in a special way. – JBernardo Dec 23 '21 at 17:43
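For reference, a minimal sketch of the higher-order-function approach mentioned in the comment above (the field names items, name and ts are hypothetical; every sibling field of the array element still has to be listed to rebuild it):

from pyspark.sql.functions import col, transform, struct, to_timestamp, from_unixtime

# Hypothetical example: "items" is an array of structs with fields
# "name" and "ts", where "ts" is the encoded datetime struct.
# Every field of the element struct must be re-listed to rebuild it.
df3 = df.withColumn(
    "items",
    transform(
        col("items"),
        lambda item: struct(
            item["name"].alias("name"),
            to_timestamp(from_unixtime(item["ts"]["value"])).alias("ts"),
        ),
    ),
)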

0 Answers