I have some JSON files with very long, multi-level schemas like
{ "a": {"b": {"c": [1, 2], "d": 3, "e": {...}}}}
but with hundreds of fields nested inside each other. I do have a description of the schema, but my problem is that datetime fields are stored like
"x": {"__encode__": "datetime", "value": "12314342.123"}
Since the datetime fields are scattered all across the schema, is there an easier way to convert them all into TimestampType
than iterating through them and running a UDF on each one?
Basically, I would like to describe my schema with a UDF built in, so that the field gets converted at read time.
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

my_schema = StructType([
    StructField("a", StructType([...])),
    ...
    StructField("x", StructType([  # <-- I would like to pass a transformation function here somehow
        StructField("__encode__", StringType()),
        StructField("value", DoubleType()),
    ])),
])
df = spark.read.schema(my_schema).json("foo.json", mode="FAILFAST")
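To make the question concrete, the kind of recursive rewrite I would like to avoid writing and maintaining by hand looks roughly like this (a sketch only: convert and is_encoded_datetime are names I made up, it assumes every encoded struct carries exactly the __encode__ and value fields, and it relies on transform, i.e. Spark 3.1+, to get through arrays):

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, ArrayType

def is_encoded_datetime(dtype):
    # an encoded datetime shows up in the schema as a struct with exactly
    # the __encode__/value fields (the "datetime" literal itself is data,
    # so it cannot be checked from the schema alone)
    return (isinstance(dtype, StructType)
            and set(dtype.fieldNames()) == {"__encode__", "value"})

def convert(col, dtype):
    # swap encoded structs for timestamps, recursing through structs and arrays
    if is_encoded_datetime(dtype):
        return F.to_timestamp(F.from_unixtime(col.getField("value")))
    if isinstance(dtype, StructType):
        return F.struct(*[
            convert(col.getField(f.name), f.dataType).alias(f.name)
            for f in dtype.fields
        ])
    if isinstance(dtype, ArrayType):
        return F.transform(col, lambda x: convert(x, dtype.elementType))
    return col

df2 = df.select(*[
    convert(F.col(f.name), f.dataType).alias(f.name) for f in df.schema.fields
])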
Edit: I ran into issues with the iterative approach of editing fields:
df.withColumn('foo', to_timestamp(from_unixtime('foo.value')))
only works for top-level columns. To edit an inner struct field I have to recreate the whole enclosing struct, which is complicated with several levels of nesting and arrays in the middle.
The approaches I found only handle a single level, not an arbitrarily nested struct. So it seems I need to reproduce the whole schema with those fields replaced?
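For reference, Column.withField (Spark 3.1+) can replace a single field inside a struct, and a minimal sketch of using it on one encoded field looks like this (the a.x path is a placeholder, not my real schema):

from pyspark.sql import functions as F

# replace the encoded struct at a.x with a plain timestamp
df2 = df.withColumn(
    "a",
    F.col("a").withField("x", F.to_timestamp(F.from_unixtime("a.x.value"))),
)

But as far as I can tell withField does not reach through arrays, so it does not cover my whole case either.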
Isn't there any easier way?
Sample: I want to convert from this schema to that schema -- especially the nested fields TimeA, TimeB, TimeC and TimeD.
I am extracting the fields using the metadata attribute with this code: https://pastebin.com/E7H7GV9A
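Roughly what I mean by the metadata attribute, in simplified form (the real extraction code is in the pastebin above; the "encode" key here is only an illustration):

from pyspark.sql.types import StructType, StructField, StringType, DoubleType

encoded_ts = StructType([
    StructField("__encode__", StringType()),
    StructField("value", DoubleType()),
])

# tag the field in the schema description so it can be found again later
field_x = StructField("x", encoded_ts, metadata={"encode": "datetime"})

# later, while walking df.schema:
#   if field.metadata.get("encode") == "datetime": convert that field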