I am quite new to PySpark and this problem has me stumped. Basically, I am looking for a scalable way to loop a typecast over the fields of a StructType or ArrayType.
Example of my data schema:
root
|-- _id: string (nullable = true)
|-- created: timestamp (nullable = true)
|-- card_rates: struct (nullable = true)
| |-- rate_1: integer (nullable = true)
| |-- rate_2: integer (nullable = true)
| |-- rate_3: integer (nullable = true)
| |-- card_fee: integer (nullable = true)
| |-- payment_method: string (nullable = true)
|-- online_rates: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- rate_1: integer (nullable = true)
| | |-- rate_2: integer (nullable = true)
| | |-- online_fee: double (nullable = true)
|-- updated: timestamp (nullable = true)
As you can see, card_rates is a struct and online_rates is an array of structs. I am looking for a way to loop through all the fields above and conditionally typecast them. Ideally, a field that is supposed to be numeric should be converted to double, and a field that is supposed to be a string should be converted to string. I need a loop because the number of rate_* fields may grow over time.
For now, though, I would be content with simply looping over the fields and casting all of them to string, since I am very new to PySpark and still getting a feel for it.
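To make the conditional rule concrete, here is a rough sketch of it as a recursive walk over the JSON form of a schema (the nested dict that `df.schema.jsonValue()` returns). It is only an illustration under assumptions: the names `numeric_to_double`, `NUMERIC_TYPES` and `field` are hypothetical, the set of type names treated as numeric is my own choice, and the sample dict is written by hand from the schema above rather than read from a real DataFrame.

```python
# Assumption: names Spark uses for numeric types in a schema's JSON form.
NUMERIC_TYPES = {"byte", "short", "integer", "long", "float", "double"}


def numeric_to_double(node):
    """Return a copy of a schema JSON node with numeric leaf types set to "double"."""
    if isinstance(node, str):
        # Primitive type name such as "integer", "string" or "decimal(10,2)".
        if node in NUMERIC_TYPES or node.startswith("decimal"):
            return "double"
        return node
    if node["type"] == "struct":
        # Recurse into every field, however many rate_* fields there are.
        fields = [{**f, "type": numeric_to_double(f["type"])} for f in node["fields"]]
        return {**node, "fields": fields}
    if node["type"] == "array":
        return {**node, "elementType": numeric_to_double(node["elementType"])}
    # Maps and any other complex types are left unchanged in this sketch.
    return node


def field(name, dtype):
    """Hypothetical helper: build one field entry in the JSON schema layout."""
    return {"name": name, "type": dtype, "nullable": True, "metadata": {}}


# Hand-written stand-in for df.schema.jsonValue() on the schema above.
schema_json = {
    "type": "struct",
    "fields": [
        field("_id", "string"),
        field("created", "timestamp"),
        field("card_rates", {"type": "struct", "fields": [
            field("rate_1", "integer"),
            field("rate_2", "integer"),
            field("rate_3", "integer"),
            field("card_fee", "integer"),
            field("payment_method", "string"),
        ]}),
        field("online_rates", {"type": "array", "containsNull": True, "elementType": {
            "type": "struct", "fields": [
                field("rate_1", "integer"),
                field("rate_2", "integer"),
                field("online_fee", "double"),
            ]}}),
        field("updated", "timestamp"),
    ],
}

target = numeric_to_double(schema_json)
```

With a real DataFrame, the resulting dict could presumably be turned back into a schema with `StructType.fromJson(target)`, and each top-level column cast to its new type with something like `df.select([df[f.name].cast(f.dataType).alias(f.name) for f in new_schema.fields])`. That last step is untested here and depends on Spark being able to cast a struct, or an array of structs, to a type with the same field layout.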
My desired output schema:
root
|-- _id: string (nullable = true)
|-- created: timestamp (nullable = true)
|-- card_rates: struct (nullable = true)
| |-- rate_1: double (nullable = true)
| |-- rate_2: double (nullable = true)
| |-- rate_3: double (nullable = true)
| |-- card_fee: double (nullable = true)
| |-- payment_method: string (nullable = true)
|-- online_rates: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- rate_1: double (nullable = true)
| | |-- rate_2: double (nullable = true)
| | |-- online_fee: double (nullable = true)
|-- updated: timestamp (nullable = true)
I am running out of ideas on how to do this.
I found this reference: PySpark convert struct field inside array to string
but that solution hardcodes the field names and does not really loop over the fields.
Kindly help.