I have a really large pyspark dataframe which gets data from json files. This is an example of the schema
|-- Col1: array (nullable = true)
| |-- element: double (containsNull = true)
|-- Col2: struct (nullable = true)
| |-- Col2-Col1: string (nullable = true)
| |-- Col2-Col2: string (nullable = true)
| |-- Col2-Col3: string (nullable = true)
When I do the following, I'm not able to get all the column names within the struct.
df.columns
out: ['Col1', 'Col2']
I need to replace all the hyphens with an underscore so that I can write it to Hive. Hive does not accept '-', '[', '/' etc. within the column name.
For example,
The column names should change to
|-- Col1: array (nullable = true)
| |-- element: double (containsNull = true)
|-- Col2: struct (nullable = true)
| |-- Col2_Col1: string (nullable = true)
| |-- Col2_Col2: string (nullable = true)
| |-- Col2_Col3: string (nullable = true)
The code needs to be generic enough such that many columns can be renamed without hard coding the values.