PySpark - Remove Illegal Hive Character from schema

Question

I have a very large PySpark DataFrame that is loaded from JSON files. This is an example of the schema:

 |-- Col1: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- Col2: struct (nullable = true)
 |    |-- Col2-Col1: string (nullable = true)
 |    |-- Col2-Col2: string (nullable = true)
 |    |-- Col2-Col3: string (nullable = true)

When I do the following, I only get the top-level columns and not the column names within the struct:

df.columns
out: ['Col1', 'Col2']
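
For context, `df.columns` only reports top-level names; the nested ones live in `df.schema`. A minimal sketch of listing them, assuming the schema is first converted to a plain dict with `df.schema.jsonValue()` (the helper name `nested_names` is hypothetical):

```python
def nested_names(dtype, prefix=""):
    """Yield dotted paths for every field in a schema dict.

    `dtype` is assumed to be the dict produced by df.schema.jsonValue().
    """
    if not isinstance(dtype, dict):
        return  # simple types such as "string" or "double" have no children
    kind = dtype.get("type")
    if kind == "struct":
        for field in dtype["fields"]:
            path = prefix + field["name"]
            yield path
            yield from nested_names(field["type"], path + ".")
    elif kind == "array":
        # array elements share the parent's path
        yield from nested_names(dtype["elementType"], prefix)
    elif kind == "map":
        yield from nested_names(dtype["valueType"], prefix)
```

It would be called as `list(nested_names(df.schema.jsonValue()))`, which for the schema above should include entries such as `Col2.Col2-Col1`.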

I need to replace all the hyphens with underscores so that I can write the DataFrame to Hive. Hive does not accept characters such as '-', '[', or '/' in column names.

For example, the column names should change to:

 |-- Col1: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- Col2: struct (nullable = true)
 |    |-- Col2_Col1: string (nullable = true)
 |    |-- Col2_Col2: string (nullable = true)
 |    |-- Col2_Col3: string (nullable = true)

The code needs to be generic enough that many columns, including nested ones, can be renamed without hard-coding the names.
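
One possible direction, sketched here under assumptions rather than as a confirmed solution: rewrite the names in the schema's JSON form and then cast each top-level column to its renamed type. The helper names (`clean_name`, `clean_type`, `rename_for_hive`) are hypothetical, and treating only letters, digits, and underscores as safe is an assumption about what Hive accepts. The last step relies on Spark allowing a struct to be cast to a struct with the same field types but different field names.

```python
import re

# Assumption: anything other than letters, digits and underscore is unsafe.
ILLEGAL = re.compile(r"[^0-9A-Za-z_]")


def clean_name(name):
    """Replace every character outside the assumed safe set with '_'."""
    return ILLEGAL.sub("_", name)


def clean_type(dtype):
    """Return a copy of a schema dict (from df.schema.jsonValue())
    with all field names cleaned, at any nesting depth."""
    if not isinstance(dtype, dict):
        return dtype  # simple type such as "string"
    kind = dtype.get("type")
    if kind == "struct":
        fields = [
            {**f, "name": clean_name(f["name"]), "type": clean_type(f["type"])}
            for f in dtype["fields"]
        ]
        return {**dtype, "fields": fields}
    if kind == "array":
        return {**dtype, "elementType": clean_type(dtype["elementType"])}
    if kind == "map":
        return {
            **dtype,
            "keyType": clean_type(dtype["keyType"]),
            "valueType": clean_type(dtype["valueType"]),
        }
    return dtype  # any other type is left unchanged


def rename_for_hive(df):
    """Cast each top-level column to its renamed type and alias it."""
    # Imported here so the helpers above can be used without Spark installed.
    from pyspark.sql.types import StructType

    new_schema = StructType.fromJson(clean_type(df.schema.jsonValue()))
    columns = [
        # Backticks guard against dots in the original top-level name.
        df["`%s`" % old.name].cast(new.dataType).alias(new.name)
        for old, new in zip(df.schema.fields, new_schema.fields)
    ]
    return df.select(columns)
```

Usage would be `clean_df = rename_for_hive(df)`. Note that two different names can collapse to the same cleaned name (for example `a-b` and `a/b`), so checking for collisions before writing to Hive may be worthwhile.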

Possible duplicate of [Rename nested field in spark dataframe](https://stackoverflow.com/questions/43004849/rename-nested-field-in-spark-dataframe) — pault, Jun 21 '18 at 22:16

