I have a very large PySpark DataFrame that is loaded from JSON files. This is an example of the schema:

 |-- Col1: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- Col2: struct (nullable = true)
 |    |-- Col2-Col1: string (nullable = true)
 |    |-- Col2-Col2: string (nullable = true)
 |    |-- Col2-Col3: string (nullable = true)

When I do the following, I only get the top-level column names, not the names of the fields within the struct:

df.columns
out: ['Col1', 'Col2']
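As a minimal sketch of how the nested names could be listed: `df.columns` only reports top-level columns, but the full schema is available through `df.schema`, and `df.schema.jsonValue()` returns it as plain Python dicts that can be walked recursively. The helper name `flatten_field_names` is hypothetical, and the `schema` dict below is hand-written to mirror the example schema rather than taken from a real DataFrame.

```python
def flatten_field_names(node, prefix=""):
    """Collect dotted paths of every struct field in a schema JSON node.

    `node` is assumed to have the shape produced by df.schema.jsonValue().
    """
    names = []
    if isinstance(node, dict):
        kind = node.get("type")
        if kind == "struct":
            for field in node["fields"]:
                path = prefix + field["name"]
                names.append(path)
                # Descend into the field's own type (struct, array, map, ...).
                names.extend(flatten_field_names(field["type"], path + "."))
        elif kind == "array":
            names.extend(flatten_field_names(node["elementType"], prefix))
        elif kind == "map":
            names.extend(flatten_field_names(node["valueType"], prefix))
    # Primitive types are plain strings such as "double" and have no fields.
    return names


# Hand-written stand-in for df.schema.jsonValue() on the example schema.
schema = {
    "type": "struct",
    "fields": [
        {"name": "Col1", "nullable": True, "metadata": {},
         "type": {"type": "array", "elementType": "double", "containsNull": True}},
        {"name": "Col2", "nullable": True, "metadata": {},
         "type": {"type": "struct", "fields": [
             {"name": "Col2-Col1", "type": "string", "nullable": True, "metadata": {}},
             {"name": "Col2-Col2", "type": "string", "nullable": True, "metadata": {}},
             {"name": "Col2-Col3", "type": "string", "nullable": True, "metadata": {}},
         ]}},
    ],
}

print(flatten_field_names(schema))
# ['Col1', 'Col2', 'Col2.Col2-Col1', 'Col2.Col2-Col2', 'Col2.Col2-Col3']
```

With a real DataFrame the call would be `flatten_field_names(df.schema.jsonValue())`.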

I need to replace every hyphen with an underscore so that I can write the DataFrame to Hive. Hive does not accept characters such as '-', '[', or '/' in column names.

For example, the column names should change to:

 |-- Col1: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- Col2: struct (nullable = true)
 |    |-- Col2_Col1: string (nullable = true)
 |    |-- Col2_Col2: string (nullable = true)
 |    |-- Col2_Col3: string (nullable = true)

The code needs to be generic enough to rename many columns, including nested ones, without hard-coding their names.
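One possible approach, sketched under assumptions: rewrite the field names in the schema's JSON form, rebuild a `StructType` from it, and cast each top-level column to its renamed type (a struct-to-struct cast keeps the values and takes the field names from the target type). The names `sanitize_name`, `sanitize_schema_json`, and `rename_nested_columns` are hypothetical helpers. The rule "replace anything that is not a letter, digit, or underscore" is an assumption about what Hive accepts, and the sketch does not handle two names that collide after replacement.

```python
import re

# Assumption: anything other than letters, digits and underscores is unsafe.
_INVALID = re.compile(r"[^0-9A-Za-z_]")


def sanitize_name(name):
    """Replace every disallowed character in a column name with '_'."""
    return _INVALID.sub("_", name)


def sanitize_schema_json(node):
    """Return a copy of a schema JSON node with all struct field names sanitized.

    `node` is assumed to have the shape produced by df.schema.jsonValue().
    """
    if isinstance(node, list):
        return [sanitize_schema_json(item) for item in node]
    if not isinstance(node, dict):
        return node  # primitive type names such as "string"
    # Copy the node; field metadata is user data and is passed through as-is.
    out = {key: value if key == "metadata" else sanitize_schema_json(value)
           for key, value in node.items()}
    if node.get("type") == "struct" and "fields" in node:
        for field in out["fields"]:
            field["name"] = sanitize_name(field["name"])
    return out


def rename_nested_columns(df):
    """Return df with top-level and nested field names sanitized."""
    # Imported lazily so the pure-Python helpers above work without Spark.
    from pyspark.sql.functions import col
    from pyspark.sql.types import StructType

    new_schema = StructType.fromJson(sanitize_schema_json(df.schema.jsonValue()))
    return df.select([
        # Backticks keep names containing dots or hyphens from being parsed.
        col("`{}`".format(old.name.replace("`", "``")))
        .cast(new.dataType)
        .alias(new.name)
        for old, new in zip(df.schema.fields, new_schema.fields)
    ])
```

Usage would be `df = rename_nested_columns(df)` followed by `df.printSchema()` to check the result before writing to Hive.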

Bryce Ramgovind
    Possible duplicate of [Rename nested field in spark dataframe](https://stackoverflow.com/questions/43004849/rename-nested-field-in-spark-dataframe) – pault Jun 21 '18 at 22:16

0 Answers