What is the best/ fastest way to rename a dataframe's columns?
I have noticed that the ~.withColumnRename function performs very poorly when renaming multiple columns, regardless of the size of the dataframe and it also does not use any of the executors.
What am I missing?
Here is my situation: I have a JSON object with thousands of values in a deeply nested structure. The names of the variables in the JSON structure keep repeating:
|-- taskData_data_variables: struct (nullable = true)
| |-- ProcessId: string (nullable = true)
| |-- CommentsReceived: struct (nullable = true)
| | |-- @metadata: struct (nullable = true)
| | | |-- dirty: boolean (nullable = true)
| | | |-- invalid: boolean (nullable = true)
| | | |-- objectID: string (nullable = true)
| | | |-- shared: boolean (nullable = true)
| | |-- items: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- @metadata: struct (nullable = true)
| | | | | |-- className: string (nullable = true)
| | | | | |-- dirty: boolean (nullable = true)
| | | | | |-- invalid: boolean (nullable = true)
I now need to "spread out" the data into columns and one-to-many tables. As part of that process I need to rename the columns to preserve their source meta data.
At the end of the first iteration of the process I end up with a table like this:
|-- taskData_data_variables: struct (nullable = true)
|-- taskData_data_variables_ProcessId: string (nullable = true)
|-- taskData_data_variables_CommentsReceived: struct (nullable = true)
|-- taskData_data_variables_CommentsReceived_metadata: struct (nullable = true)
|-- taskData_data_variables_CommentsReceived_metadata_dirty: boolean (nullable = true)
|-- taskData_data_variables_CommentsReceived_metadata_invalid: boolean (nullable = true)
|-- taskData_data_variables_CommentsReceived_metadata_objectID: string (nullable = true)
|-- taskData_data_variables_CommentsReceived_metadata_shared: boolean (nullable = true)
|-- taskData_data_variables_CommentsReceived_metadata_items: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- @metadata: struct (nullable = true)
| | | | | |-- className: string (nullable = true)
| | | | | |-- dirty: boolean (nullable = true)
| | | | | |-- invalid: boolean (nullable = true)
My code works, but it takes forever, since there are hundreds of these values in various depths of nested levels.
I also have code that will then take the array and place it in another table.
I just need some guidance/ best practice advice to rename the columns in the most efficient way.
Here follows the code:
var dataframeRenamed = spark.sql("SELECT * FROM some_table")
val dfNewSchema = dataframeRenamed.schema
for (renameField <- dfNewSchema) {
dataframeRenamed = dataframeRenamed.withColumnRenamed(renameField.name, renameField.name + "_somethingElse")
}