
What is the best/fastest way to rename a dataframe's columns?

I have noticed that the .withColumnRenamed function performs very poorly when renaming multiple columns, regardless of the size of the dataframe, and it also does not use any of the executors.

What am I missing?

Here is my situation: I have a JSON object with thousands of values in a deeply nested structure. The names of the variables in the JSON structure keep repeating:

 |-- taskData_data_variables: struct (nullable = true)
 |    |-- ProcessId: string (nullable = true)
 |    |-- CommentsReceived: struct (nullable = true)
 |    |    |-- @metadata: struct (nullable = true)
 |    |    |    |-- dirty: boolean (nullable = true)
 |    |    |    |-- invalid: boolean (nullable = true)
 |    |    |    |-- objectID: string (nullable = true)
 |    |    |    |-- shared: boolean (nullable = true)
 |    |    |-- items: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- @metadata: struct (nullable = true)
 |    |    |    |    |    |-- className: string (nullable = true)
 |    |    |    |    |    |-- dirty: boolean (nullable = true)
 |    |    |    |    |    |-- invalid: boolean (nullable = true)

I now need to "spread out" the data into columns and one-to-many tables. As part of that process I need to rename the columns to preserve their source metadata.
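For concreteness, pulling a nested field up into a prefixed top-level column is essentially a select with one alias per leaf field, roughly like this (a sketch only; the field paths come from the schema above, and df stands for the source DataFrame):

import org.apache.spark.sql.functions.col

// Sketch: promote nested struct fields to prefixed top-level columns in a single projection.
val spreadOut = df.select(
  col("taskData_data_variables.ProcessId")
    .alias("taskData_data_variables_ProcessId"),
  // backticks around @metadata because of the special character
  col("taskData_data_variables.CommentsReceived.`@metadata`.dirty")
    .alias("taskData_data_variables_CommentsReceived_metadata_dirty")
  // ... one aliased expression per leaf field
)
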

At the end of the first iteration of the process I end up with a table like this:

|-- taskData_data_variables: struct (nullable = true)
|-- taskData_data_variables_ProcessId: string (nullable = true)
|-- taskData_data_variables_CommentsReceived: struct (nullable = true)
|-- taskData_data_variables_CommentsReceived_metadata: struct (nullable = true)
|-- taskData_data_variables_CommentsReceived_metadata_dirty: boolean (nullable = true)
|-- taskData_data_variables_CommentsReceived_metadata_invalid: boolean (nullable = true)
|-- taskData_data_variables_CommentsReceived_metadata_objectID: string (nullable = true)
|-- taskData_data_variables_CommentsReceived_metadata_shared: boolean (nullable = true)
|-- taskData_data_variables_CommentsReceived_metadata_items: array (nullable = true)
|    |    |    |-- element: struct (containsNull = true)
|    |    |    |    |-- @metadata: struct (nullable = true)
|    |    |    |    |    |-- className: string (nullable = true)
|    |    |    |    |    |-- dirty: boolean (nullable = true)
|    |    |    |    |    |-- invalid: boolean (nullable = true)

My code works, but it takes forever, since there are hundreds of these values at various depths of nesting.

I also have code that will then take the array and place it in another table.
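For context, that step is roughly an explode into a child table keyed by the object ID, along these lines (again only a sketch; flatDf stands for a DataFrame with the flattened schema shown above, and parent_objectID is just an illustrative name):

import org.apache.spark.sql.functions.{col, explode}

// Sketch: one row per array element, plus a key linking back to the parent
// record, so the result can be written out as a one-to-many table.
val itemsTable = flatDf.select(
  col("taskData_data_variables_CommentsReceived_metadata_objectID")
    .alias("parent_objectID"),
  explode(col("taskData_data_variables_CommentsReceived_metadata_items"))
    .alias("item")
)
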

I just need some guidance/best-practice advice on renaming the columns in the most efficient way.

Here is the code:

// Rename every top-level column by appending a suffix,
// one withColumnRenamed call per column.
var dataframeRenamed = spark.sql("SELECT * FROM some_table")
val dfNewSchema = dataframeRenamed.schema

for (renameField <- dfNewSchema) {
  dataframeRenamed = dataframeRenamed
    .withColumnRenamed(renameField.name, renameField.name + "_somethingElse")
}
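
For comparison, a single-pass rename of all top-level columns via toDF would look something like this (a sketch, assuming every column to rename sits at the top level; _somethingElse is the same placeholder suffix as in the loop above):

// Sketch: compute all the new names up front and apply them in one
// projection instead of chaining withColumnRenamed once per column.
val df = spark.sql("SELECT * FROM some_table")
val newNames = df.columns.map(_ + "_somethingElse")
val renamedAtOnce = df.toDF(newNames: _*)

I am not sure whether this (or a single select with aliases) is the right direction for the deeply nested structure above, hence the question.
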
OzzMaster
    Looks like you forgot to include the code ([mcve], [How to make good reproducible Apache Spark Dataframe examples](https://stackoverflow.com/q/48427185/8371915)) – Alper t. Turker Aug 14 '18 at 09:31
  • I contend that I did. I want to understand why this operation takes so long. Where is it running? What does it do? Is there a better way to rename thousands of columns in a dataframe? – OzzMaster Feb 26 '19 at 06:35

0 Answers