
I am new to Scala and am planning to convert PySpark code to Scala. In PySpark, we can reuse the same variable across multiple transformations. Here is an example:

    final_df = final_df.withColumn('xyz', final_df["items_summaries_marketplaceId"])
    # drop unwanted columns
    unwanted_columns = [x for x in final_df.columns if 'xyz' in x or 'zxy' in x]
    final_df = final_df.drop(*unwanted_columns)

final_df is used for both transformations. I have converted this code to Scala. From my research, it appears I have to declare a new variable after every transformation. Here is the code:

    val final_df = df.withColumn("xyz", df("items_summaries_marketplaceId"))
    val drop_language_cols = final_df.drop(final_df.columns.filter(_.contains("xyz")): _*)
    val drop_zxy_cols = drop_language_cols.drop(drop_language_cols.columns.filter(_.contains("zxy")): _*)

Do I have to declare a new variable after every transformation? Any help would be highly appreciated.

Nabeel Khan Ghauri

1 Answer


"Saving" a variable name isn't really a good reason to introduce mutability to your code. There is nothing wrong with using multiple intermediate variables with different names if you prefer. In this specific case, you don't really need to though:

   val final_df = df
     .withColumn("xyz", df("items_summaries_marketplaceId"))
     .drop("xyz")                                       // the column added above (not in df.columns)
     .drop(df.columns.filter(_.contains("xyz")): _*)    // pre-existing columns containing "xyz"
     .drop(df.columns.filter(_.contains("zxy")): _*)    // pre-existing columns containing "zxy"
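If you prefer a closer analogue of the PySpark list comprehension, you can also collect every unwanted column with a single filter and one drop call. A minimal sketch, assuming the standard Spark DataFrame API; the withExtra, unwantedColumns, and cleaned_df names are illustrative:

    val withExtra = df.withColumn("xyz", df("items_summaries_marketplaceId"))
    // one pass over the column names, mirroring the Python list comprehension;
    // withExtra.columns includes the newly added "xyz" column, so it is dropped too
    val unwantedColumns = withExtra.columns.filter(c => c.contains("xyz") || c.contains("zxy"))
    val cleaned_df = withExtra.drop(unwantedColumns: _*)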

NB: I have no idea why you are adding that "xyz" column in the first place if you are going to drop it right away.
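For completeness: Scala does let you reassign a var, which mirrors the PySpark pattern line for line, but introducing mutability just to reuse a name is exactly what the advice above recommends against. A sketch for contrast only, with an illustrative result name:

    // compiles and runs, but idiomatic Scala prefers vals and chained calls
    var result = df.withColumn("xyz", df("items_summaries_marketplaceId"))
    result = result.drop(result.columns.filter(_.contains("xyz")): _*)
    result = result.drop(result.columns.filter(_.contains("zxy")): _*)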

Dima