
I am new to Scala and am planning to convert PySpark code to Scala. In PySpark, we can reuse the same variable across multiple transformations. Here is an example:

    final_df = final_df.withColumn('xyz', final_df["items_summaries_marketplaceId"])
    # drop unwanted columns
    unwanted_columns = [x for x in final_df.columns if 'xyz' in x or 'zxy' in x]
    final_df = final_df.drop(*unwanted_columns)

final_df is used for both transformations. I have converted this code to Scala. From my research, it appears I have to declare a new variable after every transformation. Here is the code:

    val final_df = df.withColumn("xyz", df("items_summaries_marketplaceId"))
    val drop_language_cols = final_df.drop(final_df.columns.filter(_.contains("xyz")): _*)
    val drop_zxy_cols = drop_language_cols.drop(drop_language_cols.columns.filter(_.contains("zxy")): _*)

Do I have to declare a new variable after every transformation? Any help would be highly appreciated.

Nabeel Khan Ghauri

1 Answer


"Saving" a variable name isn't really a good reason to introduce mutability to your code. There is nothing wrong with using multiple intermediate variables with different names if you prefer. In this specific case, you don't really need to though:

   val final_df = df
     .withColumn("xyz", df("items_summaries_marketplaceId"))
     .drop("xyz")                                       // the column added above (not in df.columns)
     .drop(df.columns.filter(_.contains("xyz")): _*)    // pre-existing columns containing "xyz"
     .drop(df.columns.filter(_.contains("zxy")): _*)    // pre-existing columns containing "zxy"
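If you prefer a closer analogue of the PySpark list comprehension, you can also collect every unwanted column with a single filter and one drop call. A minimal sketch, assuming the standard Spark DataFrame API; the withExtra, unwantedColumns, and cleaned_df names are illustrative:

    val withExtra = df.withColumn("xyz", df("items_summaries_marketplaceId"))
    // one pass over the column names, mirroring the Python list comprehension;
    // withExtra.columns includes the newly added "xyz" column, so it is dropped too
    val unwantedColumns = withExtra.columns.filter(c => c.contains("xyz") || c.contains("zxy"))
    val cleaned_df = withExtra.drop(unwantedColumns: _*)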

NB: I have no idea why you are adding that "xyz" column in the first place if you are going to drop it right away.
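For completeness: Scala does let you reassign a var, which mirrors the PySpark pattern line for line, but introducing mutability just to reuse a name is exactly what the advice above recommends against. A sketch for contrast only, with an illustrative result name:

    // compiles and runs, but idiomatic Scala prefers vals and chained calls
    var result = df.withColumn("xyz", df("items_summaries_marketplaceId"))
    result = result.drop(result.columns.filter(_.contains("xyz")): _*)
    result = result.drop(result.columns.filter(_.contains("zxy")): _*)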

Dima