I am working with Spark DataFrames and want to update a column column_to_be_updated
in a Hive table using Spark SQL in Scala.
My code so far does work with smaller dataframes:
val data_frame = spark.sql("Select ... From TableXX")
val id_list = spark.sql("Select Id From TableXY Where ...").collect().map(_(0)).toList
val updated_frame = data_frame.withColumn("column_to_be_updated",
  when($"other_column_of_frame".isin(id_list:_*), 1)
    .otherwise($"column_to_be_updated"))
What I want is to update the column column_to_be_updated whenever the entry in
other_column_of_frame appears in the Id column of TableXY.
My workaround is to collect the Id column into a list first and then use the
.isin statement.
However, TableXX and TableXY both have a lot of rows, so collecting the ids
seems to crash the job; id_list becomes too large to hold on the driver.
Is there any other workaround or more efficient solution for what I am trying to achieve?
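For context, I have been wondering whether a left join could replace the collect + .isin step entirely, so the ids never leave the cluster. A rough sketch of what I have in mind (assuming TableXY really has a single Id column; the marker column name id_match is just something I made up):

```scala
import org.apache.spark.sql.functions.{lit, when}

// Sketch: replace collect() + isin with a distributed left join.
// Distinct ids from TableXY, tagged with a marker column.
val ids = spark.sql("Select Id From TableXY Where ...")
  .distinct()
  .withColumn("id_match", lit(1))

// Rows whose other_column_of_frame matched an Id get id_match = 1,
// non-matching rows get null, so when/otherwise works as before.
val updated_frame = data_frame
  .join(ids, data_frame("other_column_of_frame") === ids("Id"), "left")
  .withColumn("column_to_be_updated",
    when($"id_match" === 1, 1).otherwise($"column_to_be_updated"))
  .drop("Id", "id_match")
```

I am not sure whether this is the idiomatic way to do it or whether the join shuffle would be any cheaper in practice.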
Thanks in advance!