I need to create a column called `sim_count` for every row in my Spark DataFrame, whose value is the count of all other rows in the DataFrame that match some conditions based on the current row's values. Is it possible to access the current row's values while using `when`?
Is something like this possible? I have already implemented this logic using a UDF, but serializing the DataFrame to map over its RDD is very costly, and I am trying to see if there is a faster way to compute this count.
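For reference, my current implementation is shaped roughly like the sketch below (simplified, not my exact code; it assumes the matching condition is the `between` check from the snippet further down, and that all `price_list` values fit in memory to broadcast):

```python
import pyspark.sql.functions as f
from pyspark.sql.types import LongType

# Pull every price_list value to the driver and broadcast it, so the UDF can
# scan the full list once per row (this serialization is the expensive part).
prices = [r["price_list"] for r in df.select("price_list").collect()]
b_prices = spark.sparkContext.broadcast(prices)  # `spark` = active SparkSession

@f.udf(returnType=LongType())
def sim_count_udf(lo, hi):
    # Count rows whose price_list falls inside this row's [col_1, col_2] range.
    if lo is None or hi is None:
        return None
    return sum(1 for p in b_prices.value if p is not None and lo <= p <= hi)

df = df.withColumn("sim_count", sim_count_udf(f.col("col_1"), f.col("col_2")))
```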
Edit

`f.col("col_1")` and `f.col("col_2")` inside the `between(...)` call stand for `<Row's col_1 val>` and `<Row's col_2 val>`, i.e. they refer to the outer-scope row I am calculating the count for, NOT the inner-scope rows inside the `df.where`. For example, I know this is incorrect syntax, but I'm looking for something like:
```python
import pyspark.sql.functions as f
from pyspark.sql.types import LongType

df.withColumn('sim_count',
    f.when(
        f.col("col_1").isNotNull(),
        (
            # Intended meaning: count the rows whose price_list falls
            # between the *current outer row's* col_1 and col_2 values.
            df.where(
                f.col("price_list").between(f.col("col_1"), f.col("col_2"))
            ).count()
        )
    ).otherwise(f.lit(None).cast(LongType()))
)
```
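The closest UDF-free equivalent I can think of is a non-equi self-join followed by a group-by count. The sketch below is hypothetical: it assumes the DataFrame has a unique `id` key column to group back on, and reuses the `price_list` / `col_1` / `col_2` names from above. I'm not sure it would actually be faster, since a non-equi join like this typically executes as a (broadcast) nested-loop join:

```python
import pyspark.sql.functions as f

# Join every row against all price_list values that fall in its
# [col_1, col_2] range, then count the matches per row.
# Assumes a unique `id` column exists (hypothetical).
right = df.select(f.col("price_list").alias("r_price"))

counts = (
    df.alias("l")
    .join(right, f.col("r_price").between(f.col("l.col_1"), f.col("l.col_2")), "left")
    .groupBy("l.id")
    .agg(f.count("r_price").alias("sim_count"))
)

result = (
    df.join(counts, "id", "left")
      # Mirror the when/otherwise above: sim_count stays null when col_1 is null.
      .withColumn("sim_count",
                  f.when(f.col("col_1").isNotNull(), f.col("sim_count")))
)
```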