
I need to create a column called `sim_count` for every row in my Spark dataframe, whose value is the count of all other rows in the dataframe that match some conditions based on the current row's values. Is it possible to access row values while using `when`?

Is something like this possible? I have already implemented this logic using a UDF, but the serialization involved in mapping over the dataframe's RDD is very costly, and I am trying to see whether there is a faster way to compute this count.
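For reference, a hypothetical reconstruction of the kind of UDF approach described (my own sketch, not the actual implementation; `spark`, `df`, and the column names `col1`, `col2`, and `price_list` are assumed from the snippet in the edit below):

import pyspark.sql.functions as f
from pyspark.sql.types import LongType

# Broadcast all price_list values so each row's UDF call can scan them;
# pushing every row through Python like this is what makes the UDF slow.
prices = spark.sparkContext.broadcast(
    [r.price_list for r in df.select("price_list").collect()]
)

@f.udf(returnType=LongType())
def sim_count_udf(lo, hi):
    if lo is None or hi is None:
        return None
    # Count every collected price that falls within this row's [lo, hi] range.
    return sum(1 for p in prices.value if p is not None and lo <= p <= hi)

df = df.withColumn("sim_count", sim_count_udf(f.col("col1"), f.col("col2")))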

Edit

`<Row's col_1 val>` refers to the outer-scope row I am calculating the count for, NOT the inner-scope row inside the `df.where`. For example, I know this is incorrect syntax, but I'm looking for something like:

df.withColumn(
    'sim_count',
    f.when(
        f.col("col_1").isNotNull(),
        df.where(
            f.col("price_list").between(f.col("col1"), f.col("col2"))
        ).count()
    ).otherwise(f.lit(None).cast(LongType()))
)
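
For illustration, one UDF-free way this might be expressed is as a self-join: a sketch under assumed column names (`col1`, `col2`, `price_list` from the snippet above) with a hypothetical `row_id` key, not something confirmed in the thread:

import pyspark.sql.functions as f

# Tag each row with a hypothetical unique id so matches can be grouped per row.
df_id = df.withColumn("row_id", f.monotonically_increasing_id())

counts = (
    df_id.alias("a")
    .join(
        df_id.alias("b"),
        # Match every *other* row whose price_list falls in this row's range.
        f.col("b.price_list").between(f.col("a.col1"), f.col("a.col2"))
        & (f.col("a.row_id") != f.col("b.row_id")),
        "left",
    )
    .groupBy(f.col("a.row_id").alias("row_id"))
    .agg(f.count("b.price_list").alias("sim_count"))
)

result = df_id.join(counts, "row_id").drop("row_id")

With a left join, unmatched rows get a `sim_count` of 0 rather than null; the `f.when(f.col("col_1").isNotNull(), ...)` guard above could be layered on afterwards if null is preferred. Note that a range-conditioned self-join is effectively a cross join, so it can still be expensive on large dataframes.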
Megan
  • Possible duplicate of [Using a column value as a parameter to a spark DataFrame function](https://stackoverflow.com/questions/51140470/using-a-column-value-as-a-parameter-to-a-spark-dataframe-function) – pault Nov 19 '19 at 21:41
  • 1
    Use `expr` for example: `df.where(f.expr("price_list between col1 and col2"))` – pault Nov 19 '19 at 21:42
  • Ah ok will try this! I think I will store `expr` as a string column called `sim_expr` in my dataframe, and do `df.where(f.expr("sim_expr"))` instead – Megan Nov 19 '19 at 22:44
  • Unfortunately, I don't think that would work. – pault Nov 19 '19 at 22:46
  • Thanks, yes, just tried it and I get `pyspark.sql.utils.AnalysisException: cannot resolve 'sim_expr'` – Megan Nov 19 '19 at 22:47
  • @pault I think this is not the row value I am looking for. What I am looking for is the outer scope's row value, not the row value from the `df.where`. Is there a way to use the row value from `f.when` inside `df.where`? The calculation I am doing is a count of similar rows from the entire `df`, **for each row in the dataframe**. Thank you for any and all help – Megan Nov 19 '19 at 23:26
  • 1
    Can you try to provide a [mcve] with a small sample dataframe and the corresponding desired output? – pault Nov 20 '19 at 15:21
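
To make the `expr` suggestion above concrete: `f.expr` parses a literal SQL string into a Column when the plan is built, which also explains the failure reported in the comments: `f.expr("sim_expr")` is read as a reference to a column named `sim_expr`, not as each row's stored expression string. A hedged sketch of the fragment driving a self-join (the `a`/`b` aliases are my own assumption):

import pyspark.sql.functions as f

# The SQL fragment from the comments, used as a join condition between two
# aliases of the same dataframe (aliases are illustrative assumptions).
joined = df.alias("a").join(
    df.alias("b"),
    f.expr("b.price_list BETWEEN a.col1 AND a.col2"),
    "left",
)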

0 Answers