
I need to create a column called `sim_count` for every row in my Spark dataframe, whose value is the count of all other rows in the dataframe that match some conditions based on the current row's values. Is it possible to access row values while using `when`?

Is something like this possible? I have already implemented this logic using a UDF, but the serialization involved in mapping over the dataframe's RDD is very costly, and I am trying to see whether there is a faster way to compute this count.
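For reference, a hypothetical reconstruction of the kind of UDF approach described (my own sketch, not the actual implementation; `spark`, `df`, and the column names `col1`, `col2`, and `price_list` are assumed from the snippet in the edit below):

import pyspark.sql.functions as f
from pyspark.sql.types import LongType

# Broadcast all price_list values so each row's UDF call can scan them;
# pushing every row through Python like this is what makes the UDF slow.
prices = spark.sparkContext.broadcast(
    [r.price_list for r in df.select("price_list").collect()]
)

@f.udf(returnType=LongType())
def sim_count_udf(lo, hi):
    if lo is None or hi is None:
        return None
    # Count every collected price that falls within this row's [lo, hi] range.
    return sum(1 for p in prices.value if p is not None and lo <= p <= hi)

df = df.withColumn("sim_count", sim_count_udf(f.col("col1"), f.col("col2")))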

Edit

`<Row's col_1 val>` refers to the outer-scope row I am calculating the count for, NOT the inner-scope row inside the `df.where`. For example, I know this is incorrect syntax, but I'm looking for something like:

df.withColumn(
    'sim_count',
    f.when(
        f.col("col_1").isNotNull(),
        df.where(
            f.col("price_list").between(f.col("col1"), f.col("col2"))
        ).count()
    ).otherwise(f.lit(None).cast(LongType()))
)
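
For illustration, one UDF-free way this might be expressed is as a self-join: a sketch under assumed column names (`col1`, `col2`, `price_list` from the snippet above) with a hypothetical `row_id` key, not something confirmed in the thread:

import pyspark.sql.functions as f

# Tag each row with a hypothetical unique id so matches can be grouped per row.
df_id = df.withColumn("row_id", f.monotonically_increasing_id())

counts = (
    df_id.alias("a")
    .join(
        df_id.alias("b"),
        # Match every *other* row whose price_list falls in this row's range.
        f.col("b.price_list").between(f.col("a.col1"), f.col("a.col2"))
        & (f.col("a.row_id") != f.col("b.row_id")),
        "left",
    )
    .groupBy(f.col("a.row_id").alias("row_id"))
    .agg(f.count("b.price_list").alias("sim_count"))
)

result = df_id.join(counts, "row_id").drop("row_id")

With a left join, unmatched rows get a `sim_count` of 0 rather than null; the `f.when(f.col("col_1").isNotNull(), ...)` guard above could be layered on afterwards if null is preferred. Note that a range-conditioned self-join is effectively a cross join, so it can still be expensive on large dataframes.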
Megan
  • Possible duplicate of [Using a column value as a parameter to a spark DataFrame function](https://stackoverflow.com/questions/51140470/using-a-column-value-as-a-parameter-to-a-spark-dataframe-function) – pault Nov 19 '19 at 21:41
  • 1
    Use `expr` for example: `df.where(f.expr("price_list between col1 and col2"))` – pault Nov 19 '19 at 21:42
  • Ah ok will try this! I think I will store `expr` as a string column called `sim_expr` in my dataframe, and do `df.where(f.expr("sim_expr"))` instead – Megan Nov 19 '19 at 22:44
  • Unfortunately, I don't think that would work. – pault Nov 19 '19 at 22:46
  • Thanks, yes, just tried it and I get `pyspark.sql.utils.AnalysisException: cannot resolve 'sim_expr'` – Megan Nov 19 '19 at 22:47
  • @pault I think this is not the row value I am looking for. What I am looking for is the outer scope's row value, not the row value from the `df.where`. Is there a way to use the row value from `f.when` inside `df.where`? The calculation I am doing is a count of similar rows from the entire `df`, **for each row in the dataframe**. Thank you for any and all help – Megan Nov 19 '19 at 23:26
  • 1
    Can you try to provide a [mcve] with a small sample dataframe and the corresponding desired output? – pault Nov 20 '19 at 15:21
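
To make the `expr` suggestion above concrete: `f.expr` parses a literal SQL string into a Column when the plan is built, which also explains the failure reported in the comments: `f.expr("sim_expr")` is read as a reference to a column named `sim_expr`, not as each row's stored expression string. A hedged sketch of the fragment driving a self-join (the `a`/`b` aliases are my own assumption):

import pyspark.sql.functions as f

# The SQL fragment from the comments, used as a join condition between two
# aliases of the same dataframe (aliases are illustrative assumptions).
joined = df.alias("a").join(
    df.alias("b"),
    f.expr("b.price_list BETWEEN a.col1 AND a.col2"),
    "left",
)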

0 Answers