2

The following code is pretty slow.
Is there a way of creating multiple columns at once over the same window, so Spark does not need to partition and order the data multiple times?

w = Window().partitionBy("k").orderBy("t")

df = df.withColumn(F.col("a"), F.last("a",True).over(w))
df = df.withColumn(F.col("b"), F.last("b",True).over(w))
df = df.withColumn(F.col("c"), F.last("c",True).over(w))
...
ZygD
  • 22,092
  • 39
  • 79
  • 102
Dusty
  • 121
  • 10

2 Answers2

2

You dont have to generate one column at a time. Use list comprehension. Code below

new=['a','b','c']
df = df.select(
    "*", *[F.last(x, True).over(w).alias(f"{x}") for x in new]
    
)
wwnde
  • 26,119
  • 6
  • 18
  • 32
  • 1
    This is an excellent suggestion. I needed to add 178 new columns based on 178 existing ones to a dataframe with 27 million rows. I had been doing this in a loop, but it took 4 hours. Then, when I added a condition to set some of the values as missing based on one of the other columns, I got errors every time, no matter how large a cluster I used. The following code, on the other hand, finished in seconds. `rolling_sum = movsum_cache.select("*", *[F.when(col("myr") < 201800, None).otherwise(F.sum(var).over(ms_window)).alias(f"ms_{var}") for var in val_vars])` – Paul de Barros Dec 20 '22 at 19:26
1

I'm not sure that Spark does partitioning and reordering several times, as you use the same window consecutively. However, .select is usually a better alternative than .withColumn.

df = df.select(
    "*",
    F.last("a", True).over(w).alias("a"),
    F.last("b", True).over(w).alias("b"),
    F.last("c", True).over(w).alias("c"),
)

To find out if partitioning and ordering is done several times, you need to analyse the df.explain() results.

ZygD
  • 22,092
  • 39
  • 79
  • 102
  • 1
    Why is select better than withColumn? – Dusty May 09 '22 at 06:16
  • 1
    Read [this](https://medium.com/@deepa.account/spark-dataframes-select-vs-withcolumn-31388cecbca9) and [this](https://stackoverflow.com/questions/59789689/spark-dag-differs-with-withcolumn-vs-select) if you want a better understanding. I don't think I can add something more than those guys. But most importantly, test the option from above, maybe you will see no difference, so it will mean that it's not this problem that you are having. – ZygD May 09 '22 at 06:45