I think you need to add rowsBetween to your window clause.
Example:
df.show()
#+---+---+
#|  i|  j|
#+---+---+
#|  1|  a|
#|  1|  b|
#|  1|  c|
#|  2|  c|
#+---+---+
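from pyspark.sql import Window
from pyspark.sql.functions import col, count

# the frame spans the whole partition, so every row sees the full count
w = Window.partitionBy("i").rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df.withColumn("count", count(col("j")).over(w)).show()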
#+---+---+-----+
#|  i|  j|count|
#+---+---+-----+
#|  1|  a|    3|
#|  1|  b|    3|
#|  1|  c|    3|
#|  2|  c|    1|
#+---+---+-----+
Usually, when the window has an .orderBy clause, rowsBetween needs to be added as well, because with orderBy the frame defaults to unboundedPreceding through currentRow (as a range frame, so rows that tie on the ordering column are counted together).
w = Window.partitionBy("i").orderBy("j")
df.withColumn("count",count(col("j")).over(w)).show()
# incremental (running) count
#+---+---+-----+
#|  i|  j|count|
#+---+---+-----+
#|  1|  a|    1|
#|  1|  b|    2|
#|  1|  c|    3|
#|  2|  c|    1|
#+---+---+-----+
w = Window.partitionBy("i").orderBy("j").rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df.withColumn("count", count(col("j")).over(w)).show()
# total number of rows in each partition
#+---+---+-----+
#|  i|  j|count|
#+---+---+-----+
#|  1|  a|    3|
#|  1|  b|    3|
#|  1|  c|    3|
#|  2|  c|    1|
#+---+---+-----+
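If it helps to see what the two frames are doing, here is a rough plain-Python sketch that mimics both counts without a Spark session. It is only an emulation, not Spark code: the helper name count_over and its running flag are made up for illustration, it assumes j has no nulls, and it counts by row position, which matches Spark's default frame only when there are no ties in the ordering column.

```python
from itertools import groupby

# Same (i, j) rows as the df above.
rows = [(1, "a"), (1, "b"), (1, "c"), (2, "c")]

def count_over(rows, running):
    """Emulate count(j) over a window partitioned by i and ordered by j.

    running=True  -> frame ends at the current row (running count)
    running=False -> frame covers the whole partition (total count)
    Hypothetical helper for illustration only; assumes no nulls and no ties.
    """
    out = []
    ordered = sorted(rows)  # sort by i, then by j
    for _, group in groupby(ordered, key=lambda r: r[0]):
        part = list(group)  # all rows of one partition
        for pos, (i, j) in enumerate(part, start=1):
            # pos = rows seen so far in the partition, len(part) = all of them
            out.append((i, j, pos if running else len(part)))
    return out

running_counts = count_over(rows, running=True)
total_counts = count_over(rows, running=False)
print(running_counts)  # [(1, 'a', 1), (1, 'b', 2), (1, 'c', 3), (2, 'c', 1)]
print(total_counts)    # [(1, 'a', 3), (1, 'b', 3), (1, 'c', 3), (2, 'c', 1)]
```

The first list lines up with the orderBy-only window, the second with the window that also sets rowsBetween to cover the whole partition.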