
In PySpark, would it be possible to obtain the total number of rows in a particular window?

Right now I am using:

from pyspark.sql import Window
from pyspark.sql import functions as F
from pyspark.sql.functions import col

w = Window.partitionBy("column_to_partition_by")

F.count(col("column_1")).over(w)

However, this only gives me the incremental row count. What I need is the total number of rows in that particular window partition. Can anyone tell me the command for this?
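For context, a running ("incremental") count like the one described typically comes from a window that also has an .orderBy; a minimal sketch of that situation (the ordering column name here is hypothetical):

# With an orderBy, the frame defaults to (unboundedPreceding, currentRow),
# so the count grows row by row within each partition.
w = Window.partitionBy("column_to_partition_by").orderBy("some_order_column")
F.count(col("column_1")).over(w)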

Menno Van Dijk
  • Impossible to tell what's wrong based on the information you've provided. What you have looks correct. Please update with a [mcve]. Refer to [How to make good reproducible spark DataFrame examples](https://stackoverflow.com/questions/48427185/how-to-make-good-reproducible-apache-spark-examples). – pault Mar 13 '20 at 15:24
  • @pault Oh come on. This should not be a question that requires an MRE. The code is correct, that's what I pointed out in my question. However, it does not yield the result that I would like. – Menno Van Dijk Mar 13 '20 at 15:29
  • *The code is correct, that's what I pointed out in my question.* - This is precisely the reason that you need an MRE here. The code you provided should do exactly what you're asking. If it does not, then you need to provide an example that shows how it's not yielding the result you would like. Also, it's unclear what "incremental row count" means. – pault Mar 13 '20 at 15:31
  • See for example: [Adding a group count column to a PySpark dataframe](https://stackoverflow.com/questions/48793701/adding-a-group-count-column-to-a-pyspark-dataframe) – pault Mar 13 '20 at 15:33
  • @pault I'm sorry but I honestly do not see your point. The code is correct but *does not do what it is intended to do*. Hence, I ask how one could extend this code to make it work for my example. – Menno Van Dijk Mar 13 '20 at 15:35
  • My point is that you haven't explained what you intended to do, i.e. *why* is this the wrong answer? – pault Mar 13 '20 at 15:37

1 Answer


I think you need to add rowsBetween to your window clause.

Example:

df.show()
#+---+---+
#|  i|  j|
#+---+---+
#|  1|  a|
#|  1|  b|
#|  1|  c|
#|  2|  c|
#+---+---+

import sys
from pyspark.sql import Window
from pyspark.sql.functions import count, col

w = Window.partitionBy("i").rowsBetween(-sys.maxsize,sys.maxsize)

df.withColumn("count",count(col("j")).over(w)).show()
#+---+---+-----+
#|  i|  j|count|
#+---+---+-----+
#|  1|  a|    3|
#|  1|  b|    3|
#|  1|  c|    3|
#|  2|  c|    1|
#+---+---+-----+

Usually when a window has an .orderBy clause, rowsBetween needs to be added as well, because an orderBy clause defaults the frame to unboundedPreceding and currentRow, which produces a running count.

w = Window.partitionBy("i").orderBy("j")
df.withColumn("count",count(col("j")).over(w)).show()

#incremental count
#+---+---+-----+
#|  i|  j|count|
#+---+---+-----+
#|  1|  a|    1|
#|  1|  b|    2|
#|  1|  c|    3|
#|  2|  c|    1|
#+---+---+-----+
w = Window.partitionBy("i").orderBy("j").rowsBetween(-sys.maxsize,sys.maxsize)
df.withColumn("count",count(col("j")).over(w)).show()

#total row count per partition
#+---+---+-----+
#|  i|  j|count|
#+---+---+-----+
#|  1|  a|    3|
#|  1|  b|    3|
#|  1|  c|    3|
#|  2|  c|    1|
#+---+---+-----+
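As a side note, since Spark 2.1 the unbounded frame can also be written with the named boundaries Window.unboundedPreceding and Window.unboundedFollowing instead of ±sys.maxsize; a minimal sketch of the same total count:

from pyspark.sql import Window
from pyspark.sql.functions import count, col

# Same total-per-partition count using named frame boundaries,
# no sys import needed.
w = Window.partitionBy("i").orderBy("j").rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
df.withColumn("count", count(col("j")).over(w)).show()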
notNull
  • `.rowsBetween(-sys.maxsize,sys.maxsize)` is meaningless in your first example. If you remove it, the answer is exactly the same. In the second example, it negates the impact of the `orderby`. – pault Mar 13 '20 at 15:36
  • You're a hero. This yields exactly what I would want, thank you very much. – Menno Van Dijk Mar 13 '20 at 15:39
  • @pault It does not. Without the .rowsBetween I do not get my answer, when using .rowsBetween I get exactly my answer. – Menno Van Dijk Mar 13 '20 at 15:40
  • @MennoVanDijk okay if you say so. Glad you solved your issue. – pault Mar 13 '20 at 15:41