
I have a very large PySpark DataFrame and I would like to calculate the number of rows, but the count() method is too slow. Is there any faster method?

Luigi
    Possible duplicate of [Getting the count of records in a data frame quickly](https://stackoverflow.com/questions/39357238/getting-the-count-of-records-in-a-data-frame-quickly) and maybe [Count on Spark Dataframe is extremely slow](https://stackoverflow.com/questions/45142105/count-on-spark-dataframe-is-extremely-slow) – pault Apr 09 '19 at 15:01
  • Short answer is no, but if you cache it will speed up subsequent calls to count. – pault Apr 09 '19 at 15:03
  • Aren't there at least approximate methods? – Luigi Apr 10 '19 at 15:54
  • try `df.rdd.countApprox()` perhaps (a sketch of both options follows below) – pault Apr 10 '19 at 16:16
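
To illustrate the two suggestions above (caching and `countApprox`), here is a minimal sketch; the DataFrame size and the timeout value are placeholders, not figures from the thread:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(10_000_000)   # stand-in for the large DataFrame

    # Option 1: cache the DataFrame; the first count materializes the cache,
    # so subsequent count() calls are much cheaper.
    df.cache()
    exact = df.count()

    # Option 2: approximate count that returns after at most ~1 second
    # (timeout is in milliseconds), possibly before all tasks have finished.
    approx = df.rdd.countApprox(timeout=1000, confidence=0.95)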

1 Answer


If you don't mind getting an approximate count, you could try sampling the dataset first and then scaling by your sampling factor:

>>> df = spark.range(10)
>>> df.sample(0.5).count()
4

In this case, you would scale the count() result by 2 (that is, by 1/0.5). Obviously, there is a statistical error with this approach.
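
For instance, a rough sketch of the scale-up step, assuming a hypothetical sampling fraction of 0.1 on the full DataFrame:

    fraction = 0.1
    estimated_rows = df.sample(fraction).count() / fraction   # scale by 1/fraction
    print(int(estimated_rows))                                # approximate total row count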

kamprath
  • I tried that, but the running time is still rather long, even though I am using a factor of 0.1. – Luigi Apr 10 '19 at 15:29
  • Is the data partitioned well? If not, you might not be leveraging all of your executors. For that matter, what is your partition to executor ratio? – kamprath Apr 10 '19 at 17:56
  • I don't understand what you mean. However, I use Google Colab to run the code, and I simply replaced the df.count() operation with df.sample(0.1).count() and reran the code. Is there anything else I need to set? – Luigi Apr 10 '19 at 22:09
  • To get the partition count for your DataFrame, call `df.rdd.getNumPartitions()`. If that value is 1, your data has not been parallelized and thus you aren't getting the benefit of multiple nodes or cores in your Spark cluster. If you do get a value greater than 1 (ideally, closer to 200), then the next thing to look at is the number of available executors your Spark cluster has. You can check this on the Spark status web page for your cluster. – kamprath Apr 11 '19 at 16:42
  • I am trying to set the number of partitions with the df.coalesce() method, but Colab doesn't generate more than four partitions. There is only one executor, and I don't know how to increase that on Google Colab. However, Colab uses a six-core processor. – Luigi Apr 11 '19 at 19:54
  • First, `df.coalesce()` is used to reduce partitions, not increase them. Use `repartition()` to increase partitions (see the sketch below). Second, the heart of your problem is that you have only one executor: with a single executor you get none of the benefits of parallelism, only the overhead. I suggest posting another question asking how to increase the executor count in Google Colab. – kamprath Apr 14 '19 at 22:40
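
Pulling the commands from this exchange together, a rough sketch (same hypothetical `df` as above; the target of 200 partitions follows the "ideally, closer to 200" comment, not a universal rule):

    print(df.rdd.getNumPartitions())   # 1 means the data is not parallelized

    df = df.repartition(200)           # repartition() can increase the partition count (full shuffle)
    # df.coalesce(4) would only ever reduce the partition count

    print(df.rdd.getNumPartitions())   # now 200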