I'm working with PySpark DataFrames and I would like a rough way to tell whether a DataFrame is larger than a given threshold.
I'm trying to use the countApprox() function:
df.rdd.countApprox(1000, 0.5)
But it seems that in PySpark the timeout is not honored. I've seen that in Scala/Java the function returns an object where you can check the "low" and "high" values, but in PySpark it returns only an integer. When the DataFrame is "big", countApprox() takes minutes, even with the timeout set to 1000 milliseconds.
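For context, here is roughly what I'm doing and how I'm timing it (a minimal sketch; the DataFrame below is just a placeholder, in my case the data comes from a much larger source):

    import time
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Placeholder DataFrame; the real one is read from a much larger source
    df = spark.range(0, 100000000).toDF("id")

    start = time.time()
    # timeout = 1000 ms, confidence = 0.5 -- I expected a rough answer back quickly
    approx = df.rdd.countApprox(1000, 0.5)
    print("countApprox returned %d after %.1f s" % (approx, time.time() - start))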
Does anyone know whether countApprox()
works differently in PySpark, or whether there is another function to get an approximation of the size of the DataFrame instead of the number of rows? I only need to know if a DataFrame is "very small" or "very big".
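As a possible workaround I've thought about checking the row count against a fixed cutoff with limit(), but I don't know if this is the right approach (the threshold is just an arbitrary number I picked):

    # Arbitrary cutoff between "very small" and "very big"
    THRESHOLD = 1000000

    def seems_big(df, threshold=THRESHOLD):
        # limit() caps how many rows get counted, so this avoids a full
        # count() on a huge DataFrame; it only tells me whether the count
        # exceeds the threshold, not the actual size.
        return df.limit(threshold + 1).count() > threshold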
Thanks.