I'm working with PySpark DataFrames and I would like a rough way to tell whether a DataFrame is larger than a given threshold.
I'm trying to use the countApprox() function:
df.rdd.countApprox(1000, 0.5)
But it seems that in PySpark the timeout is not honored. I've seen that in Scala/Java the function returns an object where you can check the "low" and "high" values, but in PySpark it returns only an integer. When the DataFrame is "big", countApprox() takes minutes, even with the timeout set to 1000 milliseconds.
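For context, here is roughly what I'm doing and how I'm timing it (a minimal sketch; the DataFrame below is just a placeholder, in my case the data comes from a much larger source):

    import time
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Placeholder DataFrame; the real one is read from a much larger source
    df = spark.range(0, 100000000).toDF("id")

    start = time.time()
    # timeout = 1000 ms, confidence = 0.5 -- I expected a rough answer back quickly
    approx = df.rdd.countApprox(1000, 0.5)
    print("countApprox returned %d after %.1f s" % (approx, time.time() - start))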
Does anyone know whether countApprox()
works differently in PySpark, or whether there is another function to get an approximation of the size of the DataFrame instead of the number of rows? I only need to know if a DataFrame is "very small" or "very big".
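As a possible workaround I've thought about checking the row count against a fixed cutoff with limit(), but I don't know if this is the right approach (the threshold is just an arbitrary number I picked):

    # Arbitrary cutoff between "very small" and "very big"
    THRESHOLD = 1000000

    def seems_big(df, threshold=THRESHOLD):
        # limit() caps how many rows get counted, so this avoids a full
        # count() on a huge DataFrame; it only tells me whether the count
        # exceeds the threshold, not the actual size.
        return df.limit(threshold + 1).count() > threshold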
Thanks.