I am running PySpark in a Jupyter Notebook. I am trying to study whether the estimate produced by the "approx_count_distinct" function can be manipulated, and for that I have created a Spark dataframe with 100,000 random numbers:
import random
from pyspark.sql.types import StructType, StructField, LongType
from pyspark.sql.functions import approx_count_distinct

C = 100000
data = random.sample(range(1, 10000000000), C)
schema = StructType([StructField('test', LongType(), True)])
# Spark dataframe with C distinct numbers
S = spark.createDataFrame([[x] for x in data], schema)
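For reference, the estimate on the full dataframe can be read the same way it is read inside the loop further down (just a quick sanity check, not part of the timing problem):
baseline = S.select(approx_count_distinct("test")).collect()[0][0]
print(baseline)  # close to C, within the default relative error of approx_count_distinct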
Then I created 3 more empty dataframes for the calculations that follow:
# Create an empty RDD used to build the empty dataframes
emptyRDD = spark.sparkContext.emptyRDD()
HLL = spark.createDataFrame(data = emptyRDD, schema = schema)
Y = spark.createDataFrame(data = emptyRDD, schema = schema)
V = spark.createDataFrame(data = emptyRDD, schema = schema)
The problem is that when I run the following loop, it takes forever and does not seem to end (I let it run for approximately 15 hours without a result):
data_itr = S.rdd.toLocalIterator()
for row in data_itr:
    # First, compute the approx_count_distinct estimate of the HLL dataframe
    # (0 in the first iteration because it is empty)
    HLL_before = HLL.select(approx_count_distinct("test")).collect()[0][0]
    # Then insert the new element into the HLL dataframe
    newRow = spark.createDataFrame([row["test"]], LongType())
    HLL = HLL.union(newRow)
    # Finally, compute the estimate again after the insertion
    HLL_after = HLL.select(approx_count_distinct("test")).collect()[0][0]
    if HLL_after == HLL_before:
        Y = Y.union(newRow)
I need to go through the rows one by one because I need to know whether the "approx_count_distinct" estimate changes after the insertion of each individual element.
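To make the intent concrete, this is essentially the per-element check I am trying to perform (a sketch only; estimate_changes is just a name I made up for this question):
def estimate_changes(hll_df, x):
    # Estimate before inserting x
    before = hll_df.select(approx_count_distinct("test")).collect()[0][0]
    # Insert x and estimate again
    updated = hll_df.union(spark.createDataFrame([[x]], schema))
    after = updated.select(approx_count_distinct("test")).collect()[0][0]
    return updated, after == before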
The code works for 10,000 rows, but when I increase it to 100,000 it does not finish. Does anyone know what could be going on? Why is the run time growing so fast?
Thank you!