I have an RDD with 20 partitions from importing from Postgres via JDBC. I have a cluster with 5 workers (5 cores). I am simply trying to count the number of elements in each partition according to:
def count_in_a_partition(idx, iterator):
count = 0
for _ in iterator:
count += 1
return idx, count
rdd.mapPartitionsWithIndex(count_in_a_partition).collect()
The code above keeps running forever, and the Web GUI shows that the workers are not being utilized at all, i.e. "0 Used". Even the Memory in use
shows 0.0 B Used
. It seems there is something wrong. You would expect at least one of the workers to be doing something. What can I possibly do to speed up the computations and utilize the cores?