
I have an RDD with 20 partitions, imported from Postgres via JDBC, and a cluster with 5 workers (5 cores). I am simply trying to count the number of elements in each partition using:

def count_in_a_partition(idx, iterator):
  # mapPartitionsWithIndex expects the function to return an iterator,
  # so yield one (partition index, element count) pair per partition
  count = 0
  for _ in iterator:
    count += 1
  yield idx, count

rdd.mapPartitionsWithIndex(count_in_a_partition).collect()

The code above keeps running forever, and the web UI shows that the workers are not being utilized at all: Cores shows "0 Used" and even Memory in use shows "0.0 B Used". Something seems wrong; you would expect at least one of the workers to be doing something. What can I do to speed up the computation and make use of the cores?


  • Hi, you should take a look at your Spark configuration. Setting spark.executor.instances to (number of cores - 1) works most of the time; you can also reduce the memory used by the executors (see the sketch after these comments). I've never seen the waiting state... Are you sure that Postgres is working correctly? – GwydionFR Sep 21 '16 at 12:23
  • I think you are correct about reducing the memory used by the executors. Thanks! – FullStack Sep 21 '16 at 16:47
  • Could you share your spark-submit command? – avrsanjay Sep 21 '16 at 18:13
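
A minimal sketch of the configuration the first comment suggests, assuming a standalone cluster like the one in this question (the instance count and memory value are illustrative, not from the original post; note that spark.executor.instances is primarily honored on YARN, while standalone deployments usually size executors via spark.cores.max and spark.executor.memory):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("jupyter-pyspark")                # app name and master match the answer below
    .master("spark://spark-master:7077")
    .config("spark.executor.instances", "4")   # number of cores - 1, per the comment
    .config("spark.executor.memory", "1g")     # reduced executor memory (illustrative value)
    .getOrCreate()
)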

1 Answer


I think the issue is that Memory per Node (20.0 GB) is larger than the memory actually available on each node (2.7 GB). Lowering the executor and driver memory helps:

from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("jupyter-pyspark")\
    .master("spark://spark-master:7077")\
    .config("spark.executor.memory", "2g")\
    .config("spark.driver.memory", "2g")\
    .getOrCreate()
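
With the executor memory capped below what each worker node actually has, the per-partition count from the question should then run and return one (index, count) pair per partition. A quick sanity check, using a parallelized range as a stand-in for the JDBC-backed RDD:

rdd = spark.sparkContext.parallelize(range(100), 20)  # stand-in for the Postgres/JDBC RDD
print(rdd.mapPartitionsWithIndex(count_in_a_partition).collect())
# e.g. [(0, 5), (1, 5), ..., (19, 5)]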