
I have an RDD with 20 partitions, imported from Postgres via JDBC, and a cluster with 5 workers (5 cores). I am simply trying to count the number of elements in each partition using:

def count_in_a_partition(idx, iterator):
  # mapPartitionsWithIndex expects the function to return an iterator,
  # so yield one (partition index, element count) pair per partition
  count = 0
  for _ in iterator:
    count += 1
  yield idx, count

rdd.mapPartitionsWithIndex(count_in_a_partition).collect()

The code above keeps running forever, and the web UI shows that the workers are not being utilized at all: Cores shows "0 Used" and even Memory in use shows "0.0 B Used". Something seems wrong; you would expect at least one of the workers to be doing something. What can I do to speed up the computation and make use of the cores?


  • Hi, you should take a look at your Spark configuration. Setting spark.executor.instances to (number of cores - 1) works most of the time; you can also reduce the memory used by the executors (see the sketch after these comments). I've never seen the waiting state... Are you sure that Postgres is working correctly? – GwydionFR Sep 21 '16 at 12:23
  • I think you are correct about reducing the memory used by the executors. Thanks! – FullStack Sep 21 '16 at 16:47
  • Could you share your spark-submit command? – avrsanjay Sep 21 '16 at 18:13
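
A minimal sketch of the configuration the first comment suggests, assuming a standalone cluster like the one in this question (the instance count and memory value are illustrative, not from the original post; note that spark.executor.instances is primarily honored on YARN, while standalone deployments usually size executors via spark.cores.max and spark.executor.memory):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("jupyter-pyspark")                # app name and master match the answer below
    .master("spark://spark-master:7077")
    .config("spark.executor.instances", "4")   # number of cores - 1, per the comment
    .config("spark.executor.memory", "1g")     # reduced executor memory (illustrative value)
    .getOrCreate()
)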

1 Answer


I think the issue is that Memory per Node (20.0 GB) is larger than the memory actually available on each node (2.7 GB). Lowering the executor and driver memory helps:

from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("jupyter-pyspark")\
    .master("spark://spark-master:7077")\
    .config("spark.executor.memory", "2g")\
    .config("spark.driver.memory", "2g")\
    .getOrCreate()
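
With the executor memory capped below what each worker node actually has, the per-partition count from the question should then run and return one (index, count) pair per partition. A quick sanity check, using a parallelized range as a stand-in for the JDBC-backed RDD:

rdd = spark.sparkContext.parallelize(range(100), 20)  # stand-in for the Postgres/JDBC RDD
print(rdd.mapPartitionsWithIndex(count_in_a_partition).collect())
# e.g. [(0, 5), (1, 5), ..., (19, 5)]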