
I am measuring memory usage for an application (WordCount) in Spark with ps -p WorkerPID -o rss. However, the results don't make sense: for every input size (1 MB, 10 MB, 100 MB, 1 GB, 10 GB) roughly the same amount of memory is used, and for the 1 GB and 10 GB inputs the measured value is even less than 1 GB. Is the Worker the wrong process for measuring memory usage? Which process in the Spark process model is responsible for memory allocation?

lary

1 Answer


Contrary to popular belief, Spark doesn't have to load all data into main memory. Moreover, WordCount is a trivial application, and the amount of memory it requires depends only marginally on the input:

  • the amount of data loaded per partition with SparkContext.textFile depends on configuration, not input size (see for example: Why does partition parameter of SparkContext.textFile not take effect?).
  • the size of the key-value pairs is roughly constant for typical input.
  • intermediate data can be spilled to disk if needed.
  • last but not least, the amount of memory used by executors is capped by configuration (see the sketch after this list).
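To make the last two points concrete, here is a minimal WordCount sketch in Scala. This is not the asker's actual code; the paths, the partition hint of 8, and the 1g executor memory setting are illustrative assumptions. The executor heap stays within spark.executor.memory no matter how large the input is, and reduceByKey can spill shuffle data to disk:

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("WordCount")
          .set("spark.executor.memory", "1g")  // hard cap on each executor JVM heap

        val sc = new SparkContext(conf)

        // minPartitions is only a lower bound; the actual split count follows
        // the Hadoop InputFormat (typically one partition per HDFS block),
        // so per-partition memory stays bounded regardless of total input size.
        val counts = sc.textFile("hdfs:///input/text", 8)
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1L))
          .reduceByKey(_ + _)  // shuffle data can spill to disk under memory pressure

        counts.saveAsTextFile("hdfs:///output/counts")
        sc.stop()
      }
    }

With settings like these, ps reporting a roughly constant RSS well below the input size is exactly the expected outcome.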

Keeping all of that in mind, behavior different from what you see would be troubling at best.

zero323
  • Thank you for your answer. Could you recommend some additional information (a link or literature) about Spark's memory management? – lary Feb 10 '16 at 15:56