3

Now I have to create a parallelized collection using sc.parallelize() in pyspark (Spark 2.1.0).

The collection in my driver program is big. when I parallelize it, I found it takes up a lot of memory in master node.

It seems that the collection is still being kept in spark's memory of the master node after I parallelize it to each worker node. Here's an example of my code:

# my python code
sc = SparkContext()
a = [1.0] * 1000000000
rdd_a = sc.parallelize(a, 1000000)
sum = rdd_a.reduce(lambda x, y: x+y)

I've tried

del a

to destroy it, but it didn't work. The spark which is a java process is still using a lot of memory.

After I create rdd_a, how can I destroy a to free the master node's memory?

Thanks!

Jacek Laskowski
  • 72,696
  • 27
  • 242
  • 420
PengNi
  • 33
  • 4

3 Answers3

2

The job of the master is to coordinate the workers and to give a worker a new task once it has completed its current task. In order to do that, the master needs to keep track of all of the tasks that need to be done for a given calculation.

Now, if the input were a file, the task would simply look like "read file F from X to Y". But because the input was in memory to begin with, the task looks like 1,000 numbers. And given the master needs to keep track of all 1,000,000 tasks, that gets quite large.

Joe C
  • 15,324
  • 8
  • 38
  • 50
2

The collection in my driver program is big. when I parallelize it, I found it takes up a lot of memory in master node.

That's how it supposed to be and that's why SparkContext.parallelize is only meant for demos and learning purposes, i.e. for quite small datasets.

Quoting the scaladoc of parallelize

parallelize[T](seq: Seq[T], numSlices: Int = defaultParallelism): RDD[T] Distribute a local Scala collection to form an RDD.

Note "a local Scala collection" that means that the collection you want to map to a RDD (or create a RDD from) is already in the memory of the driver.

In your case, a is a local Python variable and Spark knows nothing about it. What happens when you use parallelize is that the local variable (that's already in the memory) is wrapped in this nice data abstraction called RDD. It's simply a wrapper around the data that's already in memory on the driver. Spark can't do much about that. It's simply too late. But Spark plays nicely and pretends the data is as distributed as other datasets you could have processed using Spark.

That's why parallelize is only meant for small datasets to play around (and mainly for demos).

Jacek Laskowski
  • 72,696
  • 27
  • 242
  • 420
  • It might be surprising but this is not what happens here. In fact data is serialized and dumped to disk and at this point `a` could be safely removed. The problem is that JVM backend eagerly loads data from disk. – zero323 Sep 18 '17 at 18:45
1

Just like Jacek's answer, parallelize is only demo for small dataset, you can access all variables defined in driver within parallelize block.

Azuryy Yu
  • 11
  • 2
  • As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Nov 03 '21 at 05:04
  • This does not provide an answer to the question. Once you have sufficient [reputation](https://stackoverflow.com/help/whats-reputation) you will be able to [comment on any post](https://stackoverflow.com/help/privileges/comment); instead, [provide answers that don't require clarification from the asker](https://meta.stackexchange.com/questions/214173/why-do-i-need-50-reputation-to-comment-what-can-i-do-instead). - [From Review](/review/late-answers/30244764) – Emi OB Nov 03 '21 at 07:52