3

I have an RDD like this:

val graphInfo: RDD[(Long, Int, Long, Long, Iterable[Long])]

Each node is represented by a Long integer and stored in the Iterable[Long] of graphInfo. How many elements can be contained in that Iterable? What are the limits, if any, on the size of an individual RDD record?

user1803467
  • The length of `Iterable` is not bounded. It may be infinite. – Chris Martin Mar 09 '16 at 07:33
  • I'm just not sure whether too many elements in an Iterable of an RDD will make Spark crash. – user1803467 Mar 09 '16 at 07:39
  • 1
    That's a different question, about Spark, not Scala. I doubt there's any fixed limit, but eventually you'd run out of memory on a node. Really big data aggregates should be RDDs themselves, not a single entry. What's the use-case? – The Archetypal Paul Mar 09 '16 at 08:34
  • We want to hierarchically cluster a huge graph. In each step we need to store the nodes of each cluster for the next partitioning. Each cluster has an entry in the RDD, and that entry contains all of its nodes in an Iterable[Long]. – user1803467 Mar 09 '16 at 09:51
  • Well that's the wrong way to model your graph, then. You'll gain nothing from using Spark if your RDD only contains a few rows, each really large. – The Archetypal Paul Mar 10 '16 at 14:09
  • Thanks a lot. We are going to change how the node-to-group assignment is stored to an RDD where each row contains only a group Id and a node Id. In each clustering step we can get the nodes of a group with a join operation (sketched below). – user1803467 Mar 11 '16 at 03:21
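
A minimal sketch of the layout described in the last comment, assuming made-up names (membership, groupInfo) and invented sample data: each node becomes a small (groupId, nodeId) row, and the nodes of a group are recovered with a join.

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object GroupedNodesSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("grouped-nodes").setMaster("local[*]"))

    // One small row per (groupId, nodeId) instead of one huge Iterable[Long] per group.
    val membership: RDD[(Long, Long)] = sc.parallelize(Seq(
      (1L, 10L), (1L, 11L), (1L, 12L),
      (2L, 20L), (2L, 21L)
    ))

    // Hypothetical per-group data, e.g. the cluster assignment from the previous step.
    val groupInfo: RDD[(Long, Int)] = sc.parallelize(Seq((1L, 0), (2L, 1)))

    // A join recovers each group's nodes together with the group's data,
    // while every individual record stays small.
    val nodesWithInfo: RDD[(Long, (Long, Int))] = membership.join(groupInfo)

    nodesWithInfo.collect().foreach(println)
    sc.stop()
  }
}

Because every row is tiny, Spark is free to spread the data over as many partitions as needed, which is the point made in the answers below.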

2 Answers

2

As already suggested, there's no limit to the number of elements.

There might, however, be a limit on the amount of memory used by a single RDD record: Spark limits the maximum partition size to 2GB (see SPARK-6235). Each partition is a collection of records, so theoretically the upper limit for a record is 2GB (reached when a partition contains a single record).

In practice, records exceeding a few megabytes are discouraged, because the limit mentioned above will probably force you to artificially increase the number of partitions beyond what would otherwise be the optimum. All of Spark's optimization considerations aim for handling as many records as you wish (given sufficient resources), not for handling records as large as you wish.
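
A hedged sketch of that advice applied to the RDD from the question: the helper name explodeNodes is made up, and it assumes the first Long field is the cluster/group id. It explodes each wide record into one small row per node with flatMap, so no single record grows toward the partition-size cap mentioned above.

import org.apache.spark.rdd.RDD

// Hypothetical helper, not part of the original answer: turn each wide
// graphInfo record into one small (groupId, nodeId) row per node.
def explodeNodes(
    graphInfo: RDD[(Long, Int, Long, Long, Iterable[Long])]): RDD[(Long, Long)] =
  graphInfo.flatMap { case (groupId, _, _, _, nodes) =>
    nodes.map(nodeId => (groupId, nodeId))
  }

The resulting RDD can then be repartitioned or joined like any other pair RDD, instead of carrying one very large Iterable inside a single record.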

Tzach Zohar
1

How many elements can be contained in that Iterable

An Iterable can contain a potentially infinite number of elements. If the iterable is backed by a streaming source, for example, you will keep receiving elements for as long as that source is available.

I'm just not sure whether too many elements in an Iterable of an RDD will make Spark crash

That depends, again, on how you populate the Iterable. If your Spark job has sufficient memory, you should be fine. The best way to find out is simply by trial and error, and by understanding Spark's limits on the memory an RDD record can use.
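
For the trial-and-error part, one rough aid is Spark's SizeEstimator utility (a developer API; it returns an estimate, not a guarantee, and the sample record below is invented purely for illustration):

import org.apache.spark.util.SizeEstimator

// Rough JVM-level estimate of how much memory a single record occupies.
// The record mirrors the question's tuple shape, with one million node ids.
val sampleRecord: (Long, Int, Long, Long, Iterable[Long]) =
  (1L, 0, 2L, 3L, (1L to 1000000L).toVector)

val approxBytes: Long = SizeEstimator.estimate(sampleRecord)
println(s"Approximate in-memory size: $approxBytes bytes")

Running this in spark-shell on a record of realistic size gives a quick feel for whether a single record is drifting toward the per-partition limit mentioned in the other answer.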

Yuval Itzchakov