
I am planning to use Spark to process data where each individual element/row in the RDD or DataFrame may occasionally be large (up to several GB).

The data will probably be stored in Avro files in HDFS.

Obviously, each executor must have enough RAM to hold one of these "fat rows" in memory, and some to spare.
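For illustration, a minimal sketch of how the executors might be sized is below; the memory figures and the use of `spark.executor.memoryOverhead` are placeholder assumptions, not measured requirements.

```scala
import org.apache.spark.sql.SparkSession

object FatRowSession {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("fat-rows")
      // Heap large enough to hold one multi-GB row plus working space (placeholder figure).
      .config("spark.executor.memory", "16g")
      // Extra headroom for serialisation/deserialisation buffers (placeholder figure).
      .config("spark.executor.memoryOverhead", "4g")
      .getOrCreate()

    // ... read and process the Avro data here ...

    spark.stop()
  }
}
```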

But are there other limitations on row size for Spark/HDFS or for the common serialisation formats (Avro, Parquet, Sequence File...)? For example, can individual entries/rows in these formats be much larger than the HDFS block size?

I am aware of published limitations for HBase and Cassandra, but not Spark...

DNA

1 Answer


There are currently some fundamental limitations related to Spark's internal block size, both for partitions in use and for shuffle blocks: each is limited to 2GB, which is the maximum size of a ByteBuffer (a buffer is indexed by an int, so it can hold at most Integer.MAX_VALUE bytes).
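To make the int-index point concrete, here is a tiny standalone sketch in plain Scala (no Spark required):

```scala
import java.nio.ByteBuffer

object ByteBufferLimit {
  def main(args: Array[String]): Unit = {
    // Capacities and offsets are Java ints, so a single ByteBuffer tops out at
    // Int.MaxValue (Integer.MAX_VALUE) bytes, i.e. 2^31 - 1, just under 2GB.
    println(s"Maximum ByteBuffer capacity: ${Int.MaxValue} bytes")

    // ByteBuffer.allocate takes an Int, so a 3GB buffer cannot even be requested:
    // ByteBuffer.allocate(3L * 1024 * 1024 * 1024)  // would not compile (Long argument)

    val buf = ByteBuffer.allocate(16) // a small, valid allocation for illustration
    println(s"Allocated ${buf.capacity()} bytes")
  }
}
```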

The maximum size of an individual row will normally need to be much smaller than the maximum block size, because each partition will normally contain many rows, and the largest rows may not be evenly distributed among partitions. If, by chance, a partition receives an unusually large share of big rows, it can exceed the 2GB limit and crash the job.
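One common way to stay under that limit is simply to spread the data over more, smaller partitions. Below is a minimal sketch; the HDFS path, the partition count of 2000, and the use of Spark's Avro data source are all illustrative assumptions:

```scala
import org.apache.spark.sql.SparkSession

object FatRowRepartition {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("fat-row-repartition")
      .getOrCreate()

    // Hypothetical HDFS path; needs the Avro data source (built into Spark 2.4+
    // as format "avro", or the external spark-avro package on older versions).
    val df = spark.read.format("avro").load("hdfs:///data/fat_rows")

    // Spread the rows over many partitions so that the bytes per partition
    // (and per shuffle block) stay comfortably below the 2GB limit, even if
    // several multi-GB rows happen to land in the same partition.
    val repartitioned = df.repartition(2000) // partition count is a placeholder

    println(s"Partitions: ${repartitioned.rdd.getNumPartitions}")

    spark.stop()
  }
}
```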

See the related Spark Jira tickets tracking these 2GB limitations.

DNA