I am planning to use Spark to process data where each individual element/row in the RDD or DataFrame may occasionally be large (up to several GB).
The data will probably be stored in Avro files in HDFS.
Obviously, each executor must have enough RAM to hold one of these "fat rows" in memory, and some to spare.
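For concreteness, here is a minimal sketch of the kind of job I have in mind (the HDFS path, the `payload` column name, and the memory setting are placeholders, and I'm assuming the built-in spark-avro reader):

```scala
// Submit with something like:
//   spark-submit --packages org.apache.spark:spark-avro_2.12:3.5.0 \
//                --conf spark.executor.memory=16g ...
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, length}

object FatRowJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("fat-row-processing")
      .getOrCreate()

    // Read Avro files from HDFS; each record may carry a multi-GB binary payload.
    val df = spark.read
      .format("avro")
      .load("hdfs:///data/fat_rows/*.avro") // hypothetical path

    // Example transformation: report the payload size per record,
    // assuming a binary column named "payload" (hypothetical schema).
    val sizes = df.select(col("id"), length(col("payload")).as("payload_bytes"))

    sizes.show(10, truncate = false)
    spark.stop()
  }
}
```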
But are there other limitations on row size for Spark/HDFS or for the common serialisation formats (Avro, Parquet, SequenceFile...)? For example, can individual entries/rows in these formats be much larger than the HDFS block size?
I am aware of published limitations for HBase and Cassandra, but not for Spark...