Is there any relationship between the number of elements an RDD contains and its ideal number of partitions?
I have an RDD with thousands of partitions (it is loaded from a source composed of many small files; that's a constraint I can't change, so I have to deal with it). I would like to repartition it (or use the coalesce method), but I don't know in advance how many elements the RDD will contain.
So I would like to do it in an automated way, something like:
val numberOfElements = rdd.count()      // Long
val magicNumber = 100000L               // target elements per partition
// coalesce takes an Int and returns a new RDD; guard against 0 partitions
val repartitioned = rdd.coalesce(math.max(1, (numberOfElements / magicNumber).toInt))
Is there any rule of thumb for the optimal number of partitions of an RDD relative to its number of elements?
Thanks.