Is there any significant difference in data partitioning when working with Hadoop/MapReduce and Spark?
Spark supports all Hadoop I/O formats because it uses the same Hadoop InputFormat APIs along with its own formatters, so by default Spark input partitions work the same way as Hadoop/MapReduce input splits. The data size of a partition is configurable at run time, and transformations like repartition, coalesce, and repartitionAndSortWithinPartitions give you direct control over the number of partitions being computed.
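A minimal Scala sketch of these transformations, assuming a local SparkSession and a small in-memory dataset standing in for real input:

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

object PartitionControl {
  def main(args: Array[String]): Unit = {
    // Local session for illustration; any existing SparkContext behaves the same way.
    val spark = SparkSession.builder().appName("partition-control").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    // Small in-memory RDD standing in for data loaded via the Hadoop InputFormat path.
    val rdd = sc.parallelize(1 to 1000000)
    println(s"initial partitions: ${rdd.getNumPartitions}")

    // repartition: full shuffle to the requested number of partitions.
    val wider = rdd.repartition(16)
    println(s"after repartition(16): ${wider.getNumPartitions}")

    // coalesce: narrows the partition count without a full shuffle.
    val narrower = wider.coalesce(4)
    println(s"after coalesce(4): ${narrower.getNumPartitions}")

    // repartitionAndSortWithinPartitions: available on key/value RDDs; partitions by key
    // and sorts records inside each partition in a single pass.
    val byKey = rdd.map(x => (x % 8, x))
    val sorted = byKey.repartitionAndSortWithinPartitions(new HashPartitioner(8))
    println(s"after repartitionAndSortWithinPartitions: ${sorted.getNumPartitions}")

    spark.stop()
  }
}
```

coalesce avoids a full shuffle, which is why it is the usual choice when you only need to reduce the partition count.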
Are there any cases where their data-partitioning procedures can differ?
Apart from the Hadoop I/O APIs, Spark has some other intelligent I/O formats (e.g. the Databricks CSV reader and NoSQL DB connectors) that directly return a Dataset/DataFrame (higher-level abstractions on top of RDD), which are Spark specific.
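For example, a sketch of such DataFrame-returning readers: the built-in CSV reader (the old Databricks spark-csv package was folded into Spark 2.x as format "csv") and, assuming the spark-cassandra-connector is on the classpath, a Cassandra read. The file path, keyspace, and table names are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object DataFrameReaders {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("df-readers").master("local[*]").getOrCreate()

    // Built-in CSV reader: returns a DataFrame directly, no RDD[String] parsing step.
    val csvDf = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("data/people.csv")                                    // hypothetical path
    println(s"CSV partitions: ${csvDf.rdd.getNumPartitions}")

    // NoSQL connectors work the same way: the connector decides how the source is split
    // into partitions. Format name and options follow the spark-cassandra-connector.
    val cassandraDf = spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "shop", "table" -> "orders"))   // hypothetical keyspace/table
      .load()
    println(s"Cassandra partitions: ${cassandraDf.rdd.getNumPartitions}")

    spark.stop()
  }
}
```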
Key points on Spark partitions when reading data from non-Hadoop sources:

- The maximum size of a partition is ultimately determined by the connector (a configuration sketch follows this list):
  - for S3, the property is fs.s3n.block.size or fs.s3.block.size;
  - the Cassandra property is spark.cassandra.input.split.size_in_mb;
  - the MongoDB property is spark.mongodb.input.partitionerOptions.partitionSizeMB.
- By default, the number of partitions is max(sc.defaultParallelism, total_data_size / data_block_size).
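A rough sketch of where these knobs are typically set, assuming the relevant connectors are on the classpath; the 64 MB values and the 10 GB data size below are placeholders, not recommendations:

```scala
import org.apache.spark.sql.SparkSession

object PartitionSizing {
  def main(args: Array[String]): Unit = {
    // Connector-specific partition-size knobs are ordinary Spark properties.
    val spark = SparkSession.builder()
      .appName("partition-sizing")
      .master("local[*]")
      // Cassandra: target size (in MB) of each input split/partition.
      .config("spark.cassandra.input.split.size_in_mb", "64")
      // MongoDB: partition size (in MB) used by the partitioner.
      .config("spark.mongodb.input.partitionerOptions.partitionSizeMB", "64")
      .getOrCreate()

    // The S3 block size is a Hadoop-level property, so it goes on the Hadoop configuration.
    spark.sparkContext.hadoopConfiguration.set("fs.s3n.block.size", (64 * 1024 * 1024).toString)

    // Illustrative calculation of the default rule quoted above:
    // numPartitions ~ max(sc.defaultParallelism, total_data_size / data_block_size)
    val totalDataSize = 10L * 1024 * 1024 * 1024   // hypothetical 10 GB source
    val dataBlockSize = 64L * 1024 * 1024          // 64 MB splits
    val expected = math.max(spark.sparkContext.defaultParallelism.toLong,
                            totalDataSize / dataBlockSize)
    println(s"expected partitions ~ $expected")

    spark.stop()
  }
}
```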
Sometimes the number of available cores in the cluster also influences the number of partitions, e.g. sc.parallelize() without an explicit partition count.
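For instance, with a local[4] master (assumed here so the numbers are predictable), sc.parallelize() without an explicit count falls back to sc.defaultParallelism, which in local mode equals the core count:

```scala
import org.apache.spark.sql.SparkSession

object DefaultParallelism {
  def main(args: Array[String]): Unit = {
    // local[4] pins the core count so the effect is visible and reproducible.
    val spark = SparkSession.builder().appName("default-parallelism").master("local[4]").getOrCreate()
    val sc = spark.sparkContext

    // No explicit partition count: parallelize uses sc.defaultParallelism (4 here).
    val noHint = sc.parallelize(1 to 100)
    println(s"defaultParallelism = ${sc.defaultParallelism}")            // 4
    println(s"partitions without hint = ${noHint.getNumPartitions}")     // 4

    // An explicit numSlices argument overrides the default.
    val withHint = sc.parallelize(1 to 100, 10)
    println(s"partitions with explicit count = ${withHint.getNumPartitions}")  // 10

    spark.stop()
  }
}
```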
Read more: link1