Given that the HashPartitioner docs say:
[HashPartitioner] implements hash-based partitioning using Java's Object.hashCode.
Say I want to partition DeviceData
by its kind
.
case class DeviceData(kind: String, time: Long, data: String)
Would it be correct to partition an RDD[DeviceData]
by overwriting the deviceData.hashCode()
method and use only the hashcode of kind
?
But given that HashPartitioner
takes a number of partitions parameter I am confused as to whether I need to know the number of kinds in advance and what happens if there are more kinds than partitions?
Is it correct that if I write partitioned data to disk it will stay partitioned when read?
My goal is to call
deviceDataRdd.foreachPartition(d: Iterator[DeviceData] => ...)
And have only DeviceData
's of the same kind
value in the iterator.