I am using Spark's Python API and running Spark 0.8.
I am storing a large RDD of floating-point vectors, and I need to perform calculations of a single vector against the entire set.
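Roughly, the computation looks like this (a simplified sketch; the file name, the vector format, and the dot-product calculation are stand-ins for my real job):

from pyspark import SparkContext

sc = SparkContext("local", "vector-demo")

# Hypothetical query vector; the real one comes from elsewhere.
query = sc.broadcast([0.1, 0.2, 0.3])  # shipped once per worker

def parse(line):
    # Assumes one whitespace-separated float vector per line.
    return [float(x) for x in line.split()]

def score(v):
    # Stand-in calculation: dot product against the query vector.
    return sum(a * b for a, b in zip(v, query.value))

rdd = sc.textFile("demo.txt", 100)  # 100 slices -> 100 tasks
scores = rdd.map(parse).map(score).collect()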
Is there any difference between slices and partitions in an RDD?
When I create the RDD, I pass 100 as a parameter, which causes it to store the RDD as 100 slices and to create 100 tasks when performing the calculations. I want to know whether partitioning the data would improve performance beyond this slicing, by enabling the system to process the data more efficiently (i.e., is there a difference between performing operations over a partition versus operating over every element of the sliced RDD?).
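For what it's worth, this is how I check how many slices/partitions an RDD actually ends up with (a quick sketch; glom turns each partition into a single list, so collecting the per-partition lengths only moves one integer per partition to the driver):

sizes = rdd.glom().map(len).collect()  # one element count per partition
print(len(sizes))   # number of partitions; I expect 100
print(sizes[:10])   # how evenly the first few partitions are filled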
For example, is there any significant difference between these two pieces of code?
rdd = sc.textFile("demo.txt", 100)
vs
rdd = sc.textFile("demo.txt")
rdd = rdd.partitionBy(100)
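(As far as I understand, partitionBy in PySpark is defined on key-value RDDs and returns a new RDD rather than repartitioning in place, so the second variant would actually need to be something like the following; the hash key is just my guess at a way to get pairs:)

pairs = sc.textFile("demo.txt").map(lambda line: (hash(line), line))
rdd = pairs.partitionBy(100)  # returns a new RDD partitioned by key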