Let's say I have a cluster of 4 nodes, each having 1 core. I have a 600 Petabyte file which I want to process through Spark. The file could be stored in HDFS.
I think the way to determine the number of partitions is file size / total number of cores in the cluster. If that is indeed the case, I will have 4 partitions (600 / 4), so each partition will be 150 PB in size.
But I think 150 PB is far too big for a single partition, so is my thinking correct about how to deduce the number of partitions?
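For reference, here is a minimal sketch (the HDFS path and app name are placeholders I made up) of how I would print the number of partitions Spark actually creates when reading the file:

```scala
import org.apache.spark.sql.SparkSession

object PartitionCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("PartitionCheck")          // placeholder app name
      .getOrCreate()

    // Read the file from HDFS as an RDD of lines (placeholder path).
    val rdd = spark.sparkContext.textFile("hdfs:///data/bigfile.txt")

    // Print how many partitions Spark actually created for this input.
    println(s"Number of partitions: ${rdd.getNumPartitions}")

    spark.stop()
  }
}
```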
PS: I have just started with Apache Spark, so apologies if this is a naive question.