I need to store a large file of about 10 TB on HDFS. What I need to understand is how HDFS will store this file. Say the replication factor for the cluster is 3 and I have a 10-node cluster with over 10 TB of disk space on each node, i.e. the total cluster capacity is over 100 TB.
Now, does HDFS choose three nodes at random and store the whole file on each of those three nodes? If so, it really is as simple as it sounds. Can someone confirm?
Or does HDFS split the file, say into 10 splits of 1 TB each, and then store each split on 3 nodes chosen at random? So is splitting possible, and if so, is it controlled by a configuration setting? And if HDFS has to split a binary or text file, how does it split it? Simply by bytes?
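In case it helps, here is a sketch of the hdfs-site.xml settings I believe are relevant to this (dfs.blocksize and dfs.replication, with values I am assuming as an example; please correct me if splitting is controlled elsewhere):

```xml
<!-- hdfs-site.xml: sketch of the settings I think are relevant -->
<configuration>
  <!-- dfs.blocksize: size HDFS uses when splitting a file into blocks
       (134217728 bytes = 128 MB, which I understand is a common default) -->
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>
  </property>
  <!-- dfs.replication: number of copies kept of each block -->
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```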