
I need to store a large file of about 10 TB on HDFS. What I need to understand is how HDFS will store this file. Say the replication factor for the cluster is 3 and I have a 10-node cluster with over 10 TB of disk space on each node, i.e. the total cluster capacity is over 100 TB.

Now, does HDFS choose three nodes at random and store the file on those three nodes? Is it really as simple as that? Please confirm.

Or does HDFS split the file, say into 10 splits of 1 TB each, and then store each split on 3 nodes chosen at random? Is splitting possible, and if so, is it a configuration setting through which it is enabled? And if HDFS has to split a binary or text file, how does it split it? Simply by bytes?

samshers
    Unless the format you're going to use is splittable, this is a bad idea. From HDFS's perspective it doesn't matter, but for MapReduce, if it isn't splittable, only one mapper will be able to process said file. – Binary Nerd Nov 14 '16 at 18:42

1 Answer


Yes, HDFS splits the file into blocks (128 MB each by default). Every block is stored on 3 nodes, chosen by HDFS's rack-aware placement policy (effectively random from the client's point of view). As a result you'll have 30 TB of data distributed roughly evenly over your 10 nodes.
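For illustration, here is a minimal Java sketch of writing a file through the standard Hadoop FileSystem API. The dfs.blocksize and dfs.replication keys are the real client-side settings that control block size and replica count; the 256 MB value and the path /data/bigfile.bin are made-up examples:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class WriteWithBlockSize {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Per-file overrides of the cluster defaults set in hdfs-site.xml.
            conf.setLong("dfs.blocksize", 256L * 1024 * 1024); // 256 MB blocks (default is 128 MB)
            conf.setInt("dfs.replication", 3);                 // replicas per block

            FileSystem fs = FileSystem.get(conf);
            // The HDFS client chops the stream into dfs.blocksize-sized blocks
            // as it writes; the NameNode picks the DataNodes for each replica.
            try (FSDataOutputStream out = fs.create(new Path("/data/bigfile.bin"))) {
                out.writeBytes("example payload");
            }
        }
    }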

facha
    can you add more detail about how the split is done - byte by byte or by some other mechanism? Why I need to know this: if I write a MapReduce program, how does Hadoop know what data is located on which node, and so on about data locality? – samshers Nov 14 '16 at 16:35
    The NameNode manages the metadata about all the different blocks a file has been split into: where each block is (on which DataNode) and where the replicas are. Block size and replication factor can be configured. Splitting the file is done by the client that you use to write the file to HDFS. If a single line is greater than the block size, the line will still be split and placed in two blocks. See this link, in which it is explained in great detail: http://stackoverflow.com/questions/14291170/how-does-hadoop-process-records-split-across-block-boundaries – Gopi Kolla Nov 15 '16 at 01:10
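To make the locality point concrete: the NameNode's block-to-DataNode mapping is exposed to clients through the public FileSystem#getFileBlockLocations method, which is the same information MapReduce's input-split machinery consults when scheduling map tasks near the data. A small sketch (the file path is hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockLocations {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/data/bigfile.bin"));

            // Ask the NameNode for the block metadata of the whole file.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());

            for (BlockLocation block : blocks) {
                // Each entry is one block: its byte range within the file and
                // the DataNodes currently holding its replicas.
                System.out.printf("offset=%d length=%d hosts=%s%n",
                        block.getOffset(), block.getLength(),
                        String.join(",", block.getHosts()));
            }
        }
    }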