2

I have some questions about blocks in Hadoop. I read that Hadoop uses HDFS, which creates blocks of a specific size.

First question: Do the blocks physically exist on the hard disk of the normal host file system (e.g. NTFS), i.e. can we see the blocks on the hosting filesystem (NTFS), or can they only be seen using the Hadoop commands?

Second question: Does Hadoop create the blocks before running the tasks, i.e. do the blocks exist from the beginning whenever there is a file, or does Hadoop create the blocks only when running the task?

Third question: Are the blocks determined and created before splitting (i.e. the getSplits method of the InputFormat class), regardless of the number of splits, or afterwards, depending on the splits?

Fourth question: Are the blocks the same before and after running the task, or does it depend on the configuration? And are there two types of blocks, one for storing the files and one for grouping the files and sending them over the network to data nodes for executing the task?

Mosab Shaheen

2 Answers

1

Do the blocks physically exist on the hard disk of the normal host file system (e.g. NTFS), i.e. can we see the blocks on the hosting filesystem (NTFS), or can they only be seen using the Hadoop commands?

Yes, the blocks exist physically on disk across the datanodes in your cluster. I suppose you could "see" them if you were on one of the datanodes and you really wanted to, but it would likely not be illuminating. It would only be an arbitrary 128 MB (or whatever dfs.block.size is set to in hdfs-site.xml) fragment of the file with no meaningful filename. The hdfs dfs commands enable you to treat HDFS as a "real" filesystem.
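
If you want to see where a file's blocks actually live without logging into a datanode, the HDFS client API exposes the block locations. Here is a minimal sketch (the class name and file path are made up for illustration, and it assumes a Hadoop 2.x client with the cluster's core-site.xml/hdfs-site.xml on the classpath):

```java
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS etc. from the config files on the classpath
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/hadoop/input/data.txt"); // hypothetical HDFS path

        FileStatus status = fs.getFileStatus(file);
        // One BlockLocation per block, listing the datanodes that hold replicas
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(), Arrays.toString(block.getHosts()));
        }
    }
}
```

Each entry corresponds to a physical block file sitting under the datanode's dfs.datanode.data.dir on its local filesystem.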

Does Hadoop create the blocks before running the tasks, i.e. do the blocks exist from the beginning whenever there is a file, or does Hadoop create the blocks only when running the task?

Hadoop takes care of splitting the file into blocks and distributing them among the datanodes when you put a file in HDFS (through whatever method applies to your situation).
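
As a small illustration (the paths are hypothetical), the blocks come into existence during the write itself, not at processing time:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutFile {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // As the client streams the file in, HDFS cuts it into dfs.blocksize-sized
        // blocks and the NameNode assigns each block to a set of datanodes.
        fs.copyFromLocalFile(new Path("/tmp/local-data.txt"),          // hypothetical local file
                             new Path("/user/hadoop/input/data.txt")); // hypothetical HDFS path
    }
}
```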

Will the blocks be determined and created before splitting (i.e. the getSplits method of the InputFormat class), regardless of the number of splits, or afterwards, depending on the splits?

Not entirely sure what you mean, but the blocks exist before, and irrespective of, any processing you do with them.

Are the blocks the same before and after running the task, or does it depend on the configuration? And are there two types of blocks, one for storing the files and one for grouping the files and sending them over the network to data nodes for executing the task?

Again, blocks in HDFS are determined before any processing is done, if any is done at all. HDFS is simply a way to store a large file in a distributed fashion. When you do processing, for example with a MapReduce job, Hadoop will write intermediate results to disk. This is not related to the blocking of the raw file in HDFS.
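
To make the physical/logical distinction concrete, here is a rough sketch of what split computation looks like with the new org.apache.hadoop.mapreduce API (the class name and input path are made up). getSplits only asks the NameNode where the blocks are and returns (path, offset, length, hosts) descriptions; it does not touch or rewrite anything on the datanodes:

```java
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class ShowSplits {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration());
        FileInputFormat.addInputPath(job, new Path("/user/hadoop/input")); // hypothetical input dir

        // This is the read-only step the client performs at job-submission time.
        List<InputSplit> splits = new TextInputFormat().getSplits(job);
        for (InputSplit s : splits) {
            FileSplit split = (FileSplit) s;
            System.out.printf("%s offset=%d length=%d hosts=%s%n",
                    split.getPath(), split.getStart(), split.getLength(),
                    Arrays.toString(split.getLocations()));
        }
    }
}
```

The hosts reported for each split are the datanodes holding the underlying block replicas; the scheduler uses them to place each map task near its data, which is the locality point discussed in the comments below.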

user4601931
  • Great answer, thanks. Regarding my third question: when you execute a MapReduce job there is an InputFormat class which provides the splits (like FileSplit) and the RecordReader. So my question was: do the blocks that already exist on the datanodes (which is what you explained) get changed in any way (in size, or new blocks generated) after the splitting, or is the splitting only about logically (not physically) describing the portion of data for each mapper? – Mosab Shaheen Nov 27 '16 at 18:19
  • If you could also answer my question at http://stackoverflow.com/questions/40829509/about-hadoop-directory-as-input-combinefilesplit-number-of-mappers-datanodes I would be grateful. – Mosab Shaheen Nov 27 '16 at 18:23
  • @MosabShaheen I see what you mean now. I think this depends upon a lot of things; see [this answer](http://stackoverflow.com/a/17856292/4601931) for a good explanation. As for your other SO post, I'm not super familiar with the Java MapReduce API, so I don't think I can be of help. Sorry. – user4601931 Nov 27 '16 at 18:32
  • Thanks. I think the splits don't change the blocks; rather, they specify where each mapper should run on the cluster for data locality or rack awareness. The DataNodes are already running and already hold the blocks, so the splits, I think, tell Hadoop to run the mapper next to the node that contains the data. If you notice, inside the split there is a location parameter, which I think is there for this purpose. – Mosab Shaheen Nov 27 '16 at 19:59
  • @MosabShaheen That is wonderful. Thank you for notifying me. – user4601931 Dec 02 '16 at 18:57
1

1. Do the blocks physically exist on the hard disk of the normal host file system (e.g. NTFS), i.e. can we see the blocks on the hosting filesystem (NTFS), or can they only be seen using the Hadoop commands?

Yes. The blocks exist physically. You can use commands like `hadoop fsck /path/to/file -files -blocks` to view them.

Refer to the SE question below for commands to view blocks:

Viewing the number of blocks for a file in hadoop

2. Does Hadoop create the blocks before running the tasks, i.e. do the blocks exist from the beginning whenever there is a file, or does Hadoop create the blocks only when running the task?

Hadoop = distributed storage (HDFS) + distributed processing (MapReduce & YARN).

A MapReduce job works on input splits, and the input splits are created from the data blocks stored on the datanodes. Data blocks are created during the write operation of a file. If you are running a job on existing files, the data blocks are already there before the job, and the InputSplits are computed at job-submission time, before the map tasks run. You can think of a data block as a physical entity and an InputSplit as a logical entity. A MapReduce job does not change the input data blocks. The reducers generate their output data as new data blocks.

Each mapper processes an input split and emits its output to the reducers.

3. Will the blocks be determined and created before splitting (i.e. the getSplits method of the InputFormat class), regardless of the number of splits, or afterwards, depending on the splits?

The input is already available as physical HDFS blocks. A MapReduce job works on InputSplits. Blocks and InputSplits may or may not be the same. A block is a physical entity and an InputSplit is a logical entity. Refer to the SE question below for more details:

How does Hadoop perform input splits?
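
For reference, the relationship between block size and split size in FileInputFormat comes down to a one-line formula; the helper below mirrors what FileInputFormat.computeSplitSize does (shown as a standalone sketch, not the actual Hadoop class):

```java
public class SplitSizing {
    // splitSize = max(minSize, min(maxSize, blockSize)).
    // With the default mapreduce.input.fileinputformat.split.minsize/maxsize values
    // this collapses to the block size, so one split per block is the common case.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128L * 1024 * 1024;                                  // assume 128 MB blocks
        System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE)); // prints 134217728
    }
}
```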

4. Are the blocks the same before and after running the task, or does it depend on the configuration? And are there two types of blocks, one for storing the files and one for grouping the files and sending them over the network to data nodes for executing the task?

Mapper input: the input blocks pre-exist. The map phase starts on input blocks/splits that were stored in HDFS before the job began.

Mapper output: not stored in HDFS. It is written to the local disk of the node running the mapper; it would not make sense to store intermediate results in HDFS with a replication factor greater than 1.

Reducer output: stored in HDFS. The number of blocks will depend on the size of the reducer output data.
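
A hedged sketch of that last point (the job name, output path, and reducer count are made up): each reducer writes one part-r-NNNNN file into the job's output directory, and those files are ordinary HDFS files, so they are cut into blocks like any other data:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class OutputConfig {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "example job");
        job.setNumReduceTasks(2); // two reducers => part-r-00000 and part-r-00001 in HDFS
        FileOutputFormat.setOutputPath(job, new Path("/user/hadoop/output")); // hypothetical
        // ...mapper/reducer classes and the input path would be set here
        //    before calling job.waitForCompletion(true)
    }
}
```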

Ravindra babu
  • Thanks for replying. Regarding my first question, I meant whether I can see the blocks from the hosting file system, e.g. in Explorer on Windows. Regarding the second question, you said "the input splits are existing in Datanodes"; I think there is a mix-up here between splits (logical) and blocks (physical), because the blocks exist on the datanodes, not the splits. – Mosab Shaheen Nov 29 '16 at 17:26
  • Still, my main question has not been answered: are the blocks before splitting (i.e. before executing getSplits from the InputFormat class and before running the mappers) the same as after splitting (i.e. after executing getSplits and before running the mappers)? If you know, please tell me, and thanks for your cooperation.