
I have no practical experience with Hadoop -- I have only learnt some theory. The task I am faced with is to process a huge CSV file (far larger than memory) using a cluster, and I have come up with the following procedure.

Suppose the CSV file contains 300 million lines and I call lines 1-100 million part 1, lines 101-200 million part 2, and lines 201-300 million part 3. (This is only an example; in practice the data has to be partitioned into many more parts so that each can be processed in memory.)

I want to distribute the data onto the nodes in the following way:

Node number    Data taken
Node 1         part 1 only
Node 2         part 2 only
Node 3         part 3 only
Node 4         parts 1 and 2
Node 5         parts 2 and 3
Node 6         parts 1 and 3

You see, some nodes take only one part of the data and some take two. Depending on this, one of two functions is applied on each node. I have learnt that this can be done via an if-else statement in the reducer, i.e. my reducer should look like this:

If (node 1,2,3) run function f1(data_block)

If (node 4,5,6) run function f2(data_blockA,data_blockB)
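
In Python with Hadoop Streaming, I picture the reducer roughly as follows. This is only a sketch: f1 and f2 are placeholders for my actual functions, and I am assuming the mapper has already prefixed every CSV line with a tab-separated tag for the part it belongs to ("part1", "part2", "part3").

    #!/usr/bin/env python
    # Rough sketch of the reducer I have in mind for Hadoop Streaming.
    # Assumption: stdin lines look like "part1\t<csv line>".
    import sys
    from collections import defaultdict

    def f1(block):
        """Placeholder for the single-part function."""
        pass

    def f2(block_a, block_b):
        """Placeholder for the two-part function."""
        pass

    def main():
        parts = defaultdict(list)          # part tag -> list of CSV lines
        for line in sys.stdin:
            tag, _, row = line.rstrip("\n").partition("\t")
            parts[tag].append(row)

        tags = sorted(parts)
        if len(tags) == 1:                 # this reducer only saw one part
            f1(parts[tags[0]])
        elif len(tags) == 2:               # this reducer saw two parts
            f2(parts[tags[0]], parts[tags[1]])

    if __name__ == "__main__":
        main()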

The problem is that most of the Hadoop examples I have studied do not allow each node to choose which part of the data it reads; data are distributed to the nodes in a rather black-box way. Is there any way to get around this? P.S. I am planning to rely on Hadoop Streaming, as my primary language is Python, not Java, so this could be another constraint.

nobody

1 Answer


In the HDFS architecture there is a concept of blocks. A typical block size used by HDFS is 64 MB. When you place a large file into HDFS, it is chopped up into 64 MB chunks (based on the default block configuration). Suppose you have a 1 GB file and you want to place it in HDFS: there will be 1 GB / 64 MB = 16 splits/blocks, and these blocks will be distributed across the DataNodes.
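
As a quick illustration (the sizes below are only the example numbers from this answer, not from your cluster), the block count is just the file size divided by the block size, rounded up:

    import math

    BLOCK_SIZE = 64 * 1024 * 1024         # 64 MB, the default block size used in this example
    file_size  = 1 * 1024 * 1024 * 1024   # the 1 GB example file

    num_blocks = math.ceil(file_size / BLOCK_SIZE)
    print(num_blocks)                     # -> 16; the last block may be smaller than 64 MB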

Data splitting happens based on file offsets. The goal of splitting the file is parallel processing and failover of data.

These blocks/chunks will reside on different DataNodes based on your cluster configuration. Every block gets assigned a block ID, and the NameNode keeps the block information for every file.

Suppose you have a 128 MB file and you want to write it to HDFS.

The client machine first splits the file into blocks, say block A and block B. Then the client machine contacts the NameNode and asks for locations to write the blocks (block A and block B). The NameNode gives the client a list of available DataNodes to write the data to.

The client then chooses the first DataNode from that list and writes the first block to it, and that DataNode replicates the block to another DataNode. Once the writing and replication complete, the first DataNode sends an acknowledgement for the blocks it received. The client then writes the next block. The NameNode keeps the information about files and their associated blocks.

When the client makes a request to read the data, it again contacts the NameNode first to get the locations of a specific file's data, and the NameNode returns the block information to the client.

So you don't need to worry about data placement on HDFS.

Answer to your question:

There is no way to control the data placement policy in Hadoop. However, if you divide your file based on the HDFS block size (say the block size is 64 MB and your data size is 63 MB), then each file will occupy one block and go to one specific DataNode, but again that DataNode will be chosen by the NameNode. Later on you can check which DataNode your file resides on.
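
For example (a sketch only; it assumes the hdfs command-line client is configured on the machine you run it from, and the path is a placeholder), you can check where the blocks of a file ended up by running hdfs fsck and reading its report:

    # Sketch: list the blocks of a file already stored in HDFS and the
    # DataNodes holding their replicas.
    # "/user/me/data.csv" is a placeholder path; replace it with your own.
    import subprocess

    report = subprocess.run(
        ["hdfs", "fsck", "/user/me/data.csv", "-files", "-blocks", "-locations"],
        capture_output=True, text=True, check=True,
    ).stdout

    # Lines containing "blk_" describe individual blocks and their locations.
    for line in report.splitlines():
        if "blk_" in line:
            print(line.strip())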

But placing small files in Hadoop is not an efficient way to work with it, because Hadoop is designed for very large datasets, and many small files become an overhead on the NameNode. Please see this link for the small-files problem on Hadoop.

The links below may be useful to learn more about Hadoop:

http://docs.spring.io/spring-hadoop/docs/2.0.4.RELEASE/reference/html/store.html

http://www.aosabook.org/en/hdfs.html

Sandeep Singh

  • Thank you very much for your detailed comment. From what I understand from your words: data are split into chunks on HDFS, and the NameNode distributes these chunks onto the DataNodes. I guess the key to my next question is whether there is any way I can instruct how the NameNode distributes the chunks? – nobody May 22 '15 at 15:31
  • Following up on my last comment: the reason is that my data is sorted on a particular attribute. It is crucial in my algorithm to distribute the data in a way that respects that order, e.g. Node 1 should contain all data with this attribute larger than all data in Node 2. So, if HDFS splits my data, say 1 GB into 16 chunks, then it is crucial for one node to get, say, chunks 1-6, the second node chunks 7-12, etc. And yes, I do have a very large dataset: about 40 GB in a CSV file. – nobody May 22 '15 at 15:32
  • You cannot instruct Hadoop to put data onto a specific node. I could not understand your second question. Can you elaborate? – Sandeep Singh May 22 '15 at 15:43
  • So my data is originally sorted according to, say, age. I want to distribute the data in such a way that Node 1 has everyone with age higher than the people in Node 2. Is this possible? – nobody May 22 '15 at 16:00
  • No, but you can write your MapReduce program, Hive query or Pig program (your sorting logic) to read/process the data in such a way (a rough sketch of such a mapper is given below). It is not possible to distribute the data so that Node 1 has everyone with age higher than the people in Node 2 at the time you put the data into HDFS. – Sandeep Singh May 22 '15 at 16:05
  • Thank you very much for your comment. I think I do not have enough knowledge to understand everything you have written. I will read more into the links you gave me, and in the meanwhile I will leave the question open to see if someone could provide an answer that people like me can understand... thank you very much for your input again. – nobody May 22 '15 at 19:03
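
Following the suggestion in the comments above, here is a rough sketch of a Hadoop Streaming mapper in Python that tags each CSV record with an age-range bucket, so that the sort/shuffle groups each age range together on the reduce side instead of relying on HDFS placement. The age column index and the bucket boundaries are assumptions for illustration, not part of the original discussion.

    #!/usr/bin/env python
    # Sketch of a Hadoop Streaming mapper: tag every CSV record with an
    # age-range bucket so the shuffle groups each range together.
    # Assumptions (not from the discussion above): age is in the second CSV
    # column, and the bucket boundaries are made up for illustration.
    import sys

    def bucket(age):
        if age >= 60:
            return 0          # oldest range
        if age >= 30:
            return 1          # middle range
        return 2              # youngest range

    for line in sys.stdin:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split(",")
        age = int(fields[1])                 # assumed age column
        # Emit "bucket<TAB>record": all records with the same bucket key are
        # grouped together on the reduce side, so each group covers one
        # contiguous age range.
        print("%d\t%s" % (bucket(age), line))

The script would be passed to the streaming jar as the -mapper; guaranteeing exactly one age range per reducer would additionally need a suitable partitioner, which streaming also lets you specify.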