What kind of JBOD in hadoop? and COW with hadoop?

Question

New to hadoop, only setup a 3 debian server cluster for practice.

I was researching best practices on hadoop and came across: JBOD no RAID Filesystem: ext3, ext4, xfs - none of that fancy COW stuff you see with zfs and btrfs

So I raise these questions...

Everywhere I read JBOD is better then RAID in hadoop, and that the best filesystems are xfs and ext3 and ext4. Aside from the filesystem stuff which totally makes sense why those are the best... how do you implement this JBOD? You will see my confusion if you do the google search your self, JBOD alludes to a linear appendage or combination of just a bunch of disks kind of like a logical volume well at least thats how some people explain it, but hadoop seems to want a JBOD that doesnt combine. No body expands on that...

Question 1) What does everyone in the hadoop world mean by JBOD and how do you implement that?
Question 2) Is it as simple as mounting each disk to a different directory is all?
Question 3) Does that mean that hadoop runs best on a JBOD where each disk is simply mounted to a different directory?
Question 4) And then you just point hadoop at those data.dirs?
Question5) I see JBODS going 2 ways, either each disk going to a seperate mount, or a linear concat of disks, which can be done mdadm --linear mode, or lvm i bet can do it too, so I dont see the big deal with that... And if thats the case, where mdadm --linear or lvm can be used because the JBOD people are refering to is this concat of disks, then which is the best way to "JBOD" or linearly concat disks for hadoop?

This is off topic, but can someone verify if this is correct as well? Filesystems that use cow, copy on write, like zfs and btrfs just slow down hadoop but not only that the cow implementation is a waste with hadoop.

Question 6) Why is COW and things like RAID a waste on hadoop? I see it as if your system crashes and you use the cowness of if to restore it, by the time you restored your system, there have been so many changes to hdfs it will probably just consider that machine as faulty and it would be better to rejoin it from scratch (bring it up as a fresh new datanode)... Or how will the hadoop system see the older datanode? My guess is it wont think its old or new or even a datanode, it will just see it as garbage... Idk...
Question 7) What happens if hadoop sees a datanode that fell off the cluster and then the datanode comes back online with data slightly older? Is there an extent to how old the data has to be ??? how does this topic?

REASKING QUESTION 1 THRU 4

I just realized my question is so simple yet it's so hard for me to explain it that I had to split it up into 4 questions, and i still didn't get the answer I'm looking for, from what sounds like very smart individuals, so i must re-ask differently..
On paper I could easily or with a drawing... I'll attempt with words again..
If confused on what I am asking in the JBOD question...
** just wondering what kind of JBOD everyone keeps referring to in the hadoop world is all **
JBODs are defined differently with hadoop then in normal world and I want to know how the best way to implement hadoop is it on a concat of jbods(sda+sdb+sdc+sdd) or just leave the disks alone(sda,sdb,sdc,sdd)
I think the graphical representation below explain what I am asking best

(JBOD METHOD 1)

normal world: jbod is a concatination of disks - then if you were to use hadoop you would overlay the data.dir (Where hdfs virtualy sites) onto a directory inside this concat of disks, ALSO all of the disks would appear as 1... so if you had sda and sdb and sdc as your data disks in your node, you would make em appear as some entity1 (either with the hardware of the motherboard or mdadm or lvm) which is a linear concat of sda and sdb and sdc. you would then mount this entity1 to a folder in the Unix namespace like /mnt/jbod/ and then setup hadoop to run with in it.
TEXT SUMMARY: if disk 1 and disk2 and disk 3 were each 100gb and 200gb and 300gb big respectively then this jbod would be 600gb big, and hadoop from this node would gain 600gb in capacity

* TEXTO-GRAPHICAL OF LINEAR CONCAT OF DISKS BEING A JBOD: * disk1 2 and 3 used for datanode for hadoop * disk1 is sda 100gb * disk2 is sdb 200gb * disk3 is sdc 300gb * sda + sdb + sdc = jbod of name entity1 * JBOD MADE ANYWAY - WHO CARES - THATS NOT MY QUESTION: maybe we made the jbod of entity1 with lvm, or mdadm using linear concat, or hardware jbod drivers which combine disks and show them to the operating system as entity1, it doesn't matter, either way its still a jbod * This is the type of JBOD I am used to and I keep coming across when I google search JBOD * cat /proc/partitions would show sda,sdb,sdc and entity1 OR if we used hardware jbod maybe sda and sdb and sdc would not show and only entity1 would show, again who cares how it shows * mount entity1 to /mnt/entity1 * running "df" would show that entity1 is 100+200+300=600gb big * we then setup hadoop to run its datanodes on /mnt/entity1 so that datadir property points at /mnt/entity1 and the cluster just gained 600gb of capacity

..the other perspective is this..

(JBOD METHOD 2)

in hadoop it seems to me they want every disk seperate. So I would mount disk sda and sdb and sdc in the unix namespace to /mnt/a and /mnt/b and /mnt/c... it seems from reading across the web lots of hadoop experts classify jbods as just that just a bunch of disks so to unix they would look like disks not a concat of the disks... and then of course i can combine then to become one entity either with logical volume manager (lvm) or mdadm (in a raid or linear fashion, linear prefered for jbod) ...... but...... nah lets not combine them because it seems in the hadoop world jbod is just a bunch of disks sitting by them selves...
if disk 1 and disk2 and disk 3 were each 100gb and 200gb and 300gb big respectively then each mount disk1->/mnt/a and disk2->/mnt/b and disk3->/mnt/c would each be 100gb and 200gb and 300gb big respectively, and hadoop from this node would gain 600gb in capacity

TEXTO-GRAPHICAL OF LINEAR CONCAT OF DISKS BEING A JBOD * disk1 2 and 3 used for datanode for hadoop * disk1 is sda 100gb * disk2 is sdb 200gb * disk3 is sdc 300gb * WE DO NOT COMBINE THEM TO APPEAR AS ONE * sda mounted to /mnt/a * sdb mounted to /mnt/b * sdc mounted to /mnt/c * running a "df" would show that sda and sdb and sdc have the following sizes: 100,200,300 gb respectively * we then setup hadoop via its config files to lay its hdfs on this node on the following "datadirs": /mnt/a and /mnt/b and /mnt/c.. gaining 100gb to the cluster from a, 200gb from b and 300gb from c... for a total gain of 600gb from this node... nobody using the cluster would tell the difference..

SUMMARY OF QUESTION

** Which method is everyone referring to is BEST PRACTICE for hadoop this combination jbod or a seperation of disks - which is still also a jbod according to online documentation? **

Both cases would gain hadoop 600gb... its just 1. looks like a concat or one entity that is a combo of all the disks, which is what I always thought was a jbod... Or it will be like 2 where each disk in the system is mounted to different directory, end result is all the same to hadoop capacity wise... just wondering if this is the best way for performance

SSaikia_JtheRocker · Answer 1 · 2013-07-19T18:02:14.000

I can try to answer few of the questions - tell me wherever you disagree.

1.JBOD: Just a bunch of disks; an array of drives, each of which is accessed directly as an independent drive. From Hadoop Definitive Guide, topic Why not use RAID?, says that RAID Read and Write performance is limited by the slowest disk in the Array. In addition, in case of HDFS, replication of data occurs across different machines residing in different racks. This handle potential loss of data even if a rack fails. So, RAID isn't that necessary. Namenode can though use RAID as mentioned in the link.

2.Yes That means independent disks (JBODs) mounted in each of the machines (e.g. /disk1, /disk2, /disk3 etc.) but not partitioned.

3, 4 & 5 Read Addendum

6 & 7. Check this link to see how replication of blocks occurs

Addendum after the comment:

Q1. Which method is everyone refering to is BEST PRACTICE for hadoop this combination jbod or a seperation of disks - which is still also a jbod according to online documentation?

Possible answer: From Hadoop Definitive Guide -

You should also set the dfs.data.dir property, which specifies a list of directories for a datanode to store its blocks. Unlike the namenode, which uses multiple directories for redundancy, a datanode round-robins writes between its storage directories, so for performance you should specify a storage directory for each local disk. Read performance also benefits from having multiple disks for storage, because blocks will be spread across them, and concurrent reads for distinct blocks will be correspondingly spread across disks.

For maximum performance, you should mount storage disks with the noatime option. This setting means that last accessed time information is not written on file reads, which gives significant performance gains.

Q2. Why LVM is not a good idea?

Avoid RAID and LVM on TaskTracker and DataNode machines – it generally reduces performance.

This is because LVM creates logical layer over the individual mounted disks in a machine.

Check this link for TIP 1 more details. There are use cases where using LVM performed slow when running Hadoop jobs.

ehh question 1 thru 4 got misunderstood, i reedited my questions above to include a new section that attempts to reask it — kossboss, Jul 18 '13 at 03:19
I have enhanced my answer a little bit more. I hope it helps! — SSaikia_JtheRocker, Jul 19 '13 at 18:08
@user2580961: I'm not Sir since, I'm no expert and still learning. You can accept the answer, if it helped you and if you think it's quite appropriate. Thank. — SSaikia_JtheRocker, Jul 30 '13 at 21:27

score 5 · Answer 2 · answered Dec 21 '14 at 23:40

I'm late to the party but maybe I can chime in:

JBOD

Question 1) What does everyone in the hadoop world mean by JBOD and how do you implement that?

Just a bunch of disks... you just format the whole disk and include it in the ´hdfs-site.xmlandmapred-site.xmloryarn-site-xml` on the datanodes. Hadoop takes care about distributing blocks across disks.

Question 2) Is it as simple as mounting each disk to a different directory is all?

Yes.

Question 3) Does that mean that hadoop runs best on a JBOD where each disk is simply mounted to a different directory?

Yes. Hadoop does checksumming of the data and periodically verifies these checksums.

Question 4) And then you just point hadoop at those data.dirs?

Exactly. But there are directories for data storage (HDFS) and computation (MapReduce, YARN, ..) you can configure different directories and disks for certain tasks.

Question 5) I see JBODS going 2 ways, either each disk going to a seperate mount, or a linear concat of disks, which can be done mdadm --linear mode, or lvm i bet can do it too, so I dont see the big deal with that... And if thats the case, where mdadm --linear or lvm can be used because the JBOD people are refering to is this concat of disks, then which is the best way to "JBOD" or linearly concat disks for hadoop?

The problem is faulty disks. If you keep it simple and just mount each disks at a time you just have to replace this disk. If you use mdadm or LVM in ja JBOD configuration you have are prone to loosing more data in case a disk dies as the striped or concat configuration may not survive a disk failure. As data for more blocks is spread across multiple disks.

Question 6) Why is COW and things like RAID a waste on hadoop? I see it as if your system crashes and you use the cowness of if to restore it, by the time you restored your system, there have been so many changes to hdfs it will probably just consider that machine as faulty and it would be better to rejoin it from scratch (bring it up as a fresh new datanode)... Or how will the hadoop system see the older datanode? My guess is it wont think its old or new or even a datanode, it will just see it as garbage... Idk...

HDFS is a competently seperate layer atop of your native filesystem. Disk failures are expected and that's why all data blocks are replicated at least 3 times across several machines. HDFS also does it's own checksumming so if the checksum of a block mismatches a replica of this block is used and the broken block will be deleted by HDFS.

So in theory it makes no sense to use RAID or COW for Hadoop drives.

It can make sense through if you have to deal with faulty disks that can't be replaced instantly.

Question 7) What happens if hadoop sees a datanode that fell off the cluster and then the datanode comes back online with data slightly older? Is there an extent to how old the data has to be ??? how does this topic?

The NameNode has a list of blocks and their locations on the datanodes. Each block has a checksum and locations. If a datanode goes down in a cluster the namenode replicates the blocks of this datanode to other datanodes.

If an older datanode comes online it is sending it's block list to the NameNode and depending on how many of the blocks are already replicated or not it will delete unneeded blocks on this datanode.

The age of data is not important it's only about blocks. If the NameNode still maintains the blocks and the datanode has them, they will be used again.

ZFS/btrfs/COW

In theory they additional features these filesystems provide are not required for Hadoop. However as you usally use cheap and huge 4TB+ desktop drives for datanodes you can run into problems if these disks start to fail.

ext4 remounts itself read-only on failure and at this point you'll see the disk dropping out of the HDFS on the datanode it it's configured to loose drives or you'll see the datanode die if disk failures that are not allowed. This can be a problem because modern drives often exhibit some bad sectors while still functioning fine for the most part and it's work intensive to fsck this disks and restart the datanode.

Another problem are computations through YARN/MapReduce.. these write also intermediate data on the disks and if this data gets corrupted or can't be written you'll run into errors. I'm not sure if YARN/MapReduce also checksum their temporary files - I think it's implemented through.

ZFS and btrfs provide some resilience against this errors on modern drives as they are able to deal better with corrupted metadata and avoid lengthy fsck checks due to internal checksumming.

I'm running a Hadoop cluster on ZFS (just JBOD with LZ4) with lots of disks that exhibitit some bad sectors and that are out of warranty but still perform well and it works fine despite these errors.

If you can replace the faulty disks instantly it does not matter much. If you need to live with partly broken disks ZFS/btrfs will buy you some time before having to replace the disks.

COW is not needed because Hadoop takes care of replication and security. Compression can be usefull if you store your data uncompressed on the cluster. LZ4 in ZFS should not provide a performance penalty and can speed up sequential reads (like HDFS and MapReduce do them).

Performance

The case against RAID is that at least MapReduce is implementing something similiar.. HDFS can read and write concurrently to all of the disks and usally multiple map and reduce jobs are running that can use a whole disk for writing and reading their data.

If you put RAID or striping below Hadoop these jobs have all to enqueue their reads and write to the single RAID controller and overall it's likely slower.

Depending on your jobs it can make sense to use something like RAID-0 for pairs of disks but be sure to first verify that sequential read or write is really the bottleneck for your job (and not the network, HDFS replication, CPU, ..) but make sure first that what you are doing is worth the work and the hassle.