
I am new to Hadoop and need to learn the details of backup and recovery. I have revised Oracle backup and recovery; will it help with Hadoop? Where should I start?

Anand Kamathi
  • Yes, you should learn about the backup and recovery process of Hadoop. Please see this related post: http://stackoverflow.com/questions/28038121/hadoop-disaster-recovery-and-prevent-data-loss – Sandeep Singh May 14 '15 at 13:46

4 Answers


There are a few options for backup and recovery. As s.singh points out, data replication is not DR.

HDFS supports snapshotting. This can be used to prevent user errors, recover files, etc. That being said, this isn't DR in the event of a total failure of the Hadoop cluster. (http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsSnapshots.html)
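As a quick illustration (the paths and snapshot name here are made up), taking a snapshot and restoring a file from it looks roughly like this:

```shell
# An administrator must first allow snapshots on the directory
hdfs dfsadmin -allowSnapshot /user/data

# Create a named snapshot; it appears under /user/data/.snapshot/
hdfs dfs -createSnapshot /user/data before-cleanup

# Recover an accidentally deleted file by copying it back out of the snapshot
hdfs dfs -cp /user/data/.snapshot/before-cleanup/report.csv /user/data/
```

Snapshots are read-only and cheap to create (they record metadata, not full copies), which is why they protect against user error but not against losing the cluster itself.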

Your best bet is keeping off-site backups. This can be another Hadoop cluster, S3, etc., and can be performed using distcp. (http://hadoop.apache.org/docs/stable1/distcp2.html), (https://wiki.apache.org/hadoop/AmazonS3)
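For example (the cluster addresses and bucket name are placeholders), copying between clusters or to S3 with distcp looks like:

```shell
# Copy a directory from one cluster to another (typically run on the destination side)
hadoop distcp hdfs://source-nn:8020/data/warehouse hdfs://backup-nn:8020/data/warehouse

# Incremental copy: only transfer files that are new or changed since the last run
hadoop distcp -update hdfs://source-nn:8020/data/warehouse hdfs://backup-nn:8020/data/warehouse

# Back up to Amazon S3 (requires S3 credentials configured for the s3a connector)
hadoop distcp /data/warehouse s3a://my-backup-bucket/warehouse
```

Because distcp runs as a MapReduce job, it parallelizes the copy across the cluster, which is what makes it practical for large volumes.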

Here is a Slideshare by Cloudera discussing DR (http://www.slideshare.net/cloudera/hadoop-backup-and-disaster-recovery)

brandon.bell

Hadoop is designed to run on large clusters with thousands of nodes, so the chance of data loss is low. You can increase the replication factor to replicate the data to more nodes across the cluster.

Refer to Data Replication
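For instance (the path and factor are just examples), you can raise the replication factor of existing files from the command line:

```shell
# Set replication factor 5 for everything under /important,
# and wait (-w) until the extra replicas have actually been created
hdfs dfs -setrep -w 5 /important
```

The cluster-wide default for newly written files comes from the dfs.replication property in hdfs-site.xml (3 by default).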

For NameNode metadata backup, you can use either the Secondary NameNode or Hadoop High Availability.

Secondary Namenode

The Secondary NameNode takes periodic checkpoints of the NameNode's metadata by merging the edit logs into the fsimage. If the NameNode fails, you can recover the metadata (which holds the data block information) from the Secondary NameNode's checkpoint.
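If the NameNode's metadata directories are lost, one recovery path (a sketch, assuming you have copied the Secondary NameNode's checkpoint directory to the new host) is to import that checkpoint into a fresh NameNode:

```shell
# With dfs.namenode.checkpoint.dir pointing at the copied checkpoint and the
# NameNode's own name directories empty, load the checkpoint as the new fsimage
hdfs namenode -importCheckpoint
```

Note that the checkpoint only reflects the state as of the last merge, so any edits made after that checkpoint are lost; this is exactly the gap that HA's shared edit log closes.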

High Availability

High Availability is a newer feature that runs more than one NameNode in the cluster. One NameNode is active and the other is on standby. The edit log is saved to both NameNodes. If the active NameNode fails, the standby becomes active and handles operations.
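With HA configured, checking and switching NameNode states looks like the sketch below (the service IDs nn1/nn2 are whatever your configuration defines):

```shell
# Check which NameNode is currently active and which is standby
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2

# Manually fail over from nn1 to nn2
# (automatic failover instead uses ZooKeeper and the ZKFailoverController)
hdfs haadmin -failover nn1 nn2
```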

But in most cases we also need to plan for backup and disaster recovery. Refer to @brandon.bell's answer.

Kumar
  • @Kumar: Replication is not designed for disaster recovery. Data replication is only useful in the case of node failure. Even a High Availability cluster is not designed for disaster recovery; it ensures the availability of your cluster. When we deal with sensitive data we should care about backup and recovery. Please see my previous post for several approaches to disaster recovery: http://stackoverflow.com/questions/28038121/hadoop-disaster-recovery-and-prevent-data-loss – Sandeep Singh May 14 '15 at 13:43
  • There's always a need for backups. At the very least, you need to be able to protect against the logical loss of data. Day 1, Bob is told "purge that stuff, we don't need it"; Day 5, someone asks Bob where all the useful data went. Replication isn't sufficient if it replicates deletions. – EightBitTony Jan 05 '16 at 13:32
  • In addition to protecting against user errors and logical loss of data, you need some sort of backups for internal audit/compliance purposes, e.g. you have to keep backups for a certain number of months/years depending on your industry. – JStorage Mar 22 '16 at 19:16

You can use the HDFS sync application on DataTorrent for DR use cases to back up high volumes of data from one HDFS cluster to another.

https://www.datatorrent.com/apphub/hdfs-sync/

It uses Apache Apex as a processing engine.

ashwin111

Start with the official documentation website: HdfsUserGuide

Have a look at the SE posts below:

Hadoop 2.0 data write operation acknowledgement

Hadoop: HDFS File Writes & Reads

Hadoop 2.0 Name Node, Secondary Node and Checkpoint node for High Availability

How does Hadoop Namenode failover process works?

Documentation page regarding Recovery_Mode:

Typically, you will configure multiple metadata storage locations. Then, if one storage location is corrupt, you can read the metadata from one of the other storage locations.

However, what can you do if the only storage locations available are corrupt? In this case, there is a special NameNode startup mode called Recovery mode that may allow you to recover most of your data.

You can start the NameNode in recovery mode like so: namenode -recover
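A hedged sketch of such a session (the -force option is documented alongside recovery mode; use it with caution, since recovery can discard unrecoverable data):

```shell
# Start the NameNode in recovery mode; it prompts interactively
# about how to handle each piece of corrupt or missing metadata
hdfs namenode -recover

# Non-interactive variant: automatically take the first (default) choice
# at every prompt
hdfs namenode -recover -force
```

Back up the remaining metadata directories before attempting recovery, since the process modifies them in place.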

Ravindra babu