
I have installed Hadoop in pseudo-distributed mode on my laptop; the OS is Ubuntu.

I have changed the paths where Hadoop stores its data (by default, Hadoop stores its data in the /tmp folder).

My hdfs-site.xml file looks like this:

<property>
    <name>dfs.data.dir</name>
    <value>/HADOOP_CLUSTER_DATA/data</value>
</property>

Now, whenever I restart the machine and try to start the Hadoop cluster using the start-all.sh script, the data node never starts. I confirmed that the data node does not start by checking the logs and by using the jps command.

Then I:

  1. Stopped the cluster using the stop-all.sh script.
  2. Formatted HDFS using the hadoop namenode -format command.
  3. Started the cluster using the start-all.sh script.

Now everything works fine, even if I stop and start the cluster again. The problem occurs only when I restart the machine and then try to start the cluster.

  • Has anyone encountered a similar problem?
  • Why is this happening?
  • How can I solve it?
Shekhar

2 Answers


By changing dfs.datanode.data.dir away from /tmp you indeed made the data (the blocks) survive a reboot. However, there is more to HDFS than just blocks. You need to make sure all the relevant directories point away from /tmp, most notably dfs.namenode.name.dir (I can't tell which other directories you have to change, as it depends on your config, but the namenode directory is mandatory, and may also be sufficient).

I would also recommend using a more recent Hadoop distribution. By the way, in Hadoop 1.1 the namenode directory setting is dfs.name.dir.
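For example, reusing the /HADOOP_CLUSTER_DATA layout from the question (the name subdirectory below is only an illustration), the Hadoop 1.x setting would look like:

```xml
<property>
    <name>dfs.name.dir</name>
    <value>/HADOOP_CLUSTER_DATA/name</value>
</property>
```

After pointing the namenode directory away from /tmp, you will need to format the namenode once more so it initializes the new location.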

Remus Rusanu

For those using Hadoop 2.0 or later, the configuration property names may be different.

As this answer points out, go to the /etc/hadoop directory of your Hadoop installation.

Open the file hdfs-site.xml. This user configuration overrides the default Hadoop configuration, which is loaded earlier by the Java classloader.

Add a dfs.namenode.name.dir property and set a new namenode directory (the default is file://${hadoop.tmp.dir}/dfs/name).

Do the same for the dfs.datanode.data.dir property (the default is file://${hadoop.tmp.dir}/dfs/data).

For example:

<property>
    <name>dfs.namenode.name.dir</name>
    <value>/Users/samuel/Documents/hadoop_data/name</value>
</property>
<property>
    <name>dfs.datanode.data.dir</name>
    <value>/Users/samuel/Documents/hadoop_data/data</value>
</property>

Another property where a tmp directory appears is dfs.namenode.checkpoint.dir. Its default value is file://${hadoop.tmp.dir}/dfs/namesecondary.

If you want, you can also add this property:

<property>
    <name>dfs.namenode.checkpoint.dir</name>
    <value>/Users/samuel/Documents/hadoop_data/namesecondary</value>
</property>
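Putting it together, a possible command-line sequence might look like this (the paths are the example ones from this answer, and $HADOOP_HOME is assumed to point at your Hadoop installation; adjust both to your machine):

```shell
# Create the directories referenced in the properties above
# (example paths; adjust to your own machine)
mkdir -p /Users/samuel/Documents/hadoop_data/name
mkdir -p /Users/samuel/Documents/hadoop_data/data
mkdir -p /Users/samuel/Documents/hadoop_data/namesecondary

# A brand-new namenode directory must be formatted before use.
# Note: formatting wipes existing HDFS metadata, so only do this on a fresh setup.
$HADOOP_HOME/bin/hdfs namenode -format

# Restart HDFS so the new directories take effect (Hadoop 2.x scripts)
$HADOOP_HOME/sbin/stop-dfs.sh
$HADOOP_HOME/sbin/start-dfs.sh
```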
Samuel
    After doing this, don't forget to format your file system: `$HADOOP_HOME/bin/hdfs namenode -format`. If the $HADOOP_HOME environment variable is not set, replace it with the absolute path to your Hadoop installation. – Samuel Jun 29 '16 at 13:06