As per the Hadoop 3.x release notes, erasure coding was introduced to reduce the storage overhead of replication.
Erasure coding is a method for durably storing data with significant space savings compared to replication. Standard encodings like Reed-Solomon (10,4) have a 1.4x space overhead, compared to the 3x overhead of standard HDFS replication.
Since erasure coding imposes additional overhead during reconstruction and performs mostly remote reads, it has traditionally been used for storing colder, less frequently accessed data. Users should consider the network and CPU overheads of erasure coding when deploying this feature.
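If I understand the numbers correctly, the 1.4x and 3x figures are just the ratio of bytes stored to bytes of user data:

```
RS(10,4):       (10 data + 4 parity) / 10 data = 1.4x stored per byte of data
3x replication:  3 copies / 1 original          = 3.0x stored per byte of data
```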
I am looking for sample configuration files for this.
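To make it concrete, something along these lines is what I have in mind for hdfs-site.xml. The property below is the system default EC policy setting as I understand it from the Hadoop 3.x documentation; please correct me if other properties are needed:

```xml
<!-- hdfs-site.xml: my current understanding of the relevant setting -->
<property>
  <name>dfs.namenode.ec.system.default.policy</name>
  <value>RS-6-3-1024k</value>
  <description>Default EC policy used when -setPolicy is run without -policy.</description>
</property>
```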
Also, even after setting up the EC policy and enabling it using hdfs ec -enablePolicy, does the policy apply only to cold files, or is it applied to all HDFS files by default?
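For reference, these are the steps I ran (the path /data/cold and the policy RS-10-4-1024k are just examples from my test setup):

```
# list the EC policies known to the cluster and their state
hdfs ec -listPolicies

# enable one of the built-in policies
hdfs ec -enablePolicy -policy RS-10-4-1024k

# attach the policy to a directory (my assumption is that this step is also required)
hdfs ec -setPolicy -path /data/cold -policy RS-10-4-1024k

# verify which policy is in effect on the directory
hdfs ec -getPolicy -path /data/cold
```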