
I am trying to migrate existing data (JSON) in my Hadoop cluster to Google Cloud Storage.

I have explored gsutil, which seems to be the recommended option for moving big datasets to GCS, and it appears to handle huge datasets well. However, gsutil seems to only move data from a local machine to GCS (or between S3 and GCS); it cannot move data out of a local Hadoop cluster.

  1. What is the recommended way of moving data from a local Hadoop cluster to GCS?

  2. In the case of gsutil, can it move data directly from the local Hadoop cluster (HDFS) to GCS, or do I first need to copy the files onto the machine running gsutil and then transfer them to GCS?

  3. What are the pros and cons of using the Google client-side (Java API) libraries vs gsutil?

Thanks a lot,

– obaid

2 Answers


Question 1: The recommended way of moving data from a local Hadoop cluster to GCS is to use the Google Cloud Storage connector for Hadoop. The instructions on that site are mostly for running Hadoop on Google Compute Engine VMs, but you can also download the GCS connector directly, either gcs-connector-1.2.8-hadoop1.jar if you're using Hadoop 1.x or Hadoop 0.20.x, or gcs-connector-1.2.8-hadoop2.jar for Hadoop 2.x or Hadoop 0.23.x.

Simply copy the jarfile into your hadoop/lib dir or $HADOOP_COMMON_LIB_JARS_DIR in the case of Hadoop 2:

cp ~/Downloads/gcs-connector-1.2.8-hadoop1.jar /your/hadoop/dir/lib/
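
For Hadoop 2, the equivalent copy is a sketch along these lines, assuming $HADOOP_COMMON_LIB_JARS_DIR points at your shared lib directory (it is often relative to the Hadoop install dir, so adjust the path to your layout):

cp ~/Downloads/gcs-connector-1.2.8-hadoop2.jar /your/hadoop/dir/$HADOOP_COMMON_LIB_JARS_DIR/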

You may also need to add the following to your hadoop/conf/hadoop-env.sh file if you're running 0.20.x:

export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/your/hadoop/dir/lib/gcs-connector-1.2.8-hadoop1.jar

Then you'll likely want to use service-account "keyfile" authentication, since you're on an on-premises Hadoop cluster. Visit cloud.google.com/console, find APIs & auth on the left-hand side, and click Credentials. If you don't already have a client ID, click Create new Client ID, select Service account, and then click Create Client ID. For now the connector requires a ".p12" type of keypair, so click Generate new P12 key and keep track of the .p12 file that gets downloaded. It may be convenient to rename it before placing it in a directory more easily accessible from Hadoop, e.g.:

cp ~/Downloads/*.p12 /path/to/hadoop/conf/gcskey.p12

Add the following entries to your core-site.xml file in your Hadoop conf dir:

<property>
  <name>fs.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
</property>
<property>
  <name>fs.gs.project.id</name>
  <value>your-ascii-google-project-id</value>
</property>
<property>
  <name>fs.gs.system.bucket</name>
  <value>some-bucket-your-project-owns</value>
</property>
<property>
  <name>fs.gs.working.dir</name>
  <value>/</value>
</property>
<property>
  <name>fs.gs.auth.service.account.enable</name>
  <value>true</value>
</property>
<property>
  <name>fs.gs.auth.service.account.email</name>
  <value>your-service-account-email@developer.gserviceaccount.com</value>
</property>
<property>
  <name>fs.gs.auth.service.account.keyfile</name>
  <value>/path/to/hadoop/conf/gcskey.p12</value>
</property>

The fs.gs.system.bucket generally won't be used except in some cases for mapred temp files; you may want to just create a new one-off bucket for that purpose. With those settings on your master node, you should already be able to test hadoop fs -ls gs://the-bucket-you-want-to-list. At this point, you can already try to funnel all the data out of the master node with a simple hadoop fs -cp hdfs://yourhost:yourport/allyourdata gs://your-bucket.
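
Spelled out as commands (bucket, host, and port are placeholders for your own values):

hadoop fs -ls gs://the-bucket-you-want-to-list
hadoop fs -cp hdfs://yourhost:yourport/allyourdata gs://your-bucket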

If you want to speed it up using Hadoop's distcp, sync the lib/gcs-connector-1.2.8-hadoop1.jar and conf/core-site.xml to all your Hadoop nodes, and it should all work as expected. Note that there's no need to restart datanodes or namenodes.
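
For example, a distcp run from the cluster might look like this sketch (host, port, and paths are placeholders):

hadoop distcp hdfs://yourhost:yourport/allyourdata gs://your-bucket/allyourdata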

Question 2: While the GCS connector for Hadoop is able to copy directly from HDFS without ever needing an extra disk buffer, gsutil cannot, since it has no way of interpreting the HDFS protocol; it only knows how to deal with actual local filesystem files or, as you said, GCS/S3 files.
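
In other words, with gsutil alone you'd have to stage the data onto a local filesystem first and then upload it, roughly like this sketch (paths and bucket name are placeholders):

# Pull the data out of HDFS onto local disk first...
hadoop fs -get hdfs://yourhost:yourport/allyourdata /local/staging/
# ...then upload to GCS; the -m flag parallelizes the copy
gsutil -m cp -r /local/staging gs://your-bucket/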

Question 3: The benefit of using the Java API is flexibility; you can choose how to handle errors, retries, buffer sizes, etc., but it takes more work and planning. Using gsutil is good for quick use cases, and you inherit a lot of error-handling and testing from the Google teams. The GCS connector for Hadoop is actually built directly on top of the Java API, and since it's all open-source, you can see what kinds of things it takes to make it work smoothly in its source code on GitHub: https://github.com/GoogleCloudPlatform/bigdata-interop/blob/master/gcs/src/main/java/com/google/cloud/hadoop/gcsio/GoogleCloudStorageImpl.java

– Dennis Huo
  • Thanks a lot Dennis for your detailed response. I am also considering gsutil for transferring my 50TB of data to GCS. My criterion for selecting a solution (either the Hadoop connector or gsutil) is the total amount of time it will take to upload the data to GCS. Do you think that a Hadoop solution will be faster than gsutil (gsutil has an option for utilizing multiple cores)? Secondly, will I be able to transfer my 6GB hdfs files without any loss/change in data from HDFS to GCS using the Hadoop connector (since each file is composed of 128MB hdfs blocks)? Thanks again, – obaid Aug 16 '14 at 21:40
  • I just realized that the original answer only posted the first part of my response; I had tried to post it over a flaky data plan, hopefully my answers to Question 2 and Question 3 help clarify the distinction between using the GCS connector vs gsutil. Generally, gsutil's multithreading will help for uploading files from just a single local machine, but for data inside HDFS, gsutil can't really read that data directly and you'll want to use `hadoop distcp` instead, which will be able to utilize all the cores across your cluster. It should be as fast as your network allows, using distcp. – Dennis Huo Aug 17 '14 at 06:43
  • 1
    To more directly answer your followup questions: yes, using `hadoop distcp hdfs://host:port/your-current-file-dir gs://your-bucket/new-file-dir` should definitely be faster than trying to use gsutil itself. Your 6GB hdfs files should arrive in GCS unchanged/intact, even though once on GCS the actual underlying block sizes will be abstracted away (you can really just think of them as monolithic 6GB files in GCS). I'd recommend first testing with a smaller subset or maybe just single files to ensure your distcp works as intended before trying the full 50TB. – Dennis Huo Aug 17 '14 at 06:53
  • Thanks a lot Dennis, I really appreciate your help. I am trying your approach (GCS Hadoop connector), however for some reason I cannot see a Service Account option in Credentials whenever I hit 'Create client id' to generate a key. Does it have something to do with the enabled APIs? I have 4 APIs enabled, i.e. GCS, GCS JSON API, Google Cloud SQL and BigQuery API. I see the following options: Installed application (runs on a desktop computer or handheld device, like Android or iPhone), Android, Chrome Application, iOS, Other. – obaid Aug 18 '14 at 00:30
  • Since service accounts are often used with Google Compute Engine VMs, it's quite possibly related to enabling Google Compute Engine; it shouldn't be necessary to actually create any VMs, try enabling it and see if that opens up the service-account options. – Dennis Huo Aug 18 '14 at 08:07
  • Just double-checked, and I don't think service accounts require enabling any particular API. Most likely, on the first time creating a service account, the GUI is taking you through some extra initial steps that I may have forgotten about, having filled them out long ago in my own projects. [This section in the oauth2 docs](https://developers.google.com/console/help/#generatingoauth2) has screenshots showing what to look for and in particular, [this section](https://developers.google.com/console/help/#service_accounts) is the keypair flow I mentioned. Look for the "Application type" radio buttons. – Dennis Huo Aug 18 '14 at 18:51
  • Thanks Dennis, I solved the problem by just logging in to an owner account. I guess I didn't have permission to generate the key file; my role is Writer. Thanks – obaid Aug 18 '14 at 20:20
  • By the way, I was just wondering how to sync the gcs jar (GCS Hadoop connector jar file) to all hadoop nodes! In the case of hadoop distcp, -libjars is not working in my case. I set HADOOP_CLASSPATH and also provided -libjars to distcp, but I keep getting this exception: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found exception. Just trying to think of a way to use DistributedCache from the command line to somehow sync the jar file to all nodes. Would be great if you could share your approach. – obaid Aug 18 '14 at 21:00
  • Personally, I typically use the GCS connector as a core library, so at cluster setup time I already distribute the jarfile to all the lib/ directories on my workers either with `scp` or running an `ssh` command to download from the public GCS URL. In your case, I'm not entirely familiar with why -libjars isn't working for you, but if you've at least been able to verify that the gcs-connector setup works on the master, you can use `rsync` and the `slaves` file. – Dennis Huo Aug 20 '14 at 02:48
  • For example, if you installed hadoop as username `hadoop` and it's under `/home/hadoop/hadoop-install/` on every machine, you'd do: `for HOST in \`cat conf/slaves\`; do rsync lib/gcs-connector-1.2.8-hadoop1.jar hadoop@$HOST:/home/hadoop/hadoop-install/lib/gcs-connector-1.2.8-hadoop1.jar; done` and then the same for the `core-site.xml` and the `.p12` file. – Dennis Huo Aug 20 '14 at 02:50
  • @Dennis/@GoodDok Do we need to update the core-site.xml for all the nodes in the cluster? – mhlaskar1991 Apr 07 '21 at 13:17

It looks like a few property names have changed in recent versions.

// Newer google.cloud.auth.* property names (replacing the fs.gs.auth.* names above)
String serviceAccount = "service-account@test.gserviceaccount.com";
String keyfile = "/path/to/local/keyfile.p12";

hadoopConfiguration.setBoolean("google.cloud.auth.service.account.enable", true);
hadoopConfiguration.set("google.cloud.auth.service.account.email", serviceAccount);
hadoopConfiguration.set("google.cloud.auth.service.account.keyfile", keyfile);