
I want to access Google Cloud Storage via the Hadoop client, from a machine outside of Google Cloud.

I followed the instructions from here: I created a service account and generated a key file, created a core-site.xml file, and downloaded the necessary library.

However, when I try to run a simple hdfs dfs -ls gs://bucket-name command, all I get is this:

Error getting access token from metadata server at: http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token

When I run this inside Google Cloud it works, but when I try to connect to GCS from outside, it shows the error above.

How can I connect to GCS with the Hadoop client in this way? Is it even possible? I have no route to the 169.254.169.254 address.

Here is my core-site.xml (I changed the key path and email in this example):

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>spark.hadoop.google.cloud.auth.service.account.enable</name>
    <value>true</value>
  </property>
  <property>
    <name>spark.hadoop.google.cloud.auth.service.account.json.keyfile</name>
    <value>path/to/key.json</value>
  </property>
  <property>
    <name>fs.gs.project.id</name>
    <value>ringgit-research</value>
    <description>
      Optional. Google Cloud Project ID with access to GCS buckets.
      Required only for list buckets and create bucket operations.
    </description>
  </property>
  <property>
    <name>fs.AbstractFileSystem.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
    <description>The AbstractFileSystem for gs: uris.</description>
  </property>
  <property>
    <name>fs.gs.auth.service.account.email</name>
    <value>myserviceaccountaddress@google</value>
    <description>
      The email address is associated with the service account used for GCS
      access when fs.gs.auth.service.account.enable is true. Required
      when authentication key specified in the Configuration file (Method 1)
      or a PKCS12 certificate (Method 3) is being used.
    </description>
  </property>
</configuration>
Tomasz

2 Answers


It could be that the Hadoop services haven't picked up the updates made in your core-site.xml file yet, so my suggestion is to restart the Hadoop services. Another action you can take is to check the access control options [1].

If you are still having the same issue after taking the suggested actions, please post the complete error message.

[1]https://cloud.google.com/storage/docs/access-control/
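If you want to try the suggested restart and permission check, a minimal sketch could look like this (assuming a standard Hadoop installation with the sbin scripts on the PATH and the Google Cloud SDK installed; bucket-name and path/to/key.json are placeholders):

# Restart HDFS so the updated core-site.xml is re-read
stop-dfs.sh
start-dfs.sh

# Check access control independently of Hadoop: authenticate as the
# service account and try to list the bucket
gcloud auth activate-service-account --key-file=path/to/key.json
gsutil ls gs://bucket-name

If gsutil can list the bucket but Hadoop still cannot, the problem is likely in the Hadoop-side configuration rather than in the bucket's access control.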


The problem was that I tried the wrong authentication method. The method I used assumes that it's running inside Google Cloud and tries to connect to the Google metadata servers. When running outside of Google Cloud, it doesn't work for obvious reasons.

The answer is here: Migrating 50TB data from local Hadoop cluster to Google Cloud Storage; the selected answer there contains the proper core-site.xml.

The property fs.gs.auth.service.account.keyfile should be used instead of spark.hadoop.google.cloud.auth.service.account.json.keyfile. The only difference is that this property takes a P12 key file instead of a JSON one.
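For reference, a minimal sketch of how the authentication part of core-site.xml could look with that method (the property names follow the ones mentioned above; the email address and key path are placeholders, and depending on the connector version there may also be an fs.gs.auth.service.account.json.keyfile property that accepts a JSON key instead):

<property>
  <name>fs.gs.auth.service.account.enable</name>
  <value>true</value>
</property>
<property>
  <!-- Email of the service account whose key is used below (placeholder) -->
  <name>fs.gs.auth.service.account.email</name>
  <value>my-service-account@my-project.iam.gserviceaccount.com</value>
</property>
<property>
  <!-- P12 (not JSON) key file for that service account (placeholder path) -->
  <name>fs.gs.auth.service.account.keyfile</name>
  <value>path/to/key.p12</value>
</property>

With this in place, hdfs dfs -ls gs://bucket-name should authenticate directly with the key file instead of calling the metadata server at 169.254.169.254.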

Tomasz