
I would like to download a public dataset from the NIMH Data Archive. After creating an account on their website and accepting their Data Usage Agreement, I can download a CSV file that contains the paths to all the files in the dataset I am interested in. Each path is of the form s3://NDAR_Central_1/....

1 Download on my personal computer

In the NDA GitHub repository, the nda-tools Python library exposes some useful Python code to download those files to my own computer. Say I want to download the following file:

s3://NDAR_Central_1/submission_13364/00m/0.C.2/9007827/20041006/10263603.tar.gz

Given my username (USRNAME) and password (PASSWD) (the ones I used to create my account on the NIMH Data Archive), the following code allows me to download this file to TARGET_PATH on my personal computer:

from NDATools.clientscripts.downloadcmd import configure
from NDATools.Download import Download

# Build the download configuration with my NDA credentials
config = configure(username=USRNAME, password=PASSWD)
s3Download = Download(TARGET_PATH, config)

target_fnames = ['s3://NDAR_Central_1/submission_13364/00m/0.C.2/9007827/20041006/10263603.tar.gz']

# Register the S3 paths, generate temporary credentials and download the file
s3Download.get_links('paths', target_fnames, filters=None)
s3Download.get_tokens()
s3Download.start_workers(False, None, 1)

Under the hood, the get_tokens method of s3Download uses USRNAME and PASSWD to generate a temporary access key, secret key, and security token. Then, the start_workers method uses the boto3 and s3transfer Python libraries to download the selected file.

Everything works fine!
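For the record, here is a rough sketch (my own, not part of nda-tools) of what this amounts to with boto3 directly, assuming ACCESS_KEY, SECRET_KEY and SECURITY_TOKEN are placeholders holding the temporary credentials mentioned above:

import boto3

# Sketch only: download the same object with boto3, using the temporary
# credentials that get_tokens obtains (placeholder names, not nda-tools API).
s3 = boto3.client(
    's3',
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY,
    aws_session_token=SECURITY_TOKEN,
)
s3.download_file(
    'NDAR_Central_1',
    'submission_13364/00m/0.C.2/9007827/20041006/10263603.tar.gz',
    TARGET_PATH,
)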

2 Download to a GCP bucket

Now, say I created a project on GCP and would like to directly download this file to a GCP bucket.

Ideally, I would like to do something like:

gsutil cp s3://NDAR_Central_1/submission_13364/00m/0.C.2/9007827/20041006/10263603.tar.gz gs://my-bucket

To do this, I execute the following Python code in the Cloud Shell (by running python3):

from NDATools.TokenGenerator import NDATokenGenerator
data_api_url = 'https://nda.nih.gov/DataManager/dataManager'
generator = NDATokenGenerator(data_api_url)
token = generator.generate_token(USRNAME, PASSWD)

This gives me the access key, the secret key, and the session token. In what follows,

  • ACCESS_KEY refers to the value of token.access_key,
  • SECRET_KEY refers to the value of token.secret_key,
  • SECURITY_TOKEN refers to the value of token.session.
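
To get these values out of the Python session so I can copy-paste them, a trivial sketch is to print the token attributes:

# Print the temporary credentials so they can be copy-pasted below
print(token.access_key)   # ACCESS_KEY
print(token.secret_key)   # SECRET_KEY
print(token.session)      # SECURITY_TOKEN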

Then, I set these credentials as environment variables in the Cloud Shell:

export AWS_ACCESS_KEY_ID=[copy-paste ACCESS_KEY here]
export AWS_SECRET_ACCESS_KEY=[copy-paste SECRET_KEY here]
export AWS_SECURITY_TOKEN=[copy-paste SECURITY_TOKEN here]

Finally, I also set up the .boto configuration file in my home directory. It looks like this:

[Credentials]
aws_access_key_id = $AWS_ACCESS_KEY_ID
aws_secret_access_key = $AWS_SECRET_ACCESS_KEY
aws_session_token = $AWS_SECURITY_TOKEN
[s3]
calling_format = boto.s3.connection.OrdinaryCallingFormat
use-sigv4=True
host=s3.us-east-1.amazonaws.com

When I run the following command:

gsutil cp s3://NDAR_Central_1/submission_13364/00m/0.C.2/9007827/20041006/10263603.tar.gz gs://my-bucket

I end up with:

AccessDeniedException: 403 AccessDenied

The full traceback is below:

Non-MD5 etag ("a21a0b2eba27a0a32a26a6b30f3cb060-6") present for key <Key: NDAR_Central_1,submission_13364/00m/0.C.2/9007827/20041006/10263603.tar.gz>, data integrity checks are not possible.
Copying s3://NDAR_Central_1/submission_13364/00m/0.C.2/9007827/20041006/10263603.tar.gz [Content-Type=application/x-gzip]...
Exception in thread Thread-2:
Traceback (most recent call last):
  File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
    self.run()
  File "/usr/lib/python2.7/threading.py", line 754, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/google/google-cloud-sdk/platform/gsutil/gslib/daisy_chain_wrapper.py", line 213, in PerformDownload
    decryption_tuple=self.decryption_tuple)
  File "/google/google-cloud-sdk/platform/gsutil/gslib/cloud_api_delegator.py", line 353, in GetObjectMedia
    decryption_tuple=decryption_tuple)
  File "/google/google-cloud-sdk/platform/gsutil/gslib/boto_translation.py", line 590, in GetObjectMedia
    generation=generation)
  File "/google/google-cloud-sdk/platform/gsutil/gslib/boto_translation.py", line 1723, in _TranslateExceptionAndRaise
    raise translated_exception  # pylint: disable=raising-bad-type
AccessDeniedException: AccessDeniedException: 403 AccessDenied
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>AccessDenied</Code><Message>Access Denied</Message><RequestId>A93DBEA60B68E04D</RequestId><HostId>Z5XqPBmUdq05btXgZ2Tt7HQMzodgal6XxTD6OLQ2sGjbP20AyZ+fVFjbNfOF5+Bdy6RuXGSOzVs=</HostId></Error>

AccessDeniedException: 403 AccessDenied
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>AccessDenied</Code><Message>Access Denied</Message><RequestId>A93DBEA60B68E04D</RequestId><HostId>Z5XqPBmUdq05btXgZ2Tt7HQMzodgal6XxTD6OLQ2sGjbP20AyZ+fVFjbNfOF5+Bdy6RuXGSOzVs=</HostId></Error>

I would like to be able to directly download this file from an S3 bucket to my GCP bucket (without having to create a VM, set up Python, and run the code above [which works]). Why do the generated temporary credentials work on my computer but not in the GCP Cloud Shell?
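To check whether the credentials themselves or gsutil's S3 layer are at fault, a quick sketch I could run from the same Cloud Shell Python session (assuming boto3 is pip-installed; this is my own illustration, not something from the nda-tools documentation):

import boto3

# Sketch: query the object metadata directly with boto3, bypassing gsutil.
# ACCESS_KEY, SECRET_KEY and SECURITY_TOKEN are the values exported above.
s3 = boto3.client(
    's3',
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY,
    aws_session_token=SECURITY_TOKEN,
    region_name='us-east-1',
)
response = s3.head_object(
    Bucket='NDAR_Central_1',
    Key='submission_13364/00m/0.C.2/9007827/20041006/10263603.tar.gz',
)
# Succeeds only if the temporary credentials are valid for this object
print(response['ContentLength'])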

The complete log of the debug command

gsutil -DD cp s3://NDAR_Central_1/submission_13364/00m/0.C.2/9007827/20041006/10263603.tar.gz gs://my-bucket

can be found here.

pitchounet
  • Your .boto file credentials do not include aws_security_token. Is that deliberate? – jarmod Dec 17 '19 at 14:52
  • @jarmod Thank you, I forgot it in my post. It already is in my `.boto` file. – pitchounet Dec 17 '19 at 14:54
  • The docs are a little confusing and I don't use .boto config files. Is it `aws_security_token` or `aws_session_token`? – jarmod Dec 17 '19 at 14:56
  • @jarmod I do not think it really matters since the credentials are set in the environment variables. By the way, the log at the end of my post reads: `DEBUG 1217 15:20:11.074951 provider.py] Using access key found in environment variable. DEBUG 1217 15:20:11.075263 provider.py] Using secret key found in environment variable. DEBUG 1217 15:20:11.075522 provider.py] Using security token found in environment variable.` Therefore, the `.boto` file seems useless for setting S3 credentials here. – pitchounet Dec 17 '19 at 14:59

1 Answer


The procedure you are trying to implement is called a "Transfer Job".

In order to transfer a file from Amazon S3 bucket to a Cloud Storage bucket:

A. Click the burger menu in the top left corner

B. Go to Storage > Transfer

C. Click Create Transfer

  1. Under Select source, select Amazon S3 bucket.

  2. In the Amazon S3 bucket text box, specify the source Amazon S3 bucket name. The bucket name is the name as it appears in the AWS Management Console.

  3. In the respective text boxes, enter the Access key ID and Secret key associated with the Amazon S3 bucket.

  4. To specify a subset of files in your source, click Specify file filters beneath the bucket field. You can include or exclude files based on file name prefix and file age.

  5. Under Select destination, choose a sink bucket or create a new one.

    • To choose an existing bucket, enter the name of the bucket (without the prefix gs://), or click Browse and browse to it.
    • To transfer files to a new bucket, click Browse and then click the New bucket icon.
  6. Enable overwrite/delete options if needed.

    By default, your transfer job only overwrites an object when the source version is different from the sink version. No other objects are overwritten or deleted. Enable additional overwrite/delete options under Transfer options.

  7. Under Configure transfer, schedule your transfer job to Run now (one time) or Run daily at the local time you specify.

  8. Click Create.

Before setting up the Transfer Job, please make sure you have the necessary roles assigned to your account and the required permissions described here.

Also take into consideration that the Storage Transfer Service is currently available only for certain Amazon S3 regions, which are described under the AMAZON S3 tab of the Setting up a transfer job documentation.

Transfer jobs can also be created programmatically. More information here.
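As a rough sketch of the programmatic route (assuming the google-api-python-client package, a project with the Storage Transfer API enabled, and permanent AWS keys, since, as explained in the EDIT below, temporary credentials are not accepted; the placeholder names are mine):

from googleapiclient.discovery import build

# Sketch: create a one-off Storage Transfer Service job from S3 to Cloud Storage.
# MY_PROJECT, MY_AWS_KEY_ID and MY_AWS_SECRET are placeholders.
storagetransfer = build('storagetransfer', 'v1')

transfer_job = {
    'description': 'NDAR_Central_1 to my-bucket',
    'status': 'ENABLED',
    'projectId': MY_PROJECT,
    'schedule': {
        # Identical start and end dates make the job run only once
        'scheduleStartDate': {'day': 1, 'month': 1, 'year': 2020},
        'scheduleEndDate': {'day': 1, 'month': 1, 'year': 2020},
    },
    'transferSpec': {
        'awsS3DataSource': {
            'bucketName': 'NDAR_Central_1',
            'awsAccessKey': {
                'accessKeyId': MY_AWS_KEY_ID,
                'secretAccessKey': MY_AWS_SECRET,
            },
        },
        'gcsDataSink': {'bucketName': 'my-bucket'},
    },
}

result = storagetransfer.transferJobs().create(body=transfer_job).execute()
print(result['name'])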

Let me know if this was helpful.

EDIT

Neither the Transfer Service nor the gsutil command currently supports "Temporary Security Credentials", even though they are supported by AWS. A workaround to do what you want is to change the source code of the gsutil command.

I also filed a Feature Request on your behalf; I suggest you star it in order to get updates on the procedure.
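If patching gsutil is not an option, one alternative sketch (my own illustration, not an officially supported path) is to relay the object through the Cloud Shell's local disk with boto3 and the google-cloud-storage client, both of which can be pip-installed there:

import boto3
from google.cloud import storage

# Sketch: stage the object on local disk, then upload it to Cloud Storage.
# ACCESS_KEY, SECRET_KEY and SECURITY_TOKEN are the temporary NDA credentials
# from the question; 'my-bucket' is the destination bucket.
KEY = 'submission_13364/00m/0.C.2/9007827/20041006/10263603.tar.gz'
LOCAL_PATH = '/tmp/10263603.tar.gz'

s3 = boto3.client(
    's3',
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY,
    aws_session_token=SECURITY_TOKEN,
)
s3.download_file('NDAR_Central_1', KEY, LOCAL_PATH)

gcs = storage.Client()
gcs.bucket('my-bucket').blob(KEY).upload_from_filename(LOCAL_PATH)

Note that this stages the file on the Cloud Shell's limited local disk, so it only works for objects that fit there.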

tzovourn
  • Thank you for your answer. I should have said I already tried this. When I follow the steps you listed above, the transfer job fails with `Failed to obtain the location of the source S3 bucket. Additional details: The AWS Access Key Id you provided does not exist in our records.` It seems that it is a problem with the access key (and not with roles/rights). – pitchounet Dec 17 '19 at 16:15
  • Searching for that error I found [this article](https://aws.amazon.com/premiumsupport/knowledge-center/access-key-does-not-exist/) which might be helpful. I also found this [Topic](https://forums.aws.amazon.com/message.jspa?messageID=771815) in the Amazon Discussion Forum. Finally, take a look at these: [Post 1](https://stackoverflow.com/a/53127182/11928130) and [Post 2](https://stackoverflow.com/a/45703954/11928130). What I understand from the above is that the issue has to do with the temporary credentials you are using. – tzovourn Dec 17 '19 at 16:54
  • Thank you for these links. I also think that the problem must have something to do with the fact that my credentials are _temporary_. Still, as my post suggests, they work on my personal computer. Therefore, I assumed there would be a way of making them work as well on GCP (without having to create a VM to run Python code). Also, from what I understand from the log at the end of my post, my credentials (set in the environment variables) allowed me to connect to the host (`s3.us-east-1.amazonaws.com`), right? – pitchounet Dec 17 '19 at 17:04
  • I work for Google Cloud Platform Support. Please check the EDIT on my answer. Hope this helped. – tzovourn Dec 18 '19 at 16:22