14

I have been struggling for about a week to download arXiv articles as described here: http://arxiv.org/help/bulk_data_s3#src.

I have tried lots of things: S3 Browser, s3cmd. I am able to log in to my buckets, but I am unable to download data from the arXiv bucket.

I tried:

  1. s3cmd get s3://arxiv/pdf/arXiv_pdf_1001_001.tar

See:

$ s3cmd get s3://arxiv/pdf/arXiv_pdf_1001_001.tar


s3://arxiv/pdf/arXiv_pdf_1001_001.tar -> ./arXiv_pdf_1001_001.tar  [1 of 1]
s3://arxiv/pdf/arXiv_pdf_1001_001.tar -> ./arXiv_pdf_1001_001.tar  [1 of 1]
ERROR: S3 error: Unknown error
  2. s3cmd get with x-amz-request-payer:requester

It gave me the same error again:

$ s3cmd get --add-header="x-amz-request-payer:requester" s3://arxiv/pdf/arXiv_pdf_manifest.xml
s3://arxiv/pdf/arXiv_pdf_manifest.xml -> ./arXiv_pdf_manifest.xml  [1 of 1]
s3://arxiv/pdf/arXiv_pdf_manifest.xml -> ./arXiv_pdf_manifest.xml  [1 of 1]
ERROR: S3 error: Unknown error
  3. Copying

I have tried copying files from that folder too.

$ aws s3 cp s3://arxiv/pdf/arXiv_pdf_1001_001.tar .

A client error (403) occurred when calling the HeadObject operation: Forbidden
Completed 1 part(s) with ... file(s) remaining

This probably means that I made a mistake somewhere. The problem is I don't know what to add, or how, to convey my permission to pay for the download.

I cannot figure out what I should do to download data from S3. I have been reading a lot on the AWS sites, but nowhere can I find a pinpointed solution to my problem.

How can I bulk download the arXiv data?

Martin Thoma
pg2455
    I think you need an AWS account and then you need to pass the `x-amz-request-payer` header like you're trying with `s3cmd`. You didn't mention if you have an AWS account – Max Feb 28 '15 at 21:07
  • Hey I have an AWS account and I have all my credit card details there. I have started doubting if that bucket actually exists there. – pg2455 Feb 28 '15 at 23:49

5 Answers

15

Try downloading s3cmd version 1.6.0: http://sourceforge.net/projects/s3tools/files/s3cmd/

$ s3cmd --configure

Enter your credentials, which you can find in the account management tab of the Amazon AWS website interface.

$ s3cmd get --recursive --skip-existing s3://arxiv/src/ --requester-pays
Martin Thoma
  • Thank you, this worked out of the box. About 190 GB are now downloading (according to [arxiv bulk download page](http://arxiv.org/help/bulk_data_s3)) – Martin Thoma Oct 12 '15 at 11:30
4

Requester Pays is a feature on Amazon S3 buckets that requires the user of the bucket to pay Data Transfer costs associated with accessing data.

Normally, the owner of an S3 bucket pays Data Transfer costs, but this can be expensive for free / Open Source projects. The bucket owner can therefore activate Requester Pays to shift those costs to the users who download the data.

Therefore, when accessing a Requester Pays bucket, you will need to authenticate yourself so that S3 knows whom to charge.

I recommend using the official AWS Command-Line Interface (CLI) to access AWS services. You can provide your credentials via:

aws configure

and then view the bucket via:

aws s3 ls s3://arxiv/pdf/

and download via:

aws s3 cp s3://arxiv/pdf/arXiv_pdf_1001_001.tar .
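If those commands are rejected because the bucket uses Requester Pays, recent versions of the AWS CLI let you declare that you will pay the transfer costs. This is a sketch, assuming an awscli version new enough to support the `--request-payer` flag:

```shell
# List the bucket, declaring that we (the requester) pay the transfer costs
aws s3 ls s3://arxiv/pdf/ --request-payer requester

# Download a single tar file the same way, via the low-level s3api command
aws s3api get-object \
    --bucket arxiv \
    --key pdf/arXiv_pdf_1001_001.tar \
    --request-payer requester \
    arXiv_pdf_1001_001.tar
```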

UPDATE: I just tried the above myself, and received Access Denied error messages (both on the bucket listing and the download command). When using s3cmd, it says ERROR: S3 error: Access Denied. It would appear that the permissions on the bucket no longer permit access. You should contact the owners of the bucket to request access.

John Rotenstein
  • Thanks a lot John. This is what I thought after trying lots of things. Thanks for confirming. – pg2455 Mar 07 '15 at 23:41
  • 2
    Latest version of awscli now supports requester pays option for `aws s3` operations. Use as such: `aws s3 ls --request-payer requester s3://arxiv/src/`. If you are using pip you can update the awscli using `sudo pip install -U awscli`. – OttoV Apr 19 '18 at 20:39
  • It is possible to receive an "access denied" error when using the AWS CLI if you omit the `--request-payer requester` param. – 219CID Nov 16 '21 at 19:16
3

At the bottom of this page, arXiv explains that s3cmd gets denied because it does not support accessing a Requester Pays bucket as a non-owner, and that you have to apply a patch to the s3cmd source code. However, the version of s3cmd they used is outdated, and the patch does not apply to the latest version of s3cmd.

Basically you need to allow s3cmd to add the "x-amz-request-payer" header to the HTTP requests it sends to buckets. Here is how to fix it:

  1. Download the source code of s3cmd.
  2. Open S3/S3.py with a text editor.
  3. Add these two lines of code at the bottom of the __init__ function:

    if self.s3.config.extra_headers:
        self.headers.update(self.s3.config.extra_headers)
    
  4. Install s3cmd as instructed.
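Once patched, the extra header can be supplied on the command line. This is a sketch using the manifest file from the question; the header name and value are the ones arXiv's bulk-data page calls for:

```shell
# Tell S3 that we accept the transfer charges via an extra request header
s3cmd get --add-header="x-amz-request-payer: requester" \
    s3://arxiv/pdf/arXiv_pdf_manifest.xml
```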
POPOL
3

For me the problem was that my IAM user didn't have enough permissions. Setting AmazonS3FullAccess was the solution for me.
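If you prefer the command line, attaching the managed policy looks roughly like this. The user name `my-download-user` is a placeholder for your own IAM user:

```shell
# Attach the AmazonS3FullAccess managed policy to an existing IAM user
# ("my-download-user" is a placeholder for your own user name)
aws iam attach-user-policy \
    --user-name my-download-user \
    --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
```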

Hope it'll save someone some time.

Alan Wagner
  • hi can you help me out how you were able to do that – gaurav singh Jul 25 '20 at 15:07
  • 1
    @gauravsingh Go to "My Security Credentials" with your root user, then policies, and attach that policy to an IAM user you previously created. I can confirm this was the problem for me, together with the `--request-payer` argument – Dzeri96 Aug 08 '21 at 18:59
2

I don't want to steal anyone's thunder, but OttoV's comment gave the command that actually works for me:

aws s3 ls --request-payer requester s3://arxiv/src/

My EC2 instance is in Region us-east-2, but the arXiv S3 buckets are in Region us-east-1, so I think that's why --request-payer requester is needed.

From https://aws.amazon.com/s3/pricing/?nc=sn&loc=4 :

You pay for all bandwidth into and out of Amazon S3, except for the following:

• Data transferred in from the internet.

• Data transferred out to an Amazon Elastic Compute Cloud (Amazon EC2) instance, when the instance is in the same AWS Region as the S3 bucket (including to a different account in the same AWS region).

• Data transferred out to Amazon CloudFront (CloudFront).
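Putting OttoV's flag together with a download command gives something like the following. The archive key name is illustrative only; list the bucket first to see the real keys:

```shell
# List the source bucket, paying the transfer costs ourselves
aws s3 ls --request-payer requester s3://arxiv/src/

# Download one archive the same way (key name is illustrative)
aws s3 cp --request-payer requester \
    s3://arxiv/src/arXiv_src_1001_001.tar .
```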

Star