How to bulk download arXiv PDFs from Amazon S3

Question

This post is a bit long, but I wanted to show the attempts I have made. I'm new to this area, so there may be some seemingly trivial mistakes. I have spent over 12 hours trying to figure out how to download the arXiv PDFs in bulk, and I still find it very confusing.

Bulk PDF access is described at https://info.arxiv.org/help/bulk_data/index.html

It lists three methods:

1. AWS for PDF and/or (La)TeX source files
2. Kaggle for PDF
3. Crawling export service

Issues with method 2: Kaggle itself does not host the PDF files, only the metadata, which was useful. Under "Bulk access" on https://www.kaggle.com/datasets/Cornell-University/arxiv, it lists the code to access Google Cloud.

I searched for arxiv-dataset on the Google Cloud website, copied gs://arxiv-dataset/arxiv/, and ran gsutil cp gs://arxiv-dataset/arxiv/ in Cloud Shell, which didn't work. I then tried from google.cloud import storage, but conda install -c conda-forge google-cloud-sdk and conda install -c conda-forge google-cloud did not work, pip install google-cloud did nothing, and the library could not be used. I was able to run conda install -c conda-forge gsutil, but following the suggestion in "How to run Google gsutil using Python" didn't work either.

Issue with method 3: I tried downloading the files directly from the website addresses, as suggested on Kaggle. That gave me some 500 error messages and then a Python socket.error: [Errno 104] Connection reset by peer, even though I limited bursts to a maximum of 4 requests per second, as indicated on the website.

Eventually I read https://arxiv.org/robots.txt, which says that one article can be downloaded every 15 seconds on a continuous basis. That effectively turned the bulk download of the entire arXiv PDF collection from an option into a necessity.
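To make the 15-second limit concrete, here is a minimal sketch of a throttle for the crawling approach, together with the arithmetic for how long a full crawl would take. The class and function names are hypothetical, and the 1,000,000-file figure in the usage note is only an illustrative number, not a count taken from arXiv.

```python
import time


class Throttle:
    """Allow at most one call per `min_interval` seconds."""

    def __init__(self, min_interval=15.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None        # time of the previous request, if any

    def wait(self):
        """Block until the next request is allowed, then record its time."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def crawl_days(n_files, min_interval=15.0):
    """Days needed to fetch n_files at one request per min_interval seconds."""
    return n_files * min_interval / 86400.0
```

Calling `throttle.wait()` before each request keeps a crawler under the limit. With these assumptions, `crawl_days(1_000_000)` comes out at roughly 174 days, which is why crawling does not look practical for the whole collection.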

Issue with method 1: I then tried to download the files from Amazon S3. I thought it was supposed to be easy, like the cloud, but it was not. The "Requester Pays buckets" on https://info.arxiv.org/help/bulk_data_s3.html finally made sense. After registering for an AWS account, I tried the following:

a. https://docs.aws.amazon.com/AmazonS3/latest/userguide/RequesterPaysBuckets.html, which I found by googling, did not explain how to use it, and neither did https://s3.console.aws.amazon.com/s3/get-started?region=us-east-1&region=us-east-1 . However, under Common tasks I found the "Download an object" tab. This led to https://docs.aws.amazon.com/AmazonS3/latest/userguide/accessing-an-object.html, titled "Step 3: Download an object", which pointed to the Amazon S3 console at https://console.aws.amazon.com/s3/, but that just redirected back to https://console.aws.amazon.com/s3/get-started?region=us-east-2.

b. After clicking around for a while, I found CloudShell. I typed arxiv and arxiv-src into both the search box in the upper left corner and CloudShell, and I tried some commands from the arXiv website, but they didn't work.

c. One of the links led to https://aws.amazon.com/s3/, which did nothing but lead back to https://s3.console.aws.amazon.com/s3/buckets?region=us-east-2. This time, however, I found the Amazon S3 side panel in the upper left corner, which had a tab called "Buckets". I clicked Buckets, which had a "Find buckets by name" field, so I typed arxiv, arxiv-src, arxiv_src, arXiv, arXiv-src, arXiv_src, etc. Nothing showed up.

d. Then I found https://docs.aws.amazon.com/AmazonS3/latest/userguide/download-objects.html. None of the SDKs worked with Python. I don't have access to the AWS CLI, and I had already tried the S3 console. The number of files to download is large and not fixed for any given period, so it's important to write a script that gathers the list of files and then downloads them together.

e. https://docs.aws.amazon.com/AmazonS3/latest/userguide/ObjectsinRequesterPaysBuckets.html explained Requester Pays buckets but didn't say much about how to use them. It did state that x-amz-request-payer has to be specified somewhere in some methods. However, in "How can I download an S3 object in a Requester Pays bucket using the AWS SDK for .NET?", x-amz-request-payer was nowhere to be found.

f. Obviously, I'm not the only one struggling with this. "How to bulk download from arXiv api only for a specific field?" has been unanswered for almost 4 years, and so has "Need help in downloading PDFs from arxiv dataset which is in kaggle".

g. I also learned about boto3, which can be used with Python: https://boto3.amazonaws.com/v1/documentation/api/latest/index.html. This seemed promising, but on https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-examples.html I didn't see x-amz-request-payer or any way to log in or authenticate, so the examples did not work for what I needed.

h. https://github.com/mattbierbaum/arxiv-public-datasets/tree/a91018a0c0647ceea6fb6b2426519cebd5f0effc lists a script, https://github.com/mattbierbaum/arxiv-public-datasets/blob/master/bin/pdfdownload.py, that downloads the bulk files using https://github.com/mattbierbaum/arxiv-public-datasets/blob/a91018a0c0647ceea6fb6b2426519cebd5f0effc/arxiv_public_data/s3_bulk_download.py. However, the previous issue persisted: the code sets HEADERS = {'x-amz-request-payer': 'requester'}. But where can I get that requester value?
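On the requester value: as far as the S3 Requester Pays documentation describes it, `requester` is a fixed literal string, not something issued per account. It only acknowledges that the caller agrees to pay for the transfer. The request itself still has to be signed with the caller's own AWS credentials (for example an IAM access key set up through `aws configure` or the `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` environment variables), since Requester Pays buckets do not accept anonymous requests. A minimal sketch of the two equivalent spellings, with hypothetical helper names:

```python
# The value is always the literal string "requester"; it is not a secret
# and is not tied to an account.
REQUEST_PAYER = "requester"


def raw_headers():
    """Header form, for code that builds signed HTTP requests by hand."""
    return {"x-amz-request-payer": REQUEST_PAYER}


def get_object_kwargs(bucket, key):
    """Keyword arguments for boto3's S3 client get_object call.

    boto3 exposes the same header as the RequestPayer parameter.
    """
    return {"Bucket": bucket, "Key": key, "RequestPayer": REQUEST_PAYER}


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   s3 = boto3.client("s3")
#   resp = s3.get_object(**get_object_kwargs("arxiv", "src/arXiv_src_manifest.xml"))
#   manifest_bytes = resp["Body"].read()
```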

Question:

1. How can I find the list of files in the arXiv Requester Pays bucket, such as

src/arXiv_src_manifest.xml (s3://arxiv/src/arXiv_src_manifest.xml)

2. How can I download the files in the arXiv Requester Pays bucket, such as

src/arXiv_src_1001_001.tar (s3://arxiv/src/arXiv_src_1001_001.tar in s3cmd URI style)
src/arXiv_src_1001_002.tar (s3://arxiv/src/arXiv_src_1001_002.tar)
src/arXiv_src_1001_003.tar (s3://arxiv/src/arXiv_src_1001_003.tar)
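For reference, both steps might look roughly like the following with boto3. This is a sketch under assumptions, not a tested recipe against the real bucket: it assumes boto3 is installed, AWS credentials are configured, and the account is willing to pay the transfer charges. The function names are hypothetical; `list_objects_v2`, `get_paginator`, and `download_file` are the boto3 S3 client calls being relied on.

```python
import os


def list_keys(s3, bucket="arxiv", prefix="src/"):
    """Yield (key, size_in_bytes) for every object under `prefix`.

    `s3` is a boto3 S3 client. The paginator follows continuation tokens,
    so listings longer than one page (1000 keys) are handled.
    """
    paginator = s3.get_paginator("list_objects_v2")
    pages = paginator.paginate(Bucket=bucket, Prefix=prefix, RequestPayer="requester")
    for page in pages:
        for obj in page.get("Contents", []):
            yield obj["Key"], obj["Size"]


def download(s3, key, dest_dir=".", bucket="arxiv"):
    """Download one object into dest_dir (which must exist); return the local path."""
    dest = os.path.join(dest_dir, os.path.basename(key))
    s3.download_file(bucket, key, dest, ExtraArgs={"RequestPayer": "requester"})
    return dest


# Usage, assuming configured credentials:
#   import boto3
#   s3 = boto3.client("s3")
#   for key, size in list_keys(s3, prefix="src/"):
#       print(key, size)
#   download(s3, "src/arXiv_src_manifest.xml")
```

Passing the client in as an argument keeps the helpers easy to try out against a stub before any billable request is made.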

This should take only very simple Python code and functions, possibly with boto3. Also, how do I get the x-amz-request-payer value?

@AnonCoward I got the access key from https://console.aws.amazon.com/iam/ and ran `aws s3 sync s3://arxiv/pdf/ . --request-payer` but got `warning: Skipping file C:\Users\.....\Application Data. File/Directory is not readable.` How do I get the list of files in the cloud and then download a selected file? - ShoutOutAndCalculate, May 15 '23 at 04:08
@AnonCoward Thank you, but there are some issues. 1. Why can't I find s3://arxiv/ on the Amazon S3 console? That is, how do I know it's there, even with the AWS CLI? 2. The entire s3://arxiv/ is estimated at over 2 TB, with over 4000 files of around 500 MB each. I need to know the list of files and folders inside the bucket so that it can be downloaded to perhaps two SSD drives. So I need to a. see what's inside the bucket and its folders, b. download them individually, and c. be sure that the download will succeed, because downloading it once costs $180. - ShoutOutAndCalculate, May 15 '23 at 04:18
@AnonCoward I just want to see the bucket and the files and folders in it, and download them. If the AWS CLI works, then I'd like to see the code to 1. list the files and subfolders (which I assume could be done recursively?) and 2. download a folder or a file in the bucket to a local folder. I think you were trying to explain 2, but could you show me how to do 1 as well? This is almost day 0 for me with the AWS interface. Thank you. - ShoutOutAndCalculate, May 15 '23 at 04:51
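One way to plan the selective download and the integrity check raised in the comments would be to work from the manifest file first, since it is a single small object. The sketch below assumes the manifest is a flat list of `<file>` elements carrying `<filename>`, `<size>` (bytes), `<md5sum>`, and `<yymm>` children, and that `md5sum` is the MD5 of the tar file itself; those element names and meanings are assumptions to check against the real arXiv_src_manifest.xml. All function names are hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET


def parse_manifest(xml_text):
    """Return one dict per <file> entry (element names are assumed, see above)."""
    root = ET.fromstring(xml_text)
    entries = []
    for node in root.iter("file"):
        entries.append({
            "filename": node.findtext("filename"),
            "size": int(node.findtext("size")),
            "md5sum": node.findtext("md5sum"),
            "yymm": node.findtext("yymm"),
        })
    return entries


def select(entries, yymm):
    """Keep only the archives for one year-month, e.g. '1001'."""
    return [e for e in entries if e["yymm"] == yymm]


def total_gib(entries):
    """Total size of the selected archives in GiB, to estimate disk space and cost."""
    return sum(e["size"] for e in entries) / 2**30


def md5_matches(path, expected_md5, chunk_size=1 << 20):
    """Check a downloaded file against the manifest checksum, reading in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5
```

Summing sizes before downloading gives a figure to split across drives, and checking each archive after download means only a failed file would need to be fetched (and paid for) again.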
