This post is a bit long, but I wanted to show the attempts I have made. I'm new to this area, so there may be some seemingly trivial mistakes. I've been trying to figure out how to download the bulk PDFs from arXiv for over 12 hours, and it has been very confusing.
Bulk PDF access is described at https://info.arxiv.org/help/bulk_data/index.html
It lists three methods:
- AWS for PDF and/or (La)TeX source files
- Kaggle for PDF
- Crawling export service
Issues with method 2: Kaggle itself does not actually host the PDF files, only the metadata, which was useful. Under the bulk access section of https://www.kaggle.com/datasets/Cornell-University/arxiv, it lists the commands for accessing Google Cloud.
I tried searching for `arxiv-dataset` on the Google Cloud website, and copied `gs://arxiv-dataset/arxiv/` and `gsutil cp gs://arxiv-dataset/arxiv/` into Cloud Shell; neither worked. I tried `from google.cloud import storage`, but `conda install -c conda-forge google-cloud-sdk` and `conda install -c conda-forge google-cloud` did not work, and `pip install google-cloud` did nothing, so the library could not be used. I was able to run `conda install -c conda-forge gsutil`; however, following the suggestions in "How to run Google gsutil using Python", it still didn't work.
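For reference, here is a minimal sketch of what the Python route might look like. It assumes that `from google.cloud import storage` is provided by the `google-cloud-storage` package (not `google-cloud`), that the `arxiv-dataset` bucket allows anonymous reads, and that PDFs sit under `arxiv/arxiv/pdf/<yymm>/` as the Kaggle page suggests. The helper names `pdf_prefix` and `download_month` are hypothetical.

```python
from pathlib import Path
from typing import List

BUCKET = "arxiv-dataset"  # bucket name quoted on the Kaggle page


def pdf_prefix(yymm: str) -> str:
    """Object prefix for one month of PDFs (layout is an assumption)."""
    return f"arxiv/arxiv/pdf/{yymm}/"


def download_month(yymm: str, dest: str, limit: int = 5) -> List[str]:
    """Download up to `limit` objects for month `yymm` into `dest`."""
    # Imported lazily so pdf_prefix() works without the package installed.
    from google.cloud import storage  # pip install google-cloud-storage

    # Anonymous client: assumes the bucket is publicly readable.
    client = storage.Client.create_anonymous_client()
    out_dir = Path(dest)
    out_dir.mkdir(parents=True, exist_ok=True)

    saved = []
    blobs = client.list_blobs(BUCKET, prefix=pdf_prefix(yymm), max_results=limit)
    for blob in blobs:
        if not blob.name.endswith(".pdf"):
            continue  # skip anything that is not a PDF
        target = out_dir / Path(blob.name).name
        blob.download_to_filename(str(target))
        saved.append(str(target))
    return saved
```

Calling `download_month("2003", "./pdfs")` would then try to fetch a handful of PDFs from March 2020, if the assumptions above hold.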
Issue with method 3: I tried to download the files directly using the website addresses as suggested on Kaggle. That got me some 500 error messages, and then the error described in "Python handling socket.error: [Errno 104] Connection reset by peer", even though I tried to limit the burst to a maximum of 4 requests per second as indicated on the website.
Eventually, I read in https://arxiv.org/robots.txt that only 1 article may be downloaded every 15 seconds on a continuous basis. That basically turned bulk download of the entire arXiv PDF collection from an option into a necessity.
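To make the arithmetic concrete, here is a small sketch of how long a crawl at one request per 15 seconds would take, together with a throttle that enforces that spacing. The names `crawl_days` and `Throttle` are hypothetical, and the injectable `clock` and `sleep` parameters exist only to make the class easy to test.

```python
import time


def crawl_days(n_papers: int, delay_s: float = 15.0) -> float:
    """Days needed to fetch n_papers at one request per delay_s seconds."""
    return n_papers * delay_s / 86_400  # 86,400 seconds per day


class Throttle:
    """Makes successive wait() calls at least delay_s seconds apart."""

    def __init__(self, delay_s: float = 15.0, clock=time.monotonic, sleep=time.sleep):
        self.delay_s = delay_s
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return seconds slept."""
        slept = 0.0
        if self._last is not None:
            remaining = self.delay_s - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
        self._last = self._clock()
        return slept
```

Calling `Throttle().wait()` before each request keeps a crawler within the limit, and `crawl_days(1_000_000)` shows that a hypothetical one million PDFs would take roughly 174 days at that pace.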
Issue with method 1: I then tried to download the files from Amazon S3. I thought it was supposed to be easy, like the cloud, but it was not. The "Requester Pays buckets" on https://info.arxiv.org/help/bulk_data_s3.html finally made sense. After registering for the Amazon cloud:
a. https://docs.aws.amazon.com/AmazonS3/latest/userguide/RequesterPaysBuckets.html, which I found by googling, did not explain how to use it, and neither did https://s3.console.aws.amazon.com/s3/get-started?region=us-east-1&region=us-east-1. However, under `Common tasks`, I found the `Download an object` tab. This led to https://docs.aws.amazon.com/AmazonS3/latest/userguide/accessing-an-object.html, titled `Step 3: Download an object`, which pointed to the Amazon S3 console at https://console.aws.amazon.com/s3/, but that just redirected back to https://console.aws.amazon.com/s3/get-started?region=us-east-2.
b. After clicking around for a while, I found the cloud shell. I typed `arxiv` and `arxiv-src` into both the `Search` box in the upper left corner and the cloud shell, and tried some commands from the arXiv website; none of them worked.
c. One of the links led to https://aws.amazon.com/s3/, which didn't do anything but lead back to https://s3.console.aws.amazon.com/s3/buckets?region=us-east-2. However, this time, in the upper left corner, I found the Amazon S3 side panel, which had a tab called `Buckets`. I clicked Buckets, and it had a `Find Buckets by name` field, so I typed `arxiv`, `arxiv-src`, `arxiv_src`, `arXiv`, `arXiv-src`, `arXiv_src`, etc. Nothing showed up.
d. Then I found https://docs.aws.amazon.com/AmazonS3/latest/userguide/download-objects.html. None of the SDKs worked with Python. I don't have access to the `AWS CLI`, and I had already tried the `S3 console`. The number of files to download is large and not fixed for any given period of time, so it's important to write a script that gathers the list of files and then downloads them together.
e. https://docs.aws.amazon.com/AmazonS3/latest/userguide/ObjectsinRequesterPaysBuckets.html explained `Requester Pays buckets`, but didn't say much about how to use them. It did state that `x-amz-request-payer` has to be specified somewhere in some methods. However, in "How can I download an S3 object in a Requester Pays bucket using the AWS SDK for .NET?", `x-amz-request-payer` was nowhere to be found.
f. Obviously, I wasn't the only one struggling with this. "How to bulk download from arXiv api only for a specific field?" has been unanswered for almost 4 years, and so has "Need help in downloading PDFs from arxiv dataset which is in kaggle".
g. I also learned that there's something called `boto3`, which can be used with Python: https://boto3.amazonaws.com/v1/documentation/api/latest/index.html. This seemed promising, but on https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-examples.html I didn't see `x-amz-request-payer` or any way to log in/authenticate, so the examples did not work for what I needed.
h. https://github.com/mattbierbaum/arxiv-public-datasets/tree/a91018a0c0647ceea6fb6b2426519cebd5f0effc lists a script, https://github.com/mattbierbaum/arxiv-public-datasets/blob/master/bin/pdfdownload.py, for downloading the bulk files via https://github.com/mattbierbaum/arxiv-public-datasets/blob/a91018a0c0647ceea6fb6b2426519cebd5f0effc/arxiv_public_data/s3_bulk_download.py. However, the previous issue persisted: it uses the `HEADERS = {'x-amz-request-payer': 'requester'}` option. But where can I get that `requester` value?
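On the `requester` question: as far as the S3 Requester Pays documentation describes it, `requester` is the literal string value of the header, not a token to obtain. It only signals that the caller agrees to pay for the transfer. In boto3 the same thing appears to be expressed as `RequestPayer="requester"`. Below is a minimal sketch, assuming AWS credentials are already configured and that the `arxiv` bucket lives in `us-east-1`. The helper names `split_s3_uri` and `download_requester_pays` are hypothetical.

```python
from typing import Tuple

REQUEST_PAYER = "requester"  # literal string; there is no per-user value


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/key' (s3cmd URI style) into (bucket, key)."""
    if not uri.startswith("s3://"):
        raise ValueError(f"not an s3:// URI: {uri!r}")
    bucket, _, key = uri[len("s3://"):].partition("/")
    return bucket, key


def download_requester_pays(uri: str, dest: str) -> str:
    """Download one object from a Requester Pays bucket to `dest`.

    The transfer is billed to the AWS account whose credentials are used.
    """
    import boto3  # pip install boto3; imported lazily

    # boto3 picks up credentials from ~/.aws/credentials or from the
    # AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY environment variables.
    s3 = boto3.client("s3", region_name="us-east-1")  # region is an assumption
    bucket, key = split_s3_uri(uri)
    s3.download_file(bucket, key, dest, ExtraArgs={"RequestPayer": REQUEST_PAYER})
    return dest
```

For example, `download_requester_pays("s3://arxiv/src/arXiv_src_manifest.xml", "arXiv_src_manifest.xml")` would attempt to fetch the manifest. Anonymous requests are expected to be rejected, since someone has to be billed.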
Question:
How can I find the list of files in the `arxiv` `Requester Pays` bucket, as in src/arXiv_src_manifest.xml (s3://arxiv/src/arXiv_src_manifest.xml)?
How can I download the files in the `arxiv` `Requester Pays` bucket, as in src/arXiv_src_1001_001.tar (s3://arxiv/src/arXiv_src_1001_001.tar in s3cmd URI style), src/arXiv_src_1001_002.tar (s3://arxiv/src/arXiv_src_1001_002.tar), and src/arXiv_src_1001_003.tar (s3://arxiv/src/arXiv_src_1001_003.tar)?
It should be very simple code and functions in Python, possibly with boto3. Also, how do I get x-amz-request-payer?
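To illustrate the kind of code I have in mind, here is a rough sketch of the listing step. It makes two assumptions: that the manifest wraps each tar file in a `<file>` element with a `<filename>` child, and that boto3's `list_objects_v2` paginator accepts `RequestPayer="requester"` for this bucket. The function names `manifest_filenames` and `list_keys` are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List


def manifest_filenames(xml_text: str) -> List[str]:
    """Extract every <filename> entry from the manifest XML text."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iter("filename") if el.text]


def list_keys(prefix: str = "src/", bucket: str = "arxiv") -> List[str]:
    """List object keys under `prefix`; requests are billed to the caller."""
    import boto3  # pip install boto3; imported lazily

    s3 = boto3.client("s3")  # credentials are read from the usual AWS config
    paginator = s3.get_paginator("list_objects_v2")
    keys = []
    pages = paginator.paginate(Bucket=bucket, Prefix=prefix, RequestPayer="requester")
    for page in pages:
        # "Contents" is absent when a page has no matching objects.
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys
```

Given the manifest text, `manifest_filenames` would return keys such as `src/arXiv_src_1001_001.tar`, which could then be downloaded one by one with the same `RequestPayer` setting.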