Need help downloading PDFs from the arXiv dataset on Kaggle

Question

I am looking for help downloading PDFs from the arXiv dataset, which is available on Kaggle, to my local file system. Ideally, I would like code that lets the user enter a subject name and the number of PDFs to download. Any suggestions or resources would be greatly appreciated. Here is the link to the arXiv dataset on Kaggle: https://www.kaggle.com/datasets/1b6883fb66c5e7f67c697c2547022cc04c9ee98c3742f9a4d6c671b4f4eda591

I have tried scraping the arXiv website to download PDFs. I was able to scrape all the data and even download some of the PDFs, but not reliably. For example, when I entered "Machine Learning" as the subject name and 200 as the number of PDFs to download, only 16 PDFs downloaded properly. The rest were either 0 KB or 3 KB in size, even though the same PDFs are around 4 MB on the website.
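Files of 0 KB or 3 KB are probably empty responses or small error pages saved under a `.pdf` name rather than real PDFs. A minimal sketch of how a downloader could detect this and slow itself down is shown here. The function names, the 1024-byte minimum, the 3-second delay, and the User-Agent string are all assumptions chosen for illustration, not values taken from arXiv or Kaggle.

```python
import time
import urllib.request
from pathlib import Path

PDF_MAGIC = b"%PDF-"  # a valid PDF file begins with these bytes


def looks_like_pdf(data: bytes, min_bytes: int = 1024) -> bool:
    """Return True if data starts with the PDF header and is not suspiciously small.

    min_bytes is an assumed threshold, not an official limit.
    """
    return data.startswith(PDF_MAGIC) and len(data) >= min_bytes


def download_pdfs(urls, out_dir="pdfs", delay_seconds=3.0):
    """Download each URL, save only responses that look like PDFs, pause between requests."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        # Hypothetical User-Agent; replace it with one that identifies your script.
        request = urllib.request.Request(url, headers={"User-Agent": "pdf-fetch-sketch/0.1"})
        try:
            with urllib.request.urlopen(request, timeout=60) as response:
                data = response.read()
        except OSError as err:  # URLError, HTTPError and timeouts are OSError subclasses
            print(f"failed: {url} ({err})")
            data = b""
        if looks_like_pdf(data):
            name = url.rstrip("/").split("/")[-1]
            if not name.endswith(".pdf"):
                name += ".pdf"
            target = out / name
            target.write_bytes(data)
            saved.append(target)
        elif data:
            print(f"skipped (not a PDF, {len(data)} bytes): {url}")
        time.sleep(delay_seconds)  # assumed polite delay between requests
    return saved
```

Calling `download_pdfs(list_of_urls)` returns the paths of the files that passed the check, so the count of real PDFs can be compared with the number requested.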

While researching this problem, I found that bulk access through the arXiv website is not possible, because the site detects automated requests and blocks them to avoid overloading its servers. With so many users accessing the site at once, it does not allow downloading in bulk. As an alternative, I found the arXiv dataset on Kaggle, which mirrors the main site and is updated by arXiv itself once a week. This dataset can be used to download the PDFs.
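To make the goal concrete, here is a rough sketch of the selection step: pick the first N paper IDs in a given subject from the dataset's metadata. It assumes the metadata is a JSON Lines file in which each record has an `id` field and a space-separated `categories` field holding arXiv category codes (for example `cs.LG` for Machine Learning). The sample records are made up for illustration, and mapping each ID to the location of its PDF is left out because it depends on where the dataset hosts the files.

```python
import json
from itertools import islice


def iter_matching_ids(lines, category):
    """Yield the id of every record whose categories include the given code."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = json.loads(line)
        # Assumed format: categories is a space-separated string such as "cs.LG stat.ML".
        if category in record.get("categories", "").split():
            yield record["id"]


def select_ids(lines, category, limit):
    """Return at most `limit` matching ids, stopping as soon as enough are found."""
    return list(islice(iter_matching_ids(lines, category), limit))


# Made-up records standing in for lines of the metadata file.
sample = [
    '{"id": "0001.0001", "categories": "cs.LG stat.ML"}',
    '{"id": "0001.0002", "categories": "hep-th"}',
    '{"id": "0001.0003", "categories": "cs.LG"}',
]
print(select_ids(sample, "cs.LG", 2))  # ['0001.0001', '0001.0003']
```

Because `select_ids` accepts any iterable of lines, an open file handle can be passed in directly, so a large metadata file is read line by line instead of being loaded into memory at once.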

0 Answers