I am working on an NLP project that deals with full-text research papers. I have a list of DOIs, and I want to store the full text of all these papers in one .txt file. Currently I download the PDFs from Sci-Hub and then extract the text from them, but this is very slow, especially when I have a lot of papers. Are there better alternatives?
At a high level, this is how I get the text for one paper:
!python -m PyPaperBot --doi="10.2196/29324" --dwn-dir="C:\"
and then
import fitz  # PyMuPDF

# Concatenate the text of every page into one string
with fitz.open("sample.pdf") as doc:
    text = ""
    for page in doc:
        text += page.get_text()
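For context, here is a condensed sketch of my whole current pipeline (assuming PyPaperBot and PyMuPDF are installed; the helper names and the download directory are just placeholders I made up for this question):

```python
import subprocess
from pathlib import Path


def build_download_cmd(doi: str, dwn_dir: str) -> list[str]:
    # Mirrors the PyPaperBot invocation shown above (hypothetical wrapper)
    return ["python", "-m", "PyPaperBot", f"--doi={doi}", f"--dwn-dir={dwn_dir}"]


def extract_text(pdf_path: Path) -> str:
    # Imported lazily so the command builder is usable without PyMuPDF
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        return "".join(page.get_text() for page in doc)


if __name__ == "__main__":
    dois = ["10.2196/29324"]  # my real list is much longer
    dwn_dir = Path("papers")
    dwn_dir.mkdir(exist_ok=True)

    with open("all_papers.txt", "w", encoding="utf-8") as out:
        for doi in dois:
            # Each download can take a long time; this is the bottleneck
            subprocess.run(build_download_cmd(doi, str(dwn_dir)), check=True)
        for pdf in dwn_dir.glob("*.pdf"):
            out.write(extract_text(pdf) + "\n")
```

The download step dominates the runtime; the fitz extraction itself is fast.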
I took a look at some related questions (Extract abstract / full text from scientific literature given DOI or Title), but they seem very outdated. I've also looked into PubMed Central, but its database of research papers is smaller than Sci-Hub's.