
I am working on an NLP project that deals with full-text research papers. I have a list of DOIs and I want to store the text of all of these papers in one .txt file. Currently I download the PDFs from Sci-Hub and then extract the text from each PDF, but this is very slow, especially when I have a lot of papers. Are there better alternatives?

At a high level, this is how I get the text for one paper:

!python -m PyPaperBot --doi="10.2196/29324" --dwn-dir="C:\"

and then

import fitz  # PyMuPDF

with fitz.open("sample.pdf") as doc:
    text = ""
    for page in doc:
        text += page.get_text()  # concatenate the text of every page
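
For completeness, this is roughly how I run the whole loop over my DOI list and collect everything into one .txt file (the DOI list, download directory, and output file name below are placeholders; the download step just runs the same PyPaperBot command through subprocess):

import glob
import os
import subprocess

import fitz  # PyMuPDF

dois = ["10.2196/29324"]          # placeholder DOI list
download_dir = r"C:\papers"       # placeholder download directory
output_path = "all_papers.txt"    # single combined output file

# Step 1: download one PDF per DOI with PyPaperBot.
for doi in dois:
    subprocess.run(
        ["python", "-m", "PyPaperBot", f"--doi={doi}", f"--dwn-dir={download_dir}"],
        check=False,  # a failed download should not abort the whole batch
    )

# Step 2: extract the text of every downloaded PDF and append it to one file.
with open(output_path, "w", encoding="utf-8") as out:
    for pdf_path in glob.glob(os.path.join(download_dir, "*.pdf")):
        with fitz.open(pdf_path) as doc:
            out.write("".join(page.get_text() for page in doc) + "\n")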

I took a look at some related questions (Extract abstract / full text from scientific literature given DOI or Title), but they seem very outdated. I've also looked into PubMed Central, but it has a smaller database of research papers than Sci-Hub.

  • I would use a cloud server of some sort and do that work in parallel. This way you can speed it up as much as possible. And it won't cost that much, I suppose. – Kota Mori Apr 14 '22 at 01:00
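
Following that suggestion, a minimal sketch of parallelising just the text-extraction step across CPU cores with a process pool (assuming the PDFs have already been downloaded; the directory and output names are placeholders):

import glob
import os
from concurrent.futures import ProcessPoolExecutor

import fitz  # PyMuPDF

download_dir = r"C:\papers"     # placeholder: folder holding the downloaded PDFs
output_path = "all_papers.txt"  # placeholder: single combined output file

def extract_text(pdf_path):
    # Extract the text of every page in one PDF.
    with fitz.open(pdf_path) as doc:
        return "".join(page.get_text() for page in doc)

if __name__ == "__main__":
    pdf_paths = glob.glob(os.path.join(download_dir, "*.pdf"))
    # One worker process per CPU core, each extracting a share of the PDFs.
    with ProcessPoolExecutor() as pool:
        texts = pool.map(extract_text, pdf_paths)
    with open(output_path, "w", encoding="utf-8") as out:
        for text in texts:
            out.write(text + "\n")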
