how to extract one particular paragraph in .pdf using python

Question

Is there a way to read one particular paragraph from a PDF file in Python 3, such as the Abstract shown in the attached image? The PDF may contain many pages and other content; all I want to read is the Abstract.

By PDF files I mean a set of research papers. All I want is to extract the Abstract only. - Anvesh, Sep 08 '21 at 10:38

Ritwik Bandyopadhyay · Answer 1 · 2021-09-08T11:37:49.190

For Python 3, a good option is PyPDF2. Install it using pip: pip install PyPDF2

Then try this to extract the text of the first page of your PDF as a string:

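import PyPDF2

# Open in binary mode; the context manager closes the file afterwards.
with open('filepath.pdf', 'rb') as pdf_file:
    # Names below are for PyPDF2 >= 3.0; older releases used
    # PdfFileReader, getPage(0) and extractText() instead.
    reader = PyPDF2.PdfReader(pdf_file)
    page = reader.pages[0]  # first page (index 0)
    text = page.extract_text()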

After this, you have to look for a pattern or a specific sequence of whitespace that marks the start and end of the paragraph you want. If you are unsure how to determine that pattern, you can post some of the strings you get from your PDFs as a separate question, or edit this one.

One thing I can suggest based on the image shared is:

firstpara = text.split("ABSTRACT")[1].split("1. INTRODUCTION")[0]

I am, however, not sure whether it will work for all your PDFs. For Python 2, refer to this answer from David Crow and Felipe Augusto.
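If the headings vary between papers (for example "Abstract", "ABSTRACT:" or "Introduction" without a number), a regular expression may be more forgiving than the plain split above. The following is only a sketch: extract_section and its default patterns are hypothetical names chosen for illustration, and it assumes the abstract is followed by a heading containing the word "Introduction".

```python
import re

def extract_section(text, start=r"abstract", end=r"(?:1\.?\s*)?introduction"):
    """Return the text between the first `start` heading and the next `end` heading.

    Both arguments are regex fragments (assumed defaults, adjust per paper).
    Returns None when the start heading is not found.
    """
    pattern = re.compile(
        rf"\b{start}\b[\s:.\-]*(.*?)(?=\b{end}\b|\Z)",
        flags=re.IGNORECASE | re.DOTALL,
    )
    match = pattern.search(text)
    if match is None:
        return None
    # Collapse line breaks and repeated spaces left over from PDF extraction.
    return " ".join(match.group(1).split())

# Stand-in for the string returned by the PDF text extraction step.
sample = "Title\nABSTRACT\nWe study foo.\nBar baz.\n1. INTRODUCTION\nText"
print(extract_section(sample))
```

Pass the text extracted from the first page as the text argument. Note that the match stops at the first occurrence of the word "introduction", so an abstract that uses that word itself would be cut short; in that case a stricter end pattern is needed.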
