Is there a way to read one particular paragraph from a PDF file in Python 3, such as the Abstract in the attached image? The PDF may have many pages and other content; all I want to read is the Abstract.

Anvesh
By PDF files I mean a set of research papers. All I want is to extract the Abstract only. – Anvesh Sep 08 '21 at 10:38
1 Answer
For Python 3, one straightforward option is PyPDF2. Install it using pip:
pip install PyPDF2
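Then extract the text of the first page of your PDF as a string:
import PyPDF2

# Names below are from PyPDF2 2.x and later; older releases used
# PdfFileReader, getPage() and extractText() instead.
with open('filepath.pdf', 'rb') as pdf_file:  # binary mode; closed on exit
    reader = PyPDF2.PdfReader(pdf_file)
    # Page 0 is the first page, where the Abstract normally appears
    text = reader.pages[0].extract_text()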
After this you have to look for a pattern, or a specific sequence of whitespace, that marks the start and end of the Abstract in the extracted text. If you are unsure how to determine that pattern, you can post some of the strings that you get from your PDFs as a separate question, or edit this one.
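To find such a pattern, it can help to look at the raw string with its whitespace made visible. The snippet below is only a small sketch: show_raw and its limit parameter are hypothetical names introduced here, and the sample string stands in for the text returned by the extraction step.

```python
def show_raw(text: str, limit: int = 500) -> str:
    """Return the first `limit` characters with escapes such as \\n visible."""
    return repr(text[:limit])


# Stand-in sample; in practice pass the string extracted from the PDF page.
sample = "ABSTRACT\nWe study ...\n1. INTRODUCTION\n"
print(show_raw(sample))
```

Line breaks and runs of spaces around the headings then show up explicitly, which makes it easier to choose the delimiters to split on.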
One thing I can suggest based on the image shared is:
firstpara = text.split("ABSTRACT")[1].split("1. INTRODUCTION")[0]
I am, however, not sure whether it will work for all your PDFs. For Python 2, refer to this answer from David Crow and Felipe Augusto.
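The plain split raises an IndexError when a paper has no "ABSTRACT" heading in exactly that spelling. A slightly more tolerant variant, as a sketch only, is shown below. The function name extract_abstract and both regular expressions are assumptions made for illustration, not something taken from your PDFs; they match the headings case-insensitively and return None when no Abstract heading is found.

```python
import re
from typing import Optional

# Hypothetical heading patterns; adjust them to the papers at hand.
ABSTRACT_RE = re.compile(r"\babstract\b[\s:.\-]*", re.IGNORECASE)
INTRO_RE = re.compile(
    r"^\s*(?:1\.?|I\.)?\s*introduction\b", re.IGNORECASE | re.MULTILINE
)


def extract_abstract(text: str) -> Optional[str]:
    """Return the text between the Abstract and Introduction headings."""
    start = ABSTRACT_RE.search(text)
    if start is None:
        return None  # no Abstract heading on this page
    rest = text[start.end():]
    end = INTRO_RE.search(rest)
    # Without an Introduction heading, keep everything after "Abstract".
    body = rest[:end.start()] if end else rest
    # Collapse line breaks and repeated spaces left over from extraction.
    return " ".join(body.split())


sample = "Title\nABSTRACT\nWe study PDF parsing.\nIt works.\n1. INTRODUCTION\nPDFs are hard."
print(extract_abstract(sample))
```

This is still a heuristic: the word "abstract" in a title, or a line in the Abstract that starts with "Introduction", would cut the text in the wrong place, so check the result on a few of your papers.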

Ritwik Bandyopadhyay