
How can I read data from a PDF section by section, using Python or any other language or tool? I want to write code that returns the content of a given section of the PDF when I type that section's heading. I also want to extract any images that the section contains.

I tried pdfminer to extract text from PDFs, but sometimes it splits the text line by line instead of by paragraph or section. Why are some PDFs extracted properly while others are not?

from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LAParams, LTFigure, LTTextBoxHorizontal
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

ans = []
count = 0
# Create the resource manager.
rsrcmgr = PDFResourceManager()
# Set parameters for layout analysis.
laparams = LAParams()
# Create a PDF page aggregator object.
device = PDFPageAggregator(rsrcmgr, laparams=laparams)
interpreter = PDFPageInterpreter(rsrcmgr, device)
with open('08_chapter 2.pdf', 'rb') as document:
    for page in PDFPage.get_pages(document):
        count += 1
        interpreter.process_page(page)
        print("Page Count=", count)
        # Receive the LTPage object for the page.
        layout = device.get_result()
        for element in layout:
            if isinstance(element, LTTextBoxHorizontal):
                # Remove line breaks and non-breaking spaces from the text box.
                a = element.get_text()
                a = a.replace('\n', "").replace('\xa0', "").strip()
                # Skip boxes that are empty or contain only whitespace.
                if not a:
                    continue
                ans.append(a)
            elif isinstance(element, LTFigure):
                # Figures have no text here; store their transformation matrix.
                ans.append(element.matrix)
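
To make the goal concrete, here is a minimal sketch of the section lookup described above. It is not pdfminer functionality: `split_into_sections` and `normalize` are hypothetical helpers, and the sketch assumes that the section headings are known in advance and that each heading arrives as its own text block in a list like `ans`. Non-string entries (such as the figure matrices collected above) are skipped, so image extraction is not covered.

```python
from typing import Dict, Iterable, List


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so heading matching is forgiving."""
    return " ".join(text.split()).lower()


def split_into_sections(blocks: Iterable, headings: Iterable[str]) -> Dict[str, List[str]]:
    """Group text blocks under the most recent known heading.

    Blocks that appear before the first heading are dropped.
    """
    wanted = {normalize(h): h for h in headings}
    sections: Dict[str, List[str]] = {h: [] for h in wanted.values()}
    current = None
    for block in blocks:
        if not isinstance(block, str):
            continue  # skip non-text entries, e.g. figure matrices
        key = normalize(block)
        if key in wanted:
            current = wanted[key]  # a new section starts here
        elif current is not None:
            sections[current].append(block)
    return sections


# Sample data standing in for the extracted blocks (made up for illustration).
blocks = [
    "2.1 Introduction",
    "First paragraph.",
    "Second paragraph.",
    "2.2 Related Work",
    "Prior studies.",
]
headings = ["2.1 Introduction", "2.2 Related Work"]
sections = split_into_sections(blocks, headings)
print(sections["2.1 Introduction"])
```

With real PDFs the hard part is the assumption itself: if the extractor merges a heading into the following paragraph, or splits it across several blocks, exact matching like this will miss it.
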
Yash Sharma
