I am trying to implement passage retrieval on PDF files. For easy navigation, I want to include the page number and the passage the result belongs to (mostly the passage number), like below:

query: "some query was asked"
results: "one result was displayed"
file_name: "name of file"
source: Page_no-2, passage_no:3

I have a couple of PDF files where the passages can be separated based on some recognizable patterns. However, I am facing a challenge with some PDF files where no proper pattern can be found.

When I open the PDF in Chrome there are line gaps between the passages, but when I try to read it with fitz (PyMuPDF), no line gap is shown and every line and every passage is separated by just one "\n".

I tried pdfminer, pdftotext, and other libraries, but no luck.

My code:

import fitz  # PyMuPDF

doc = fitz.open('IT_past.pdf')
single_doc = doc.load_page(0)  # put here the page number
# plain-text extraction, sorted into reading order
text = single_doc.get_text('text', sort=True)
text

Result:

Screenshot of the page - Full PDF
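
Since the gaps are visible in the rendered page but not in the plain text, one direction I am considering is to use the line positions instead of the "\n" characters. Below is a minimal, untested-on-my-files sketch of that idea. It assumes a single-column layout and that a passage break shows up as a line-to-line distance clearly larger than the usual one. The helper names `page_lines` and `split_passages` and the `gap_factor` threshold are my own placeholders, not part of PyMuPDF; the only library call used is `page.get_text("dict")`.

```python
from statistics import median


def page_lines(page):
    """Collect (y0, y1, text) for every text line on a PyMuPDF page."""
    lines = []
    for block in page.get_text("dict")["blocks"]:
        if block["type"] != 0:  # skip image blocks, keep text blocks
            continue
        for line in block["lines"]:
            text = "".join(span["text"] for span in line["spans"]).strip()
            if text:
                _, y0, _, y1 = line["bbox"]
                lines.append((y0, y1, text))
    # assumption: single column, so top-to-bottom order is reading order
    lines.sort(key=lambda item: item[0])
    return lines


def split_passages(lines, gap_factor=1.5):
    """Group (y0, y1, text) lines into passages on unusually large gaps.

    A new passage starts when the distance between the tops of two
    consecutive lines exceeds gap_factor times the median distance.
    """
    if not lines:
        return []
    if len(lines) < 2:
        return [lines[0][2]]
    pitches = [cur[0] - prev[0] for prev, cur in zip(lines, lines[1:])]
    threshold = gap_factor * median(pitches)
    passages = [[lines[0][2]]]
    for pitch, cur in zip(pitches, lines[1:]):
        if pitch > threshold:
            passages.append([])  # large vertical gap: start a new passage
        passages[-1].append(cur[2])
    return [" ".join(parts) for parts in passages]
```

With the page object from my code above this would be called as `split_passages(page_lines(single_doc))`, and enumerating the returned list starting at 1 would give the passage_no for the "source" field. The threshold would probably need tuning per document.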
