I am trying to implement passage retrieval on PDF files. For easy navigation I want to include page number and in which passage result was belongs to (mostly passage number). like below:
query: "some query was asked"
results: "one result was displayed"
file_name: "name of file"
source: Page_no-2, passage_no:3
I have couple of pdf files, where we can separate the passage based on some recognizable pattrens. however, I am facing challenge with some pdf files, where no proper pattern was found.
when I open the pdf in chrome there are line gaps between the passages but when I trying read from fitz(pymupdf), no line gap is displayed and every line and every passage separated by just one "\n"
I tries pdfminer,pdftotext, and other libraries but no luck.
My code:
import fitz
from more_itertools import *
doc = fitz.open('IT_past.pdf')
single_doc = doc.load_page(0) # put here the page number
text=single_doc.get_text('text', sort=True)
text
Result: