Unable to separate the passages, as no separation character is being displayed

Question

I am trying to implement passage retrieval on PDF files. For easy navigation, I want each result to include the page number and the passage it belongs to (mostly the passage number), like below:

query: "some query was asked"
results: "one result was displayed"
file_name: "name of file"
source: Page_no-2, passage_no:3

I have a couple of PDF files where the passages can be separated based on some recognizable patterns. However, I am facing a challenge with some PDF files where no proper pattern can be found.

When I open the PDF in Chrome, there are line gaps between the passages, but when I read it with fitz (PyMuPDF), no line gap is shown: every line and every passage is separated by just one "\n".

I tried pdfminer, pdftotext, and other libraries, but no luck.

My code:

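import fitz  # PyMuPDF

doc = fitz.open('IT_past.pdf')
page = doc.load_page(0)  # zero-based page index; put the page number here
# sort=True orders the extracted text top-left to bottom-right
text = page.get_text('text', sort=True)
print(repr(text))  # repr keeps the "\n" separators visible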

Result:

Screenshot of the page - Full PDF
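The blank space visible in a viewer is layout (extra vertical distance between lines), not a character, so plain text extraction cannot show it. One possible approach, sketched below under several assumptions, is to read line positions from PyMuPDF's `get_text("dict")` output and start a new passage whenever the distance between consecutive line tops is clearly larger than the typical line spacing on the page. The helper names (`split_passages`, `page_lines`, `label_passages`) and the `gap_factor` threshold are hypothetical, and the sketch assumes a single-column page with mostly multi-line passages.

```python
from statistics import median


def split_passages(lines, gap_factor=1.5):
    """Group (y0, text) lines, given top to bottom, into passages.

    A new passage starts when the distance from the previous line's top
    exceeds gap_factor times the median distance on the page.
    gap_factor is an illustrative tuning value, not a library setting.
    """
    if not lines:
        return []
    pitches = [cur[0] - prev[0] for prev, cur in zip(lines, lines[1:])]
    typical = median(pitches) if pitches else 0.0
    passages = []
    current = [lines[0][1]]
    for pitch, (_, text) in zip(pitches, lines[1:]):
        if typical > 0 and pitch > gap_factor * typical:
            passages.append(" ".join(current))
            current = []
        current.append(text)
    passages.append(" ".join(current))
    return passages


def page_lines(page):
    """Collect (y0, text) for every non-empty text line on a PyMuPDF page."""
    lines = []
    for block in page.get_text("dict")["blocks"]:
        if block["type"] != 0:  # skip image blocks
            continue
        for line in block["lines"]:
            text = "".join(span["text"] for span in line["spans"]).strip()
            if text:
                lines.append((line["bbox"][1], text))
    lines.sort(key=lambda item: item[0])  # top to bottom
    return lines


def label_passages(pdf_path):
    """Return dicts with 1-based page_no and passage_no for each passage."""
    import fitz  # PyMuPDF, imported lazily so the helpers above work without it

    doc = fitz.open(pdf_path)
    results = []
    for page_no, page in enumerate(doc, start=1):
        passages = split_passages(page_lines(page))
        for passage_no, text in enumerate(passages, start=1):
            results.append(
                {"page_no": page_no, "passage_no": passage_no, "text": text}
            )
    doc.close()
    return results
```

Calling `label_passages('IT_past.pdf')` would give one record per passage, from which a source string such as "Page_no-2, passage_no:3" can be built. The threshold will likely need tuning per document, and a page where every passage is a single line would not be split by this heuristic.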

\n means new line so it is what you see instead of a blank line. — jose_bacoy, Sep 01 '22 at 20:00
But in the given PDF, lines within a passage and two different passages are both separated by just \n. I am confused about how to differentiate them. — Saivenkataraju, Sep 02 '22 at 06:59
I have added another, more precise question here: https://stackoverflow.com/questions/73580435/ways-to-separate-passages-in-pdf-using-gap — Saivenkataraju, Sep 02 '22 at 10:05
