I would like to extract only the strike-out text from a PDF file. The code below works with a sample PDF I have, but it does not work with another PDF, which I think is a scanned one. Is there a standard way to extract only the strike-out text from a PDF file using Python? Any help would be really appreciated.
This is the code I was using:
from typing import Optional, Tuple

from docx import Document
from pdf2docx import parse


def convert_pdf2docx(input_file: str, output_file: str, pages: Optional[Tuple] = None):
    """Converts pdf to docx"""
    if pages:
        # Keep only the numeric page entries and convert them to int
        pages = [int(i) for i in list(pages) if i.isnumeric()]
    parse(pdf_file=input_file,
          docx_with_path=output_file, pages=pages)
    summary = {
        "File": input_file, "Pages": str(pages), "Output File": output_file
    }
    return summary


if __name__ == "__main__":
    pdf_file = 'D:/AWS practice/sample_striken_out.pdf'
    doc_file = 'D:/AWS practice/sample_striken_out.docx'
    convert_pdf2docx(pdf_file, doc_file)
    document = Document(doc_file)
    with open('D:/AWS practice/sample_striken_out.txt', 'w') as f:
        for p in document.paragraphs:
            for run in p.runs:
                # Keep only the runs whose font is marked as struck through
                if run.font.strike:
                    f.write(run.text)
                    print(run.text)
            # One output line per paragraph
            f.write('\n')
Note: I convert the PDF to DOCX first and then try to identify the strike-out text in the DOCX. This works with the sample file but not with the scanned PDF: the PDF-to-DOCX conversion runs, but the strike-through text is not detected.
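For the scanned file, my guess is that the pages are images, so there is no font information that could carry a strike flag. A rough sketch of what I imagine would be needed instead is a geometric test: assuming word bounding boxes from some OCR step and horizontal line segments from some image-analysis step (both steps are assumed here and not shown), a word counts as struck when a line runs through its middle. `Box`, `HLine`, `is_struck` and the thresholds are hypothetical names and values, not part of any library.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """Word bounding box; y grows downward, as in image coordinates."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


class HLine(NamedTuple):
    """Horizontal line segment from x0 to x1 at height y."""
    x0: float
    x1: float
    y: float


def is_struck(word: Box, lines: List[HLine], min_cover: float = 0.6) -> bool:
    """True if some line crosses the middle band of the word box."""
    width = word.x1 - word.x0
    height = word.y1 - word.y0
    if width <= 0 or height <= 0:
        return False
    # Middle half of the box: excludes underlines and lines above the word.
    band_top = word.y0 + 0.25 * height
    band_bottom = word.y0 + 0.75 * height
    for line in lines:
        if not (band_top <= line.y <= band_bottom):
            continue
        # Horizontal overlap between the line and the word box.
        overlap = min(word.x1, line.x1) - max(word.x0, line.x0)
        if overlap / width >= min_cover:
            return True
    return False


words = [Box(0, 0, 40, 10, "old"), Box(50, 0, 90, 10, "new")]
lines = [HLine(-2, 42, 5)]    # runs through the middle of "old"
print([w.text for w in words if is_struck(w, lines)])  # ['old']
```

The 25% to 75% band and the 0.6 coverage ratio are arbitrary starting points that would need tuning on real scans.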