I'm currently trying to extract information from lots of PDF forms such as this:
The text 'female' should be extracted here. So, contrary to my title, I'm actually trying to extract text without strikethroughs rather than text with strikethroughs. But if I can identify which words have strikethroughs, I can easily identify the inverse.
Taking inspiration from this post, I came up with the following code:
import os
import glob

from pdf2docx import parse
from docx import Document

lst = []
files = glob.glob(os.getcwd() + r'\PDFs\*.pdf')

for pdf_path in files:
    # pdf2docx writes a .docx next to the PDF with the same base name
    docx_path = os.path.splitext(pdf_path)[0] + '.docx'
    parse(pdf_path, docx_path)

    # Collect every run formatted with a strikethrough
    document = Document(docx_path)
    for p in document.paragraphs:
        for run in p.runs:
            if run.font.strike:
                lst.append(run.text)

    # Delete the intermediate Word document
    os.remove(docx_path)
The above code converts each of my PDF files into a Word document (.docx), searches the Word document for runs of text with strikethrough formatting, extracts that text, and then deletes the Word document.
As you may have rightly suspected, this code is very slow and inefficient, taking about 30 seconds to run on my sample set of 4 PDFs with fewer than 10 pages combined.
I don't believe this is the best way to do this. However, from the research I've done, pdf2docx extracts data from PDFs using PyMuPDF, yet PyMuPDF does not come with the capability to recognise strikethroughs in PDF text. How can this be, when pdf2docx converts strikethroughs in PDFs into .docx perfectly, indicating that the strikethroughs are being recognised at some level?
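For reference, this is roughly the kind of direct PyMuPDF approach I have in mind: a minimal sketch that assumes strikethroughs are drawn as thin horizontal vector lines over the text, so it looks for drawn paths whose bounding box is short and wide and checks whether one crosses the middle of a text span. The size thresholds and the 'form.pdf' filename are just placeholders.

import fitz  # PyMuPDF

def spans_with_strikethrough(pdf_path):
    """Return the text of spans that appear to have a line drawn through them."""
    doc = fitz.open(pdf_path)
    struck = []
    for page in doc:
        # Candidate strike lines: thin, roughly horizontal drawn paths
        lines = []
        for path in page.get_drawings():
            r = path["rect"]
            if r.height <= 2 and r.width > 2:  # heuristic thresholds
                lines.append(r)

        # Compare each text span's bounding box against the candidate lines
        for block in page.get_text("dict")["blocks"]:
            for line in block.get("lines", []):
                for span in line["spans"]:
                    bbox = fitz.Rect(span["bbox"])
                    y_mid = (bbox.y0 + bbox.y1) / 2
                    for lr in lines:
                        overlaps_horizontally = lr.x0 < bbox.x1 and lr.x1 > bbox.x0
                        near_vertical_centre = abs((lr.y0 + lr.y1) / 2 - y_mid) < bbox.height / 4
                        if overlaps_horizontally and near_vertical_centre:
                            struck.append(span["text"])
                            break
    return struck

print(spans_with_strikethrough("form.pdf"))

I haven't verified that this is how the strikethroughs in my forms are actually encoded, which is partly why I'm asking.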
All in all, I would like to seek advice on whether it is possible to extract text with strikethroughs from a PDF using Python. Thank you!