
I'm trying to extract text from PDF files (similar to forms). Currently, I open the file in Chrome, select and copy all the text, paste it into a txt file, and process it into CSV using Python. Chrome gives me fairly structured and uniform data: every page of the PDF results in a similar block of text, which makes it easy to process.

I'd like to extract the text directly from the PDF and process it into CSV format, but I always get messy results because of the way the original PDF is generated. I've tried pdfminer and PyPDF2, but the output gets messy when the form has a missing value in certain fields.

Maybe this is too general a question, but how can I get a more structured result from my extraction?

1 Answer


Not all PDFs have embedded text. Some contain text only as embedded images. Hence, a common solution that works for all PDFs is to use OCR.

Step 1) Convert the PDF to an image

Step 2) Use pytesseract to perform OCR: see "Use pytesseract OCR to recognize text from an image"
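The two steps above could be sketched roughly as follows. This is only a minimal illustration, not a tested recipe for any particular form: it assumes the pdf2image package (which needs Poppler installed) for Step 1, which is my own choice and not something named above, and pytesseract (which needs the Tesseract binary installed) for Step 2. The function names `ocr_pages` and `pdf_to_text_pages` and the 300 DPI default are made up for this example.

```python
from typing import Callable, Iterable, List


def ocr_pages(images: Iterable, recognize: Callable[[object], str]) -> List[str]:
    """Run the given OCR function on each page image.

    Returns one stripped block of text per page, in page order.
    """
    return [recognize(image).strip() for image in images]


def pdf_to_text_pages(pdf_path: str, dpi: int = 300) -> List[str]:
    """Hypothetical helper: render a PDF to images, then OCR every page."""
    # Imported lazily so ocr_pages can be used without these packages.
    # Assumption: pdf2image (with Poppler) and pytesseract (with the
    # Tesseract binary) are installed.
    from pdf2image import convert_from_path
    import pytesseract

    # Step 1: convert each PDF page to a PIL image.
    images = convert_from_path(pdf_path, dpi=dpi)

    # Step 2: recognize the text on each page image.
    return ocr_pages(images, pytesseract.image_to_string)
```

Calling `pdf_to_text_pages("form.pdf")` (the file name is a placeholder) should give one block of text per page, which can then go through the same kind of per-page CSV processing described in the question. OCR alone does not guarantee that empty fields are preserved, so the output may still need checking.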

Joshua
  • Is there any way to detect text that has been highlighted with a physical or digital highlighter? – oldboy Oct 24 '21 at 21:16