-1

I would like to know whether there is a way to extract only the relevant data from a PDF file. Suppose the file contains something like Name:John. Can we somehow automate extracting just this field's value so it can be stored somewhere, such as a predefined database or file? Thanks.

  • So you are asking for a program or an algorithm? I guess a program. Downvote for me as it seems you did not attempt anything to solve your problem. – pawamoy Apr 11 '18 at 16:35
  • 1
    As you don't mention a specific programming runtime, let alone a specific PDF library, I assume you want to program everything yourself. Thus, simply take the PDF specification ISO 32000-1 or ISO 32000-2 and all relevant specifications referenced from there and study them, then start implementing. You may get a proof of concept after a few weeks, and after a few years your implementation may be fairly generic. – mkl Apr 11 '18 at 20:08
  • 1
    Read up on the PDF format - not as simple as you may assume: https://stackoverflow.com/questions/3650957/how-to-extract-text-from-a-pdf and https://stackoverflow.com/questions/7827051/can-i-prevent-abcpdf-from-mashing-words-together-e-g-mashingwordstogether-whe. PDF is a layout language intended to position elements for printing, and not expecting to ever have to edit them. There is no DOM in the sense of an HTML-like setup. – Vanquished Wombat Apr 11 '18 at 22:15

1 Answer

0

Use pdftotext to extract the text content from your PDF file. Then parse the resulting text file with your favorite programming language.

If your PDF doesn't contain real text, only images of text, you will need optical character recognition (OCR) software to extract the text.
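
The pdftotext-then-parse approach could look roughly like the following in Python. This is only a sketch under some assumptions: the pdftotext command-line tool is installed and on the PATH, the PDF has a real text layer, and each field sits on its own line in the form Key:Value, as in the question's Name:John. The function names and the file name form.pdf are made up for illustration.

```python
import re
import subprocess


def pdf_to_text(pdf_path):
    """Run the pdftotext CLI and return the extracted text.

    Passing "-" as the output file makes pdftotext write to stdout;
    -layout tries to keep the original physical layout of the page.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def extract_fields(text, wanted=("Name",)):
    """Collect "Key:Value" pairs for the wanted keys (hypothetical helper).

    Assumes one field per line; whitespace around the key and the value
    is stripped. If a key appears more than once, the last one wins.
    """
    fields = {}
    for line in text.splitlines():
        match = re.match(r"\s*([^:]+?)\s*:\s*(.+?)\s*$", line)
        if match and match.group(1) in wanted:
            fields[match.group(1)] = match.group(2)
    return fields


if __name__ == "__main__":
    # "form.pdf" is a placeholder path, not a file from the question.
    text = pdf_to_text("form.pdf")
    print(extract_fields(text, wanted=("Name",)))
```

The returned dictionary can then be written to a file or a database with whatever library you already use. Real PDFs often split or reorder text in surprising ways, so the regular expression will likely need adjusting for your documents.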

Kris