
I work at a place where I have been tasked with scanning tons and tons of instructions so that they can be digitised (about 10k pages). My scanner takes about 50-60 A4 pages at a time and saves them as one "big" PDF file. The thing is, my boss wants each page to be its own PDF, and each file has to be named after the page number printed on the page itself, not just 1 because it happens to be the first page in the document. The numbering in these instructions jumps around, so it isn't that easy.

What I need help with:

1: How do I retrieve the page number from INSIDE the PDF?
2: How do I do this multiple times (for each PDF document)?

I already have my program set up to create one PDF per page. I hope someone can help :)

1 Answer


To get the page number, you can use OpenCV to crop the part of the page where the number is printed and read it with pytesseract. (Of course, this doesn't work if the page number is not always in the same place.)
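A minimal sketch of the crop-and-read idea, under a few assumptions: each page has already been rendered to an image file (for example a PNG), the Tesseract binary is installed alongside the pytesseract package, and the page number always sits inside a fixed rectangle. The function names and the `box` parameter are made up for illustration, and the box coordinates will have to be measured on your own scans.

```python
import re


def parse_page_number(ocr_text):
    """Return the first run of digits in the OCR output as an int, or None."""
    match = re.search(r"\d+", ocr_text)
    return int(match.group()) if match else None


def read_page_number(image_path, box):
    """OCR the region box = (x, y, width, height) of a page image."""
    # Imported here so parse_page_number can be used without OpenCV/Tesseract.
    import cv2
    import pytesseract

    page = cv2.imread(image_path)
    if page is None:
        raise FileNotFoundError(image_path)

    x, y, w, h = box
    # Crop the page-number area and convert it to grayscale.
    crop = cv2.cvtColor(page[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
    # Otsu thresholding picks the black/white cut-off automatically.
    _, crop = cv2.threshold(crop, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # --psm 7 asks Tesseract to treat the crop as a single line of text.
    text = pytesseract.image_to_string(crop, config="--psm 7")
    return parse_page_number(text)
```

A call could look like `read_page_number("page_001.png", (1000, 1600, 200, 80))`, where the file name and coordinates are placeholders. It returns `None` when no digits are recognised, so those pages can be set aside and named by hand.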

Otherwise, if the 50-60 pages you put into the scanner at a time are numbered consecutively (for example pages 150 to 200, in the right order), you can just specify the starting page number when you scan a batch and increment it each time a page is read.
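The counter approach could be sketched roughly like this. It assumes the one-PDF-per-page files from a batch are in one folder and that their names sort into scan order (for example `page_001.pdf`, `page_002.pdf`). The function names and folder layout are hypothetical. The renamed files go into a separate output folder so that a new name can never clash with a file that has not been renamed yet.

```python
from pathlib import Path


def target_names(start_page, page_count):
    """File names for a batch whose first page is numbered start_page."""
    return [f"{start_page + i}.pdf" for i in range(page_count)]


def rename_batch(folder, out_folder, start_page):
    """Move every PDF in folder to out_folder, named after its page number."""
    # Assumption: sorting by file name gives the order the pages were scanned in.
    files = sorted(Path(folder).glob("*.pdf"))
    out_dir = Path(out_folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    for src, name in zip(files, target_names(start_page, len(files))):
        src.rename(out_dir / name)
```

For a batch that starts at page 150 you would call something like `rename_batch("batch_03", "named", 150)` and change only the last argument for the next batch.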

Let me know if it helps.

  • This sounds promising! And yes, they do correlate. I will let you know if it works :) – Habberkuk Jul 09 '21 at 08:55
  • Pytesseract is a really good shout if the simple offset method doesn't work. I think I remember installing it being slightly painful but once I got it running it was great. – Alex Waygood Jul 09 '21 at 08:57
  • @Habberkuk The fastest to implement is a simple page counter where you change the initial value before scanning a batch, but I think it's also more elegant to use OpenCV + Tesseract. – Victor Jung Jul 09 '21 at 09:26