The main idea is this: I have a large collection of IGCSE past papers, and I need to find which paper a particular question came from when all I have is a screenshot of that question. I want to write a program that takes an image of a question as input, scans a set of PDFs for that question, and outputs the PDF that contains it. I have programming experience, but I'm a bit stuck on how to approach the problem.

Solutions I have tried:

  • I tried combining the PDFs into one mega PDF so I could just search that, but the resulting file is too large.

Solutions I think might work but am not sure about:

  • Writing a program that reads through every single PDF to find the keywords from the image.
Thisas
  • Hi, this is just an idea, but I did a bit of googling and found pdftotext: https://linux.die.net/man/1/pdftotext. You could use it to extract the text of each PDF page, store that text, and then search it. It may be a bit hacky and old-school, and I'm not sure how effective it would be. – Richard Housham Apr 07 '20 at 14:44
  • Another idea: SharePoint has good search and can index PDFs, so you could upload your PDFs there and search away. – Richard Housham Apr 07 '20 at 14:45
  • Here is a question that may help: https://stackoverflow.com/questions/17098675/searching-text-in-a-pdf-using-python – Richard Housham Apr 07 '20 at 14:47
  • So a few options there – Richard Housham Apr 07 '20 at 14:47

1 Answer

Did you try the steps in https://automatetheboringstuff.com/chapter13/ ?

  • Put all the PDFs in the same folder.
  • For each PDF, go through each page.
  • Call extractText() on the page.
  • Use a regex or something similar to search the extracted text for the question string, then output the PDF and page number if it is found.
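
The steps above can be sketched roughly as follows. This is only a sketch under a few assumptions: it uses the third-party pypdf package, whose `extract_text()` is, as far as I know, the current name for the `extractText()` call mentioned above; it assumes the PDFs contain real text rather than scanned images; and it assumes the question text has already been read off the screenshot, either typed by hand or produced by an OCR tool such as pytesseract (not shown here). Because OCR output and PDF text rarely match character for character, the sketch compares normalized words instead of using an exact regex. The function names and the `threshold` value are made up for illustration.

```python
import re
from pathlib import Path


def normalize(text):
    """Lowercase the text and keep only runs of letters and digits, space-separated."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def page_score(question, page_text):
    """Fraction of the question's distinct words that also appear on the page."""
    q_words = set(normalize(question).split())
    if not q_words:
        return 0.0
    p_words = set(normalize(page_text).split())
    return len(q_words & p_words) / len(q_words)


def iter_pages(folder):
    """Yield (pdf_name, page_number, text) for every page of every PDF in folder."""
    # Imported here so the matching helpers above work without pypdf installed.
    from pypdf import PdfReader  # third-party: pip install pypdf

    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        reader = PdfReader(str(pdf_path))
        for number, page in enumerate(reader.pages, start=1):
            # extract_text() can return an empty result for image-only pages.
            yield pdf_path.name, number, page.extract_text() or ""


def find_question(question, pages, threshold=0.8):
    """Return (score, pdf_name, page_number) for pages scoring at least threshold.

    The threshold of 0.8 is an arbitrary starting point, not a tuned value.
    Results are sorted with the best match first.
    """
    hits = []
    for name, number, text in pages:
        score = page_score(question, text)
        if score >= threshold:
            hits.append((score, name, number))
    return sorted(hits, reverse=True)
```

With the PDFs in a folder called, say, `papers` (a hypothetical name), `find_question(question_text, iter_pages("papers"))` would return the candidate PDFs and page numbers. Since only one page is held in memory at a time, this also sidesteps the size problem of merging everything into one mega PDF. If the papers are scanned images, each page would need OCR as well before this kind of matching can work.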

desertnaut
Grom