How to search a set of pdfs, with only an image segment of a page

Question

I have a large collection of IGCSE past papers, and I need to find which paper a particular question came from. All I have is a screenshot of that one question. I want to write a program that takes an image of a question as input, scans a set of PDFs for that question, and outputs the PDF that contains it. I have programming experience, but I'm a bit stuck on how to approach the problem.

Solutions I have tried:

I tried combining the PDFs into one mega PDF so I could search just that one file, but that doesn't work because the resulting file is too large.

Solutions I think might work but not sure:

Writing a program that reads through every single PDF to find the keywords from the image.

This is just an idea, but I did a bit of googling and found pdftotext: https://linux.die.net/man/1/pdftotext. You could use it to get the text of each PDF page, store that text, and then search it. It's maybe a bit hacky and old school, and I'm not sure how effective it would be. – Richard Housham, Apr 07 '20 at 14:44
Another idea: SharePoint has a good search and can handle PDFs. Upload your PDFs there and search away. – Richard Housham, Apr 07 '20 at 14:45
Here is a question that may help: https://stackoverflow.com/questions/17098675/searching-text-in-a-pdf-using-python – Richard Housham, Apr 07 '20 at 14:47

score 1 · Accepted Answer · edited Apr 07 '20 at 14:59


Did you try the steps in https://automatetheboringstuff.com/chapter13/ ?

- Put all the PDFs in the same folder.
- For each PDF, go through each page.
- Perform extractText() on the page.
- Use a regex or something similar to search the extracted text for the question string, then output the PDF and page if found.
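A rough sketch of those steps, in case it helps. It assumes the pypdf package (the maintained successor to the PyPDF2 library used in that chapter, where the method is called extract_text() rather than extractText()), and it assumes you already have the question text as a string, for example typed in by hand or taken from an OCR tool run on the screenshot. The function names here are made up for illustration.

```python
import re
from pathlib import Path


def normalize(text):
    """Lowercase and collapse whitespace so PDF line breaks do not block a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def pages_containing(page_texts, query):
    """Return 1-based page numbers whose normalized text contains the normalized query."""
    needle = normalize(query)
    if not needle:
        return []
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if needle in normalize(text)
    ]


def extract_page_texts(pdf_path):
    """Extract the text of every page. Assumes the pypdf package is installed."""
    # Imported here so the matching helpers above can be used without pypdf.
    from pypdf import PdfReader

    reader = PdfReader(str(pdf_path))
    # extract_text() can come back empty for scanned pages, so fall back to "".
    return [page.extract_text() or "" for page in reader.pages]


def search_folder(folder, query):
    """Yield (pdf file name, page number) for each page in the folder containing the query."""
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        for page_number in pages_containing(extract_page_texts(pdf_path), query):
            yield pdf_path.name, page_number
```

Calling list(search_folder("past_papers", "some phrase from the question")) would give the matching file names and page numbers, where "past_papers" is a placeholder folder name. A plain substring match like this only works if the PDFs contain real text; scanned papers would need OCR on the PDF side too, and a short distinctive phrase tends to match more reliably than the full question.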

answered Apr 07 '20 at 14:50 by Grom · edited Apr 07 '20 at 14:59 by desertnaut

Exactly the type of solution I was searching for. – Thisas Apr 12 '20 at 12:37
