
I am trying to identify images (as opposed to text) within scanned PDF files, ideally using python. Is there any way to do this? As a simple example, say you've scanned a chapter of a book. There are three possible options for a page:

  1. Contains text only
  2. Contains an image only (or multiple)
  3. Contains both text and image(s)

I would like to output a list of pages that fall into category 2 or 3.

iOSBeginner
    This depends a lot on your images and on your text. You'd need to look at your dataset. It could be that images have lots of colour. Text can have colour, but not much, usually. It could also be something like the largest white border you can draw around something that is larger than a character. In this case "white" needs to be "sufficiently white allowing for sensor noise" ... but still. This would recognize square pictures. Maybe not-so-much diagrams. – zBeeble Feb 08 '17 at 19:20
  • How about providing some examples so we are all on the same page? – Mark Setchell Feb 08 '17 at 22:16
  • For case 2, does the PDF have any text at all, like a page number or title or something? You can apply machine learning only if there is a clear demarcation between the two cases. To give you an abstract idea, the PDF needs to be converted to an image. Then convert the image to grayscale and then to a vector form, where each pixel is represented as an array. If there is no text at all, the vectors will have a different pattern compared to the ones with text and images. This pattern is picked up by the neural net and hence it learns. – Arjun Feb 09 '17 at 04:23
  • @MarkSetchell https://archive.org/details/adventureshuckle00twaiiala there's a downloadable pdf – iOSBeginner Feb 09 '17 at 05:11
  • Sorry about that Mark - your answer seems to work well! – iOSBeginner Feb 16 '17 at 19:03

1 Answer


My idea would be to look for features that do not occur in normal text - which might be vertical, black elements spanning multiple lines. My tool of choice is ImageMagick and it is installed on most Linux distros and is available for macOS and Windows. I would just run it in the Terminal at the command prompt.

So, I would use this command - note that, just for illustration, I have placed the original page to the left of the processed page and put a red border around each:

magick page-28.png -alpha off +dither -colors 2 -colorspace gray -normalize -statistic median 1x200 result.png

And I get this:

[Images: page-25.png, page-26.png, page-27.png and page-28.png, each shown with the original page on the left and the processed result on the right]

Explanation of command above...

In the above command, rather than thresholding, I am doing a colour reduction to 2 colours followed by a conversion to greyscale and then normalisation - basically that should choose black and the background colour as the two colours and they will become black and white when converted to greyscale and normalised.

I am then applying a median filter with a structuring element 1 pixel wide and 200 pixels tall, which is taller than a few lines of text, so it should pick out tall features, i.e. vertical lines.

Explanation over

Carrying on...

So, if I invert the image so black becomes white and white becomes black, and then take the mean and multiply it by the total number of pixels in the image, that will tell me how many pixels are part of vertical features:

convert page-28.png -alpha off +dither -colors 2 -colorspace gray -normalize -statistic median 1x200 -negate -format "%[fx:mean*w*h]" info:
90224

convert page-27.png -alpha off +dither -colors 2 -colorspace gray -normalize -statistic median 1x200 -negate -format "%[fx:mean*w*h]" info:
0

So page 28 is not pure text and page 27 is.
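
Since the question mentioned Python, here is a minimal sketch of how you might drive the same pipeline from Python with subprocess - assuming ImageMagick's convert is on your PATH, the pages have already been extracted as page-NN.png, and the threshold of 1000 pixels is just a placeholder you would tune on your own scans:

import subprocess

def vertical_feature_pixels(png_path):
    # Run the same ImageMagick pipeline as above and return the number of
    # pixels that survive the 1x200 median filter, i.e. belong to tall features
    cmd = [
        "convert", png_path,
        "-alpha", "off", "+dither",
        "-colors", "2",
        "-colorspace", "gray",
        "-normalize",
        "-statistic", "median", "1x200",
        "-negate",
        "-format", "%[fx:mean*w*h]",
        "info:",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return float(result.stdout.strip())

# Pages whose count exceeds the (tunable) threshold are assumed to contain images;
# the range here just covers the example pages 25-28 from above
pages_with_images = [n for n in range(25, 29)
                     if vertical_feature_pixels(f"page-{n}.png") > 1000]
print(pages_with_images)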


Here are some tips...

Tip

You can see how many pages there are in a PDF, like this - though there are probably faster methods:

convert -density 18 book.pdf info:
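
One of those probably-faster methods, if you have poppler-utils installed, is pdfinfo. A small Python wrapper - a sketch, assuming pdfinfo prints its usual "Pages:" line - might look like this:

import subprocess

def page_count(pdf_path):
    # Parse the "Pages:" line from pdfinfo (part of poppler-utils)
    out = subprocess.run(["pdfinfo", pdf_path],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        if line.startswith("Pages:"):
            return int(line.split()[1])
    raise ValueError(f"No page count found in pdfinfo output for {pdf_path}")

print(page_count("book.pdf"))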

Tip

You can extract a page of a PDF like this:

convert -density 288 book.pdf[25] page-25.png
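
To extract every page from Python, the same command can be run in a loop - a sketch, assuming you already have the page count from the previous tip and using ImageMagick's zero-based page index:

import subprocess

num_pages = 300  # e.g. the value reported by the page-count tip above

for n in range(num_pages):
    # book.pdf[n] selects a single page; -density 288 controls the rendering resolution
    subprocess.run(["convert", "-density", "288", f"book.pdf[{n}]", f"page-{n}.png"],
                   check=True)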

Tip

If you are doing multiple books, you will probably want to normalise the images so that they are all, say, 1000 pixels tall; then the size of the structuring element (for calculating the median) should be fairly consistent.
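
For example - just a sketch, with the output filename being my own choice - you could resize each extracted page to a fixed height before running the analysis, so the 1x200 median element always covers the same fraction of the page:

import subprocess

# -resize x1000 scales the image to 1000 pixels tall, preserving the aspect ratio
subprocess.run(["convert", "page-28.png", "-resize", "x1000", "page-28-x1000.png"],
               check=True)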

Mark Setchell
  • Your method seems to take a whole 2-3 seconds. Is there a faster approach to this? – Dhruva Aug 02 '18 at 16:32
  • @Dhruva Who knows? It depends on lots of things! Your CPU, your RAM, your OS, the size of your PDFs, the number of PDFs, the resolution of the images within your PDFs, what you actually want to determine... if you have a specific question, feel free to ask a new one (it's free) and maybe include a link back to this one for reference - get the link by clicking `share`. – Mark Setchell Aug 02 '18 at 17:03
  • @MarkSetchell Why don't you just count the black pixels? `numpy` libraries are generally fast at doing such tasks. – Dhruva Aug 03 '18 at 09:53