
So I often run huge double-sided scan jobs on an unintelligent Canon multifunction, which leaves me with a huge folder of JPEGs. Am I insane to consider using PIL to analyze a folder of images to detect scans of blank pages and flag them for deletion?

Leaving the folder-crawling and flagging parts out, I imagine this would look something like (rough sketch below):

  • Check whether the image is greyscale, since I can't assume it will be.
  • If so, detect the dominant range of shades (the background colour).
  • If not, detect the dominant range of shades, restricted to light greys.
  • Determine what percentage of the entire image is made up of those shades.
  • Try to find a threshold that reliably detects pages with type, writing, or imagery.
  • Perhaps test fragments of the image at a time to increase the accuracy of the threshold.
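
For concreteness, here is roughly what I have in mind as a first pass with plain PIL (completely untested; looks_blank, the light-grey cutoff of 200, and the 0.5% ink fraction are all names and numbers I made up):

from PIL import Image

def looks_blank(path, ink_fraction=0.005):
    # Force greyscale so colour and grey scans go through the same path.
    img = Image.open(path).convert('L')
    hist = img.histogram()          # 256 bins of pixel counts
    total = sum(hist)
    # Treat everything darker than the light-grey band as potential ink.
    dark = sum(hist[:200])
    return dark / float(total) < ink_fraction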

I know this is sort of an edge case, but can anyone with PIL experience lend some pointers?

skaffman

3 Answers


Here is an alternative solution, using mahotas and milk.

  1. Start by creating two directories: positives/ and negatives/, where you will manually pick out a few examples.
  2. I will assume that the rest of the data is in an unlabeled/ directory.
  3. Compute features for all of the images in positives and negatives.
  4. Learn a classifier.
  5. Use that classifier on the unlabeled images.

In the code below, I used jug to give you the possibility of running it on multiple processors, but the code also works if you remove every line that mentions TaskGenerator:

from glob import glob
import mahotas
import mahotas.features
import milk
from jug import TaskGenerator


@TaskGenerator
def features_for(imname):
    img = mahotas.imread(imname)
    return mahotas.features.haralick(img).mean(0)

@TaskGenerator
def learn_model(features, labels):
    learner = milk.defaultclassifier()
    return learner.train(features, labels)

@TaskGenerator
def classify(model, features):
    return model.apply(features)

positives = glob('positives/*.jpg')
negatives = glob('negatives/*.jpg')
unlabeled = glob('unlabeled/*.jpg')


features = list(map(features_for, negatives + positives))
labels = [0] * len(negatives) + [1] * len(positives)

model = learn_model(features, labels)

labeled = [classify(model, features_for(u)) for u in unlabeled]

This uses texture features, which is probably good enough, but you can play with other features in mahotas.features if you'd like (or try mahotas.surf, but that gets more complicated). In general, I have found it hard to do classification with the sort of hard thresholds you are looking for unless the scanning is very controlled.
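
If you do go down that road, a rough sketch of combining the Haralick features with, say, local binary patterns could look like this (combined_features is just an illustrative name, and the radius/points values are arbitrary starting points):

import numpy as np
import mahotas
import mahotas.features

def combined_features(imname):
    img = mahotas.imread(imname)
    if img.ndim == 3:
        img = img.mean(2)                 # crude grey conversion
    img = img.astype('uint8')
    texture = mahotas.features.haralick(img).mean(0)
    lbp_hist = mahotas.features.lbp(img, radius=8, points=6)
    return np.concatenate([texture, lbp_hist])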

luispedro
    Impressive libraries you have written! – Christopher O'Donnell Mar 31 '11 at 21:44
  • Apologies for the nit-pick, but the variable features is used twice: once as a function and then as a list, and is later called as a function. Shouldn't the features list be something like features_learned and applied to the learner model without overwriting the original function? That's the only way I was able to apply the snippet. Thanks for the awesome libraries all around. They work great! – TelsaBoil May 08 '11 at 16:32
  • What is `features.haralick` supposed to mean? The GLCM? The statistics proposed to be extracted from the GLCM? But then taking the mean of the latter makes little sense; at the same time, it makes more sense to use the latter as a feature set for classification. So you are using the former and classifying with a single feature per image. Why didn't you use those 14 measurements (or a subset of them) proposed by Haralick? – mmgp Feb 19 '13 at 16:39
  • features.haralick are the 14 measurements by Haralick (actually, by default, the last feature is excluded)! Each of the measurements is computed in 4 directions. Haralick then suggested both averaging them and taking their ``ptp()``, to obtain 28 features. Here, I just did the averaging. – luispedro Feb 20 '13 at 07:18
  • milk is out of date and unmaintained: is there any other solution? – user898678 Feb 24 '19 at 12:58

Just as a first try, sort your image folder by file size. If all scans from one document have the same resolution, the blank pages will certainly result in smaller files than the non-blank ones.

I don't know how many pages you are scanning, but if the number is low enough this could be a simple quick fix.
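
A minimal sketch of that idea (the scans/ directory name and the number of files printed are placeholders):

import os
from glob import glob

# Smallest files first; suspiciously small ones are the blank-page candidates.
scans = sorted(glob('scans/*.jpg'), key=os.path.getsize)
for path in scans[:20]:
    print(os.path.getsize(path), path)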

jilles de wit

A few non-PIL-specific suggestions to consider:

Scans of printed or written material will have lots of high-contrast sharp edges; something like a median filter (to reduce noise) followed by some kind of simple edge detection might do a good job of discriminating real content from blank pages.
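
One possible way to turn that into a score with PIL (edge_score is an illustrative name; the median filter size and whatever cutoff you compare the score against would need tuning):

from PIL import Image, ImageFilter, ImageStat

def edge_score(path):
    img = Image.open(path).convert('L')
    img = img.filter(ImageFilter.MedianFilter(3))    # knock down speckle noise
    edges = img.filter(ImageFilter.FIND_EDGES)
    return ImageStat.Stat(edges).mean[0]             # higher = more likely content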

Testing fragments at a time is useful not only because it might increase your accuracy, but because it might help you to give up early on many pages. Presumably most of your scans are not blank, so you should begin with a simple-minded check that usually identifies non-blank pages as non-blank; only if it says the page might be blank do you need to look more closely.
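
A sketch of that early-exit idea (probably_blank, the tile size, and the standard-deviation cutoff are all illustrative, not tuned):

from PIL import Image, ImageStat

def probably_blank(path, tile=256, cutoff=8.0):
    img = Image.open(path).convert('L')
    w, h = img.size
    for top in range(0, h, tile):
        for left in range(0, w, tile):
            box = (left, top, min(left + tile, w), min(top + tile, h))
            # Any tile with enough variation suggests ink, so stop early.
            if ImageStat.Stat(img.crop(box)).stddev[0] > cutoff:
                return False
    return True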

In case either the illumination or the page itself is nonuniform, you might want to begin by doing something like image = image - filter(image), where filter does a very broad smoothing of some kind. That will reduce the need to identify the dominant shades, as well as cope with cases where the dominant shade isn't quite uniform across the page.
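
A sketch of that background-flattening step, using Pillow's GaussianBlur as the broad smoothing (flatten_background and the radius of 50 are just placeholders):

from PIL import Image, ImageChops, ImageFilter

def flatten_background(path, radius=50):
    img = Image.open(path).convert('L')
    background = img.filter(ImageFilter.GaussianBlur(radius))
    # Ink shows up as the difference from the smoothed background.
    return ImageChops.difference(img, background)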

Gareth McCaughan
  • +1 Good advice. I think maybe even a simple image entropy calculation would be a good enough discriminator of the "emptiness" of a page. http://brainacle.com/calculating-image-entropy-with-python-how-and-why.html – Paul Mar 25 '11 at 06:49
  • Great point, Paul. I work with a histogram every day yet never considered calculating entropy. – Christopher O'Donnell Mar 26 '11 at 02:46
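
For reference, a minimal version of the entropy calculation Paul suggests, using nothing but PIL and the histogram (image_entropy is an illustrative name; any blank/non-blank cutoff has to be found empirically):

import math
from PIL import Image

def image_entropy(path):
    hist = Image.open(path).convert('L').histogram()
    total = float(sum(hist))
    probs = [count / total for count in hist if count]
    # Shannon entropy in bits; blank scans should score noticeably lower.
    return -sum(p * math.log(p, 2) for p in probs)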