1

I have something like 40 million TIFF documents, all 1-bit single page duplex. In about 40% of cases, the back image of these TIFFs is 'blank', and I'd like to remove those blank images before I load the documents into a CMS, to reduce space requirements.

Is there a simple method to look at the data content of each page and delete it if it falls under a preset threshold, say 2% 'black'?

I'm technology agnostic on this one, but a C# solution would probably be the easiest to support. Problem is, I've no image manipulation experience so don't really know where to start.

Edit to add: The images are old scans and so are 'dirty', so this is not expected to be an exact science. The threshold would need to be set to avoid the chance of false positives.

royhowie
Lunatik

3 Answers

3

You probably should:

  • open each image
  • iterate through its pages (using Bitmap.GetFrameCount / Bitmap.SelectActiveFrame methods)
  • access bits of each page (using Bitmap.LockBits method)
  • analyze contents of each page (simple loop)
  • if the contents are worth keeping, copy the data to another image (Bitmap.LockBits and a loop)

This task isn't particularly complex but will require some code to be written. This site contains some samples that you can search for using the method names as keywords.

P.S. I assume that all of the images can be successfully loaded into a System.Drawing.Bitmap.
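The "analyze contents of each page" step can be sketched roughly as follows. This is only an illustration, written in Python to keep it short and dependency-free; it assumes the page has already been decoded into packed 1-bit rows (most significant bit first, each row padded to `stride` bytes), which is roughly the shape of buffer that Bitmap.LockBits exposes for a 1bpp image. The names `black_ratio` and `is_blank` are made up for this sketch, the 2% default comes from the question, and which bit value means "black" depends on the file, so it is a parameter.

```python
def black_ratio(data, width, height, stride, black_bit=0):
    """Return the fraction of black pixels in a packed 1-bit image.

    data      -- raw pixel bytes, rows stored one after another
    width     -- pixels per row (padding bits beyond this are ignored)
    height    -- number of rows
    stride    -- bytes per row, including any padding
    black_bit -- the bit value (0 or 1) that represents black (assumption:
                 this depends on the file's palette / photometric setting)
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if stride * 8 < width or len(data) < stride * height:
        raise ValueError("buffer is too small for the stated dimensions")

    full_bytes, tail_bits = divmod(width, 8)
    # Mask that keeps only the leading `tail_bits` bits of the last byte.
    tail_mask = (0xFF << (8 - tail_bits)) & 0xFF

    ones = 0
    for y in range(height):
        row = data[y * stride:(y + 1) * stride]
        ones += sum(bin(b).count("1") for b in row[:full_bytes])
        if tail_bits:
            ones += bin(row[full_bytes] & tail_mask).count("1")

    total = width * height
    black = ones if black_bit == 1 else total - ones
    return black / total


def is_blank(data, width, height, stride, threshold=0.02, black_bit=0):
    """True when the share of black pixels is below the threshold."""
    return black_ratio(data, width, height, stride, black_bit) < threshold
```

For dirty scans the threshold would have to be tuned against a sample of real pages before trusting it on the full set.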

Bobrovsky
  • Thanks. Did you mean to include a link? – Lunatik Apr 19 '11 at 15:28
  • @Lunatik, not really. I mean that stackoverflow.com contains the samples. For a start, maybe code from my answers to other questions (http://stackoverflow.com/questions/3566650/convert-multipage-tiff-to-png-net/3568649#3568649, http://stackoverflow.com/questions/3414072/convert-bitonal-tiff-to-bitonal-png-in-c/3415886#3415886) will be useful for your task. – Bobrovsky Apr 19 '11 at 15:47
1

You can do something like that with DotImage (disclaimer, I work for Atalasoft and have written most of the underlying classes that you'd be using). The code to do it will look something like this:

public void RemoveBlankPages(Stream stm)
{
    List<int> blanks = new List<int>();
    if (GetBlankPages(stm, blanks)) {
        // all pages blank - delete file?  Skip?  Your choice.
    }
    else {
        // memory stream is convenient - maybe a temp file instead?
        using (MemoryStream ostm = new MemoryStream()) {
            // pulls out all the blanks and writes to the temp stream
            stm.Seek(0, SeekOrigin.Begin);
            RemoveBlanks(blanks, stm, ostm);
            CopyStream(ostm, stm); // your own helper: copies first stm to second, truncating at end
        }
    }
}

private bool GetBlankPages(Stream stm, List<int> blanks)
{
    TiffDecoder decoder = new TiffDecoder();
    ImageInfo info = decoder.GetImageInfo(stm);
    for (int i=0; i < info.FrameCount; i++) {
        try {
            stm.Seek(0, SeekOrigin.Begin);
            using (AtalaImage image = decoder.Read(stm, i, null)) {
                if (IsBlankPage(image)) blanks.Add(i);
            }
        }
        catch {
            // bad file - skip? could also try to remove the bad page:
            blanks.Add(i);
        }
    }
    return blanks.Count == info.FrameCount;
}

private bool IsBlankPage(AtalaImage image)
{
    // you might want to configure the command to do noise removal and black border
    // removal (or not) first.
    BlankPageDetectionCommand command = new BlankPageDetectionCommand();
    BlankPageDetectionResults results = command.Apply(image) as BlankPageDetectionResults;
    return results.IsImageBlank;
}

private void RemoveBlanks(List<int> blanks, Stream source, Stream dest)
{
    // blanks needs to be sorted low to high, which it will be if generated from
    // above
    TiffDocument doc = new TiffDocument(source);
    int totalRemoved = 0;
    foreach (int page in blanks) {
        doc.Pages.RemoveAt(page - totalRemoved);
        totalRemoved++;
    }
    doc.Save(dest);
}

You should note that blank page detection is not as simple as "are all the pixels white(-ish)?" since scanning introduces all kinds of interesting artifacts. To get the BlankPageDetectionCommand, you would need the Document Imaging package.
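To give a feel for why a raw pixel count is not enough, here is a deliberately naive sketch of one kind of noise handling: counting only black pixels that touch another black pixel, so isolated specks are ignored. This is not how BlankPageDetectionCommand works (its internals are not described here); the function name and the list-of-rows input format are assumptions made purely for illustration, and Python is used only to keep the example short.

```python
def count_non_isolated_black(pixels):
    """Count black pixels that have at least one black 4-neighbour.

    pixels -- list of equal-length rows, where 1 means black and 0 means white
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    count = 0
    for y in range(height):
        for x in range(width):
            if not pixels[y][x]:
                continue
            # Look up, down, left and right; skip positions off the page.
            neighbours = ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
            if any(0 <= ny < height and 0 <= nx < width and pixels[ny][nx]
                   for ny, nx in neighbours):
                count += 1
    return count
```

On a page such as `[[1, 0, 0, 0], [0, 0, 1, 1], [0, 0, 0, 0]]` the lone speck in the corner is skipped and the two adjacent pixels are counted. Real scans also have borders, skew shadows and bleed-through, which this does nothing about.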

plinth
0

Are you interested in shrinking the files, or do you just want to keep people from wasting their time viewing blank pages? You can do a quick and dirty edit of the files to rid yourself of known blank pages by just patching the offset of the second IFD to be 0x00000000. Here's what I mean - TIFF files have a simple layout if you're just navigating through the pages:

TIFF Header (4 bytes)

First IFD offset (4 bytes - typically points to 0x00000008)

IFD:

Number of tags (2-bytes)

{individual TIFF tags} (12-bytes each)

Next IFD offset (4 bytes)

Just patch the "next IFD offset" to a value of 0x00000000 to "unlink" pages beyond the current one.
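A minimal sketch of that patch, following the layout above, might look like this. It is written in Python for brevity, works on an in-memory copy of the file, and assumes a classic (non-BigTIFF) file; the function name `unlink_pages_after_first` is made up for the example. It handles both byte orders ("II" little-endian, "MM" big-endian).

```python
import struct


def unlink_pages_after_first(tiff):
    """Zero the first IFD's "next IFD offset" in place.

    tiff -- bytearray holding the whole file
    Returns the offset that was stored there before patching.
    """
    order = bytes(tiff[:2])
    if order == b"II":
        endian = "<"   # little-endian
    elif order == b"MM":
        endian = ">"   # big-endian
    else:
        raise ValueError("not a TIFF: bad byte-order mark")

    # Bytes 2-3 hold the magic number 42, bytes 4-7 the first IFD offset.
    magic, first_ifd = struct.unpack_from(endian + "HI", tiff, 2)
    if magic != 42:
        raise ValueError("not a classic TIFF (BigTIFF is not handled)")

    # IFD = 2-byte tag count, then 12 bytes per tag, then the 4-byte link.
    (tag_count,) = struct.unpack_from(endian + "H", tiff, first_ifd)
    next_ifd_pos = first_ifd + 2 + 12 * tag_count

    (old_next,) = struct.unpack_from(endian + "I", tiff, next_ifd_pos)
    struct.pack_into(endian + "I", tiff, next_ifd_pos, 0)
    return old_next
```

Keep in mind that this only hides the later pages from readers that follow the IFD chain. The image data is still in the file, so the file does not get any smaller.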

BitBank