How to delete all images from a PDF without corrupting it using CAM::PDF?

Question

The script below is able to remove all images from a PDF file using CAM::PDF. The output, however, is corrupt. PDF readers are nonetheless able to open it, but they complain about errors. For instance, mupdf says:

error: no XObject subtype specified
error: cannot draw xobject/image
warning: Ignoring errors during rendering
mupdf: warning: Errors found on page

Now, the CAM::PDF documentation on CPAN lists the deleteObject() method under "Deeper utilities", presumably meaning that it is not intended for public use. Moreover, it warns that:

This function does NOT take care of dependencies on this object.

My question is: what is the right way to remove objects from a PDF file using CAM::PDF? If the issue has to do with dependencies, how can I remove an object while taking care of its dependencies?

For how to remove images from a PDF using other tools, see a related question here.

use CAM::PDF;    
my $pdf = new CAM::PDF ( shift ) or die $CAM::PDF::errstr;

foreach my $objnum ( sort { $a <=> $b } keys %{ $pdf->{xref} } ) {
  my $xobj = $pdf->dereference ( $objnum );

  if ( $xobj->{value}->{type} eq 'dictionary' ) {
    my $im = $xobj->{value}->{value};
    if
    (
      defined $im->{Type} and defined $im->{Subtype}
      and $pdf->getValue ( $im->{Type}    ) eq 'XObject'
      and $pdf->getValue ( $im->{Subtype} ) eq 'Image'
    )
    {
      $pdf->deleteObject ( $objnum );
    }
  }
}

$pdf->cleanoutput ( '-' );

Would you have a corrupt pdf that gives the mupdf errors available? I'm debugging a similar issue and it would be of great assistance :) — Darajan, Jan 11 '17 at 09:11

dwarring · Answer 1 · 2016-09-05T08:13:58.553

This uses CAM::PDF, but takes a slightly different approach. Rather than attempting to delete the images, which is pretty hard, it replaces each image with a transparent image.

First, note that we can use ImageMagick to generate a blank PDF that contains nothing but a transparent image:

% convert  -size 200x100 xc:none transparent.pdf

If we view the generated PDF in a text editor, we can find the main image object:

8 0 obj
<<
/Type /XObject
/Subtype /Image
/Name /Im0
...

The important thing to note here is that we have generated a transparent image as object number 8.

It then becomes a matter of importing this object and using it to replace each of the real images in the PDF, effectively blanking them.

use warnings; use strict;
use CAM::PDF;
my $pdf = CAM::PDF->new( shift ) or die "$CAM::PDF::errstr\n";

my $trans_pdf = CAM::PDF->new("transparent.pdf") || die "$CAM::PDF::errstr\n";
my $trans_objnum = 8; # object number of transparent image

foreach my $objnum ( sort { $a <=> $b } keys %{ $pdf->{xref} } ) {
  my $xobj = $pdf->dereference ( $objnum );

  if ( $xobj->{value}->{type} eq 'dictionary' ) {
    my $im = $xobj->{value}->{value};
    if
    (
      defined $im->{Type} and defined $im->{Subtype}
      and $pdf->getValue ( $im->{Type}    ) eq 'XObject'
      and $pdf->getValue ( $im->{Subtype} ) eq 'Image'
    ) {
        $pdf->replaceObject ( $objnum, $trans_pdf, $trans_objnum, 1 );
    }
  }
}

$pdf->cleanoutput ( '-' );

The script now replaces each image in the PDF with the imported transparent image object (object number 8 from transparent.pdf).
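The object number 8 is specific to the transparent.pdf generated above; a different ImageMagick version may number its objects differently. As a rough sketch, the number could be looked up instead of hardcoded, using the same traversal as the script. The helper name image_objnums is made up for this illustration and is not part of CAM::PDF.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of CAM::PDF): return the object numbers of
# all image XObjects in a document, using the same calls as the script above
# ($pdf->{xref}, dereference, getValue).
sub image_objnums {
    my ($pdf) = @_;
    my @found;
    for my $objnum ( sort { $a <=> $b } keys %{ $pdf->{xref} } ) {
        my $obj = $pdf->dereference($objnum) or next;
        next unless $obj->{value}{type} eq 'dictionary';
        my $dict = $obj->{value}{value};
        next unless defined $dict->{Type} and defined $dict->{Subtype};
        push @found, $objnum
            if  $pdf->getValue( $dict->{Type} )    eq 'XObject'
            and $pdf->getValue( $dict->{Subtype} ) eq 'Image';
    }
    return @found;
}

# Usage (requires CAM::PDF and a transparent.pdf generated as above):
#   my $trans_pdf = CAM::PDF->new('transparent.pdf') or die "$CAM::PDF::errstr\n";
#   my ($trans_objnum) = image_objnums($trans_pdf)
#       or die "no image object found in transparent.pdf\n";
```

If the helper returns more than one object number, it is worth opening transparent.pdf in a text editor to decide which one to import.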

Why 8? Where does it come from? We wouldn't be replacing object number 8 in the original document with a transparent image, would we? — n.r., Sep 05 '16 at 06:59
@n.r. 8 is the object number of the image being imported from `transparent.pdf`. I've added more explanation to the answer. — dwarring, Sep 05 '16 at 08:15

score 2 · Answer 2 · answered Sep 05 '16 at 14:36

Another approach, which really deletes the images, is:

1. find and delete image XObjects in resource lists,
2. keep an array with names of deleted resources,
3. substitute same-length whitespace for the corresponding Do operators in each page content,
4. clean up and print.

Note that dwarring's approach is safer, though, as it doesn't have to call $doc->cleanse at the end. The CAM::PDF documentation describes the cleanse method as follows:

Remove unused objects. WARNING: this function breaks some PDF documents because it removes objects that are strictly part of the page model hierarchy, but which are required anyway (like some font definition objects).

I don't know how much of a problem using cleanse can be.

use CAM::PDF;
my $doc = new CAM::PDF ( shift ) or die $CAM::PDF::errstr;

# delete image XObjects among resources
# but keep their names

my @names;

foreach my $objnum ( sort { $a <=> $b } keys %{ $doc->{xref} } ) {
  my $obj = $doc->dereference( $objnum );
  next unless $obj->{value}->{type} eq 'dictionary';

  my $n = $obj->{value}->{value};

  my $resources = $doc->getValue ( $n->{Resources}       ) or next;
  my $resource  = $doc->getValue ( $resources->{XObject} ) or next;

  # dereference explicitly: keys on a hash reference fails on newer Perls
  foreach my $name ( sort keys %{ $resource } ) {
    my $im = $doc->getValue ( $resource->{$name} ) or next;

    next unless defined $im->{Type}
            and defined $im->{Subtype}
            and $doc->getValue ( $im->{Type}    ) eq 'XObject'
            and $doc->getValue ( $im->{Subtype} ) eq 'Image';

    delete $resource->{$name};
    push @names, $name;
  }
}

# delete the corresponding Do operators

if ( @names ) {
  foreach my $p ( 1 .. $doc->numPages ) {
    my $content = $doc->getPageContent ( $p );
    my $s;
    foreach my $name ( @names ) {
      ++$s if $content =~ s{( / \Q$name\E \s+ Do \b )} { ' ' x length $1 }xeg;
    }
    $doc->setPageContent ( $p, $content ) if $s;
  }
}

$doc->cleanse;
$doc->cleanoutput;
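To see what the Do substitution does on its own, here is a minimal standalone sketch that needs no PDF file. The content stream fragment and the sub name blank_do_operators are invented for the illustration; the regular expression is the one from the script above.

```perl
use strict;
use warnings;

# Blank out "/<name> Do" operators in a content stream string, replacing each
# match with the same number of spaces so the overall length is unchanged.
sub blank_do_operators {
    my ( $content, @names ) = @_;
    for my $name (@names) {
        $content =~ s{( / \Q$name\E \s+ Do \b )}{ ' ' x length $1 }xeg;
    }
    return $content;
}

# Made-up content stream fragment with two image invocations.
my $before = "q 200 0 0 100 50 600 cm /Im1 Do Q\n"
           . "q 50 0 0 50 10 10 cm /Im10 Do Q\n";
my $after  = blank_do_operators( $before, 'Im1' );

print $after;
```

Because the pattern requires whitespace right after the name, blanking Im1 leaves the /Im10 invocation alone. The sketch does no real parsing of the content stream, so text that merely looks like a Do operator (for example inside a string) would be blanked as well.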
