
I have a PDF file for which listimages.pl (which uses CAM::PDF) returns nothing, yet PDF::GetImages extracts an image from it. Using the following code I can find the image object, but I don't know how to write it out to a file. I also cannot figure out why the command-line tools don't work.

#!/usr/bin/perl -w
use strict;

use Cwd;
use File::Basename;
use Data::Dumper;
use CAM::PDF;
use CAM::PDF::PageText;
use CAM::PDF::Renderer::Images;

# Use low-precedence "or" so die applies to the whole statement, not just the operand.
my $file = shift @ARGV or die "Usage: get-pdf-images /path/to/file.pdf\n";

my $pdf = CAM::PDF->new($file) or die "$CAM::PDF::errstr\n";

#print $pdf->toString();

foreach my $p ( 1 .. $pdf->numPages() ) {
    my $page = $pdf->getPageContentTree($p);
    my $str = $pdf->getPageText($p);
    if (defined $str) {
#        CAM::PDF->asciify(\$str);
        print $str;
    }

    print "-------------------------------\n";
    my $gs = $page->findImages();
    my @imageNodes = @{$gs->{images}};
    print "Found " . scalar @imageNodes . " images on page $p\n";
    print Data::Dumper->Dump([\@imageNodes],['imageNodes']);
}

If I run `pdfinfo.pl`, it reports:

$ pdfinfo.pl test.pdf
File:         test.pdf
File Size:    4599 bytes
Pages:        1
Author:       þÿadmin01
CreationDate: Fri Jan  3 03:48:53 2014
Creator:      þÿPDFCreator Version 1.7.2
Keywords:
ModDate:      Fri Jan  3 03:48:53 2014
Producer:     GPL Ghostscript 9.10
Subject:
Title:        þÿVision6Card
Page Size:    variable
Optimized:    no
PDF version:  1.4
Security
  Passwd:     none
  Print:      yes
  Modify:     yes
  Copy:       yes
  Add:        yes

The test.pdf file can be downloaded from here: http://imaptools.com:8080/dl/test.pdf

Stephen Woodbridge
  • The image in question is a 3x10 pixel image which is encoded as an inline image. Maybe listimages.pl only recognizes xobject images? Adobe Acrobat Preflight when analyzing the internal PDF structure furthermore says "PDFEngine error: severity:4, system:0, error:3" for this image. Thus, maybe the image embedding is broken and listimages.pl for that reason does not find it? Furthermore I don't see that image when the PDF is displayed. Maybe listimages.pl only extracts visible images? – mkl Jan 16 '14 at 07:47
  • I also got errors from http://www.pdf-tools.com/pdf/validate-pdfa-online.aspx but I do not think that is the problem, because PDF::GetImages and the commandline tool pdfimages both successfully extract the image. I'm using CAM::PDF to extract other information and would like to use it to also extract the images. – Stephen Woodbridge Jan 16 '14 at 18:11

1 Answer


Some parts of CAM::PDF are unfinished. If you look at the source of listimages.pl, you'll see that its content parsing for inline images is somewhat primitive: for example, it doesn't allow unmatched parentheses between BI and EI (which is the case here), so it doesn't find the image. There's also uninlinepdfimages.pl, which uses a different heuristic to find inline images, but for this file it seems to hang, and I don't intend to look into what confuses it. CAM::PDF::Renderer::Images, as used in your code, is another take on the same problem, and it does finally parse the content stream properly, but the library seems to provide no means of extracting the image data from there. Still, if you need it VERY much, I see no technical obstacle (other than your time) to extracting the image programmatically, given the information in @imageNodes (width, height, depth, compression used, image data).
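As a rough illustration of that last step, here is a minimal sketch of writing already-decoded samples to a PGM/PPM file. It does not use any CAM::PDF call: the helper names (`pnm_from_samples`, `write_pnm`) and the hash keys (`width`, `height`, `depth`, `components`, `data`) are hypothetical, and you would have to map them from whatever @imageNodes actually contains. It also assumes the data has already been decompressed and uses 8 bits per component in gray or RGB.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of CAM::PDF): build the contents of a
# binary PNM file (PGM for gray, PPM for RGB) from raw 8-bit samples.
sub pnm_from_samples {
    my (%img) = @_;
    my ($w, $h, $n) = @img{qw(width height components)};

    die "only 8 bits per component are handled\n"
        unless ($img{depth} // 8) == 8;
    die "only 1 (gray) or 3 (RGB) components are handled\n"
        unless $n == 1 || $n == 3;

    # Sanity check: the decoded data must be exactly width * height * components bytes.
    my $expected = $w * $h * $n;
    die "expected $expected bytes, got " . length($img{data}) . "\n"
        unless length($img{data}) == $expected;

    my $magic = $n == 1 ? 'P5' : 'P6';
    return "$magic\n$w $h\n255\n" . $img{data};
}

# Hypothetical helper: write the PNM data to a file in binary mode.
sub write_pnm {
    my ($path, %img) = @_;
    open my $fh, '>:raw', $path or die "cannot write $path: $!\n";
    print {$fh} pnm_from_samples(%img);
    close $fh or die "cannot close $path: $!\n";
    return $path;
}

# Made-up sample data: a 3x10 mid-gray image, the size reported for test.pdf.
my %demo = (
    width      => 3,
    height     => 10,
    depth      => 8,
    components => 1,
    data       => "\x80" x 30,
);
my $pnm = pnm_from_samples(%demo);
```

Calling `write_pnm('image.pgm', %demo)` would then produce a file that most image viewers and ImageMagick can open. Other bit depths, color spaces, or still-compressed data would need extra handling that this sketch leaves out.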

user2846289
  • Agreed. I'm the author of CAM-PDF. When I first wrote it (back in 2002) I was trying to achieve some very specific goals, and I added features as I needed them. Many of the higher-level tools (like listimages.pl and pdftotext.pl) are just heuristics and do not even try to cover all possibilities. – Chris Dolan Jan 17 '14 at 03:13
  • Thanks for all the feedback and suggestions. It turns out that the 3x10 image in the example is not what I wanted anyway. So I took the approach of extracting the text I needed using CAM::PDF and then using ImageMagick to render the PDF as a JPEG. I'm new to manipulating PDFs and I have learned a lot. Thanks! – Stephen Woodbridge Jan 17 '14 at 14:25