I need to extract images from one section of a PDF page.

For example, consider a PDF page that has a couple of images at the top and a couple of images at the bottom. I want to extract only the images at the top of the page.

So far I have tried the following:

  • Crop the PDF with Ghostscript: gs -o$croppedPdfFilepath -sDEVICE=pdfwrite -c "[/CropBox [31.46 690.22 560.54 839]" -c "/PAGES pdfmark" -sPageList=12 -f $originalPdfFilepath
  • Pass the cropped PDF to pdfimages to extract the images: pdfimages -j "$croppedPdfFilepath" $outputDirectory/image

The problem is that pdfimages extracts all the images on that page (from both the top and the bottom), even though the cropped PDF shows only the images at the top of the page when I view it.

After some research, it looks like the CropBox only hides the cropped content from view; the content is still present in the PDF source.

Any guidance on removing the content from the PDF page, or any other approach, would be helpful. I'm using PHP to do this programmatically.

Saumini Navaratnam
  • If you're only interested in getting the top and bottom images, would it be viable to just get the first and last result returned by pdfimages? – Moudi Dec 14 '22 at 11:24
  • @Moudi Hey, I'm interested in getting only the images from the top of the page – Saumini Navaratnam Dec 14 '22 at 11:28
  • In that case, is the first image returned by pdfimages the image at the top of the page? If so, you could achieve this by taking just the first element that pdfimages returns – Moudi Dec 14 '22 at 11:33
  • @Moudi The number of images is dynamic. It can be one or five or N, and I won't know how many there are beforehand. I can calculate where the photos section starts and ends. Thanks for helping. – Saumini Navaratnam Dec 14 '22 at 11:40
  • Permanent removal of content from a PDF or PDF page is termed "redaction". So there's a search term to get you started... – johnwhitington Dec 14 '22 at 17:13

1 Answer

If you need to extract images based on their position on the page, you can do it fairly easily with pdftohtml: generate its XML output, parse it, and check the position of each element through its XML attributes. Here's a very basic example that collects the full path of every image whose top attribute is less than 200, that is, whose top edge is less than 200 units from the top of the page in pdftohtml's coordinates:

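// Path to the source PDF; pdftohtml must be installed and on the PATH.
$pdf   = '/path/to/test.pdf';
$files = [];
// escapeshellarg() keeps paths with spaces or shell metacharacters safe.
$xml   = shell_exec('pdftohtml -stdout -xml ' . escapeshellarg($pdf));
if (!is_string($xml) || $xml === '') {
    exit("pdftohtml produced no output\n");
}
$dom   = new DOMDocument();
$dom->loadXML($xml);
// Each extracted image is described by an <image> element with position attributes.
$images = $dom->getElementsByTagName('image');
foreach ($images as $image) {
    // Attribute values are strings, so cast before comparing numerically.
    $top = (float) $image->getAttribute('top');
    if ($top < 200) {
        $files[] = dirname($pdf) . '/' . $image->getAttribute('src');
    }
}
print_r($files);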

Note: contrary to the man page for pdftohtml, which says that it "generates its output in the current working directory", in my experience it always writes its output to the same directory as the PDF being read.
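
Since the question already has the section boundaries as a CropBox in PDF points (measured from the bottom of the page) and targets a single page, the same idea can be extended to a band on one page. The following is only a sketch under some assumptions: the function name imagesInBand and its parameters are made up for illustration, the XML is assumed to have page elements with number and height attributes and image elements with top, height and src attributes, and the page height in points (842 here, roughly A4) has to be supplied by the caller. The scale between PDF points and the XML coordinates is derived from the page element rather than hard-coded.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: return the src of every <image> on one page that lies
 * fully inside a vertical band. The band is given in PDF points measured from
 * the bottom of the page, the same convention as the CropBox in the question.
 */
function imagesInBand(string $xml, int $pageNumber, float $yBottomPt, float $yTopPt, float $pageHeightPt): array
{
    $dom = new DOMDocument();
    if (!$dom->loadXML($xml)) {
        throw new RuntimeException('Could not parse the XML');
    }

    $matches = [];
    foreach ($dom->getElementsByTagName('page') as $page) {
        if ((int) $page->getAttribute('number') !== $pageNumber) {
            continue; // only look at the requested page
        }

        // XML units per PDF point, derived from the page height in the XML.
        $scale = (float) $page->getAttribute('height') / $pageHeightPt;

        // Flip the y axis: the XML measures "top" downwards from the page top.
        $bandTop    = ($pageHeightPt - $yTopPt) * $scale;
        $bandBottom = ($pageHeightPt - $yBottomPt) * $scale;

        foreach ($page->getElementsByTagName('image') as $image) {
            $top    = (float) $image->getAttribute('top');
            $bottom = $top + (float) $image->getAttribute('height');
            if ($top >= $bandTop && $bottom <= $bandBottom) {
                $matches[] = $image->getAttribute('src');
            }
        }
    }

    return $matches;
}

// Hand-written sample in the assumed shape, not real pdftohtml output.
$sampleXml = <<<XML
<pdf2xml>
  <page number="11" top="0" left="0" height="1263" width="892">
    <image top="20" left="50" width="200" height="150" src="other-page.png"/>
  </page>
  <page number="12" top="0" left="0" height="1263" width="892">
    <image top="20" left="50" width="200" height="150" src="top.png"/>
    <image top="900" left="50" width="200" height="200" src="bottom.png"/>
  </page>
</pdf2xml>
XML;

// Band taken from the CropBox in the question: y from 690.22 to 839 points.
print_r(imagesInBand($sampleXml, 12, 690.22, 839.0, 842.0));
```

With real output, the XML string would come from the shell_exec call shown in the answer, and each returned src may still need to be resolved against the output directory as in the original example. Requiring the whole image to sit inside the band is a design choice; testing only the top edge would also pick up images that straddle the boundary.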