
I needed a bash script that extracts all the raster and vector images from a PDF and converts them to JPG format.

I checked many posts on the web and took most of the ideas from these questions:
  • How can I extract images from a PDF file?
  • Count the number of the raster images in the pdf
  • How to extract a vector figure from pdf?

It works, and I am sharing it because I could not find a solution like this on the web.

However, there are two small issues that I have not been able to fix so far.

  1. If a page contains text, pdf2svg treats the text as vector graphics and generates an extra image containing that text. Is there any way to distinguish text from real vector images?
  2. If there are multiple vector images on one page, pdf2svg generates a single SVG that contains all of them (just as it does for a page that contains text). Is it possible to extract them as separate images?

The bash script:

#!/bin/bash

TMP_DIR=$1
SOURCE_PDF=$2
MAX_WIDTH=1920
MAX_HEIGHT=1080

echo "source: $SOURCE_PDF"


function burst
{
    local source=$1

    # explodes the pages to pdf files (it is necessary for the vector images export)
    # (no backticks: they would run the command's output as a second command)
    /usr/bin/pdftk "$source" burst

    # removes the source pdf (we do not need it any more)
    rm "$source"

    # and the txt files which were generated by the pdftk
    rm -f *.txt
}


# finds the pages as pdf files and calls the check_for_images function
function process_pages {
    local tmp_dir=$1
    local pnum=1

    # sorted, so that pnum follows the order of the burst page files
    for f in $(find . -type f -name "*.pdf" | sort)
    do
        echo "processing page $f"
        check_for_images "$f" "$pnum"
        let "pnum++"
    done
}



function check_for_images {
    local pdf_page=$1
    local pnum=$2

    # checks whether the page contains a raster image
    list_raster_images=$(/usr/bin/pdfimages -list "$pdf_page" | grep -E "(jpeg|png|gif)")
    is_raster_images=${#list_raster_images}

    if (( is_raster_images > 0 )); then
        # it contains raster image(s), extract them
        extract_raster_images "$pdf_page" "$pnum"
    else
        # it does not contain raster image(s), try to extract vector images
        extract_vector_images "$pdf_page" "$pnum"
    fi

    rm "$pdf_page"
}


function extract_raster_images {
    local pdf_page=$1
    local pnum=$2

    # page file name without the .pdf extension
    local pdf_file="${pdf_page%.*}"

    echo "extract all raster image(s) from this page"
    /usr/bin/pdfimages -all "$pdf_page" ./

    # we need to use the very same file name convention, so this part renames them
    # who knows, it might be useful later
    for f in $(find . -regextype sed -regex ".*/-[0-9]\{3\}\.jpg")
    do
        path=$(dirname "$f")
        img_file=$(basename "$f")
        img_ext="${img_file##*.}"
        img_num="${img_file%.*}"
        mv "$f" "$path/$pdf_file$img_num.$img_ext"
    done
}



function extract_vector_images {
    local pdf_page=$1
    local pnum=$2

    # page file name without the .pdf extension
    local pdf_file="${pdf_page%.*}"

    echo "extract vector image from the page as SVG"
    /usr/bin/pdf2svg "$pdf_page" "$pdf_page.svg"

    # just to be sure it is not a raster image
    # (the original tested $is_raster_images, a different variable, so this check never took effect)
    local is_raster_image
    is_raster_image=$(grep -c -i "data:image" "$pdf_page.svg")
    if (( is_raster_image == 0 )); then
        # convert SVG to PNG (it doesn't know JPG format) with fixed sizes, but keep the aspect ratio
        /usr/bin/rsvg-convert -a -w "$MAX_WIDTH" -h "$MAX_HEIGHT" -f png -o "$pdf_page.png" "$pdf_page.svg"
        # convert PNG to JPG
        convert "$pdf_page.png" -background white -flatten -alpha off "$pdf_file-000.jpg"
    fi

    # -f: the png does not exist when the conversion above was skipped
    rm -f *.svg
    rm -f *.png
}


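# stop if the directory is missing, otherwise the rm calls would run in the wrong place
cd "$TMP_DIR" || exit 1
burst "$SOURCE_PDF"
process_pages "$TMP_DIR"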
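
The raster check in check_for_images greps the `pdfimages -list` output for encoding names, so a page whose images are stored with some other encoding would be treated as a vector page. Below is a minimal sketch of an alternative that simply counts the listed rows. It assumes the listing starts with two header lines (the column titles and a dashed rule), and `count_listed_images` is a hypothetical helper name that is not part of the script above. Note that the count would include every row in the listing, such as masks, not only visible pictures.

```shell
#!/bin/bash

# Hypothetical helper: reads `pdfimages -list` output on stdin and prints
# the number of rows after the header. Assumes the first two lines are
# the column titles and the dashed separator.
count_listed_images() {
    awk 'NR > 2 && NF > 0 { n++ } END { print n + 0 }'
}

# Possible use inside check_for_images (needs pdfimages to be installed):
#   n=$(/usr/bin/pdfimages -list "$pdf_page" | count_listed_images)
#   if [ "$n" -gt 0 ]; then extract_raster_images "$pdf_page" "$pnum"; fi
```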

Executing it from PHP:

$tmpName = basename($file['tmp_name']);
$tmpDir  = '/path-of-tmp-dir' . $tmpName . '_extraction';

mkdir($tmpDir);

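// escape both arguments so that spaces or shell metacharacters in the paths cannot break the command
$command = 'extract_pdf_images.sh ' . escapeshellarg($tmpDir) . ' ' . escapeshellarg($file['tmp_name']);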

exec($command);
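
Because the script is started with two positional arguments and then deletes files in its working directory, it may be worth failing early when an argument is missing or wrong. This is only a sketch: `validate_args` is a hypothetical function that does not exist in the script above.

```shell
#!/bin/bash

# Hypothetical guard: returns 0 only if exactly two arguments are given,
# the first is an existing directory and the second an existing file.
validate_args() {
    if [ "$#" -ne 2 ]; then
        echo "usage: extract_pdf_images.sh TMP_DIR SOURCE_PDF" >&2
        return 1
    fi
    if [ ! -d "$1" ]; then
        echo "not a directory: $1" >&2
        return 1
    fi
    if [ ! -f "$2" ]; then
        echo "not a file: $2" >&2
        return 1
    fi
    return 0
}

# Possible use at the top of the script:
#   validate_args "$@" || exit 1
```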

Requirements:

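apt-get install pdftk poppler-utils pdf2svg librsvg2-bin imagemagick

(pdfimages is part of the poppler-utils package, and the convert command comes from imagemagick.)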
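
The script calls several external programs, so a quick check that they can be found may save a confusing failure halfway through a PDF. Here is a sketch using the shell builtin `command -v`. `missing_tools` is a hypothetical helper, and the tool list in the usage comment is taken from the commands the script runs. Since the script hard-codes /usr/bin paths for most of them, a PATH lookup is only an approximation.

```shell
#!/bin/bash

# Hypothetical helper: prints every given command name that cannot be found.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Possible use before calling burst:
#   missing=$(missing_tools pdftk pdfimages pdf2svg rsvg-convert convert)
#   [ -z "$missing" ] || { echo "missing tools: $missing" >&2; exit 1; }
```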
Zoltán Süle