I needed a bash script that extracts all the raster and vector images from a PDF and converts them to JPG format.
I checked many posts on the web and took most of the ideas from these:
How can I extract images from a PDF file?
Count the number of the raster images in the pdf
How to extract a vector figure from pdf?
It works, and I am sharing it because I couldn't find a solution like this on the web.
But there are two small issues that I haven't been able to fix so far.
- If a page contains text, pdf2svg treats the text as vector graphics and generates an extra image containing the text. Is there any way to distinguish the text from the real vector images?
- If there are multiple vector images on one page, pdf2svg generates a single SVG image that contains all of them (just as it does for a page that contains text). Is it possible to extract them into separate images?
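For the first issue, one possible heuristic is to ask pdftotext (from poppler-utils) whether the page has any extractable text before handing it to pdf2svg. This is only a sketch under assumptions: has_visible_text and page_has_text are hypothetical helper names, not part of the script, and the check is not something pdf2svg itself offers. It will not catch text that was converted to outlines, and it would also flag pages that mix text with real vector drawings.

```bash
#!/bin/bash
# Hypothetical helper: succeeds if stdin contains at least one letter or digit.
has_visible_text() {
    grep -q '[[:alnum:]]'
}

# Hypothetical helper: succeeds if pdftotext finds text in the given
# single-page PDF. Assumes pdftotext (poppler-utils) is installed;
# the "-" output argument sends the extracted text to stdout.
page_has_text() {
    pdftotext "$1" - 2>/dev/null | has_visible_text
}
```

A page could then be skipped (or handled differently) with something like: if page_has_text "$pdf_page"; then echo "page contains text"; fi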
the bash script
#!/bin/bash
TMP_DIR=$1
SOURCE_PDF=$2
MAX_WIDTH=1920
MAX_HEIGHT=1080
echo "source: $SOURCE_PDF"
function burst
{
    local source=$1
    # splits the PDF into one PDF file per page (necessary for the vector image export)
    /usr/bin/pdftk "$source" burst
    # removes the source pdf (we do not need it any more)
    rm "$source"
    # and the txt files which were generated by pdftk
    rm -f ./*.txt
}
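# finds the per-page pdf files and calls check_for_images for each of them
function process_pages {
    local tmp_dir=$1
    local pnum=1
    local f
    # sorted so that pnum follows the page order; assumes file names without whitespace
    for f in $(find . -type f -name "*.pdf" | sort)
    do
        echo "processing page $f"
        check_for_images "$f" "$pnum"
        pnum=$((pnum + 1))
    done
}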
function check_for_images {
    local pdf_page=$1
    local pnum=$2
    # checks whether the page contains a raster image
    list_raster_images=$(/usr/bin/pdfimages -list "$pdf_page" | grep -E "(jpeg|png|gif)")
    is_raster_images=${#list_raster_images}
    if (( is_raster_images > 0 )); then
        # it contains raster image(s), extract them
        extract_raster_images "$pdf_page" "$pnum"
    else
        # it does not contain raster image(s), try to extract vector images
        extract_vector_images "$pdf_page" "$pnum"
    fi
    rm "$pdf_page"
}
function extract_raster_images {
    local pdf_page=$1
    local pnum=$2
    local pdf_file="${pdf_page%.*}"
    local f path img_file img_ext img_num
    echo "extract all raster image(s) from this page"
    /usr/bin/pdfimages -all "$pdf_page" ./
    # pdfimages writes -000.jpg, -001.jpg, ...; this part prefixes them with the page name
    # so that every image follows the same file name convention (it might be useful later)
    for f in $(find . -regextype sed -regex ".*/-[0-9]\{3\}\.jpg")
    do
        path=$(dirname "$f")
        img_file=$(basename "$f")
        img_ext="${img_file##*.}"
        img_num="${img_file%.*}"
        mv "$f" "$path/$pdf_file$img_num.$img_ext"
    done
}
function extract_vector_images {
    local pdf_page=$1
    local pnum=$2
    local pdf_file="${pdf_page%.*}"
    local is_raster_image
    echo "extract vector image from the page as SVG"
    /usr/bin/pdf2svg "$pdf_page" "$pdf_page.svg"
    # just to be sure it is not a raster image (embedded rasters show up as data:image URIs)
    is_raster_image=$(grep -c -i "data:image" "$pdf_page.svg")
    if (( is_raster_image == 0 )); then
        # convert SVG to PNG (rsvg-convert cannot write JPG), fitted into the maximum size with the aspect ratio kept
        /usr/bin/rsvg-convert -a -w "$MAX_WIDTH" -h "$MAX_HEIGHT" -f png -o "$pdf_page.png" "$pdf_page.svg"
        # convert PNG to JPG
        convert "$pdf_page.png" -background white -flatten -alpha off "$pdf_file-000.jpg"
    fi
    # -f: the png does not exist when the conversion was skipped
    rm -f ./*.svg ./*.png
}
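# stop if the working directory is not accessible, otherwise files would be deleted elsewhere
cd "$TMP_DIR" || exit 1
burst "$SOURCE_PDF"
process_pages "$TMP_DIR"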
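The file naming used by the script can be sketched in isolation. This assumes pdftk's default burst names (pg_0001.pdf, pg_0002.pdf, ...) and pdfimages output named -000.jpg, -001.jpg, ... when the image root is ./ as above. page_number and target_name are hypothetical helpers for illustration; they are not part of the script.

```bash
#!/bin/bash
# Hypothetical helper: derive the page number from a burst file name,
# e.g. ./pg_0012.pdf -> 12 (assumes pdftk's default pg_%04d.pdf pattern).
page_number() {
    local name
    name=$(basename "$1" .pdf)   # pg_0012
    name=${name#pg_}             # 0012
    # strip leading zeros with sed so the value is never read as octal
    printf '%s\n' "$name" | sed 's/^0*\([0-9]\)/\1/'
}

# Hypothetical helper: build the final image name from the page file and
# the file that pdfimages produced,
# e.g. ./pg_0012.pdf + ./-003.jpg -> ./pg_0012-003.jpg
target_name() {
    local page="$1" img="$2"
    printf '%s%s\n' "${page%.*}" "$(basename "$img")"
}
```

Taking the page number from the file name instead of a loop counter would keep it correct even if find returns the pages in a different order.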
executing it from php
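$tmpName = basename($file['tmp_name']);
$tmpDir = '/path-of-tmp-dir' . $tmpName . '_extraction';
mkdir($tmpDir);
// escape both arguments so that unusual characters in the paths cannot break the command
$command = 'extract_pdf_images.sh ' . escapeshellarg($tmpDir) . ' ' . escapeshellarg($file['tmp_name']);
exec($command);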
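Since the script is started by a web request and deletes files in its working directory, it may be worth checking both arguments before it does anything. A minimal sketch, where validate_args is a hypothetical function that could be called at the top of the script (it is not part of the script as posted):

```bash
#!/bin/bash
# Hypothetical guard: expects an existing directory and a readable regular file.
# Prints a message to stderr and returns non-zero when something is off.
validate_args() {
    if [ "$#" -ne 2 ]; then
        echo "usage: extract_pdf_images.sh TMP_DIR SOURCE_PDF" >&2
        return 2
    fi
    if [ ! -d "$1" ]; then
        echo "not a directory: $1" >&2
        return 1
    fi
    if [ ! -f "$2" ] || [ ! -r "$2" ]; then
        echo "cannot read file: $2" >&2
        return 1
    fi
    return 0
}
```

It could be used as: validate_args "$@" || exit $?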
requirements
apt-get install pdftk poppler-utils pdf2svg librsvg2-bin imagemagick
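To fail early when one of the tools is not installed, the script could check for the commands it calls. A small sketch, assuming the command names used in the script above (pdftk, pdfimages, pdf2svg, rsvg-convert, convert); missing_tools is a hypothetical helper name:

```bash
#!/bin/bash
# Hypothetical helper: prints every tool from its argument list that is
# not found on PATH, one per line, and returns non-zero if any is missing.
missing_tools() {
    local tool status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    return "$status"
}

# Example call with the commands used by the script:
# missing_tools pdftk pdfimages pdf2svg rsvg-convert convert || exit 1
```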