47

I am generating a PDF dynamically. How can I check the number of pages in the PDF using a shell script?

kiritsuku
Manish
  • 1
    Only using builtin shell commands? Or do you "allow" external tools like e.g. pdftk or pdfinfo? – Ocaso Protal Feb 05 '13 at 09:53
  • I'm OK with any means, but I need the page number in a variable (shell script) so that I can pass this parameter to another function. – Manish Feb 06 '13 at 01:21
  • This question could be useful: http://stackoverflow.com/questions/36655478/bash-routine-to-return-the-page-number-of-a-given-line-number-from-text-file – Lacobus Apr 22 '16 at 18:47

11 Answers

73

Without any extra package:

strings < file.pdf | sed -n 's|.*/Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' \
    | sort -rn | head -n 1
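
Why this tends to work (assuming the page-tree objects are not hidden inside compressed object streams): every node of a PDF's page tree carries a /Count entry, and the root node's /Count holds the total page count, so keeping the largest extracted value usually gives the right answer. To inspect the candidate values the pipeline is choosing from, drop the final sort/head stage:

strings < file.pdf | sed -n 's|.*/Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p'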

Using pdfinfo:

pdfinfo file.pdf | awk '/^Pages:/ {print $2}'

Using pdftk:

pdftk file.pdf dump_data | grep NumberOfPages | awk '{print $2}'

You can also recursively sum the total number of pages in all PDFs via pdfinfo as follows:

find . -xdev -type f -name "*.pdf" -exec pdfinfo "{}" ";" | \
    awk '/^Pages:/ {n += $2} END {print n}'
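
Since the question asks for the count in a shell variable, any of these pipelines can be wrapped in command substitution; a minimal sketch using the pdfinfo variant (file.pdf is just a placeholder name):

num_pages="$(pdfinfo file.pdf | awk '/^Pages:/ {print $2}')"
echo "This PDF has $num_pages pages"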
Gabriel Staples
Ocaso Protal
  • 4
    I found that the shell-only method is not always reliable. I have PDF files with only one page that contain several /Count entries with different numbers. I suggest using one of the other two methods. – Crami Jan 26 '18 at 12:25
  • @Crami thanks for the info! Is it possible that you share at least one of these PDFs? – Ocaso Protal Jan 26 '18 at 12:37
  • On Linux, **pdfinfo** (v0.12.4) does not print the correct number of pages: it says `12,052` while Adobe says `20,131`. The first method, however, does report the same number as Adobe. – Alexej Magura Nov 06 '18 at 01:12
  • @ShipluMokaddim It *is* super hacky, but you don't need any additional packages – Ocaso Protal Mar 21 '20 at 16:04
  • 1
    It's important to point out that the reported page count may be affected by compression of the PDF's inner objects. When that is not the case, the number of pages can also appear after '.*/N' or '.*/Pages', and it's not trivial to tell which tag holds the correct value. Still, the shell solution works well and is a great alternative to searching the PDF's trailer dictionary with pdfinfo – Kfcaio Aug 22 '20 at 00:11
  • 2
    You can get the number of pages without the need of `awk` by using the `\K` operator of `grep`. The command to execute would be `pdfinfo file.pdf | grep -Po 'Pages:[[:space:]]+\K[[:digit:]]+'`. – doltes Nov 29 '20 at 19:05
  • Here's another one with `pdftoppm`, which comes pre-installed on Ubuntu: https://stackoverflow.com/a/66963293/4561887. – Gabriel Staples Apr 06 '21 at 05:51
  • How does your `strings` solution work, by the way? Can you please explain it? I don't have any idea what is really contained in a PDF binary. – Gabriel Staples Apr 06 '21 at 05:52
9

The ImageMagick library provides a tool called identify which, combined with counting the lines of its output, gets you what you are after. ImageMagick is an easy install on macOS with brew.

Here is a functional bash script that captures the count in a shell variable and prints it back to the screen...

#!/bin/bash
pdfFile="$1"
echo "Processing $pdfFile"
numberOfPages=$(/usr/local/bin/identify "$pdfFile" 2>/dev/null | wc -l | tr -d ' ')
# identify prints one line of info per page; dump stderr to /dev/null,
# count the lines of output,
# and trim the whitespace from the wc -l output
echo "The number of pages is: $numberOfPages"

And the output of running it...

$ ./countPages.sh aSampleFile.pdf 
Processing aSampleFile.pdf
The number of pages is: 2
$ 
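
If your ImageMagick build supports format escapes, the line counting can be skipped; a sketch of that alternative (the %n escape reports the number of images in the sequence, but it is printed once per page, hence the head):

identify -format "%n\n" "$pdfFile" 2>/dev/null | head -n 1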
np0x
9

The pdftotext utility converts a PDF file to plain text, inserting a page break (a form-feed character, $'\f') between pages:

NAME
       pdftotext - Portable Document Format (PDF) to text converter.

SYNOPSIS
       pdftotext [options] [PDF-file [text-file]]

DESCRIPTION
       Pdftotext converts Portable Document Format (PDF) files to plain text.

       Pdftotext  reads  the PDF file, PDF-file, and writes a text file, text-file.  If text-file is
       not specified, pdftotext converts file.pdf to file.txt.  If text-file is  '-',  the  text  is
       sent to stdout.

There are many ways to combine these tools to solve your problem; choose one of them:

1) pdftotext + grep:

$ pdftotext file.pdf - | grep -c $'\f'

2) pdftotext + awk (v1):

$ pdftotext file.pdf - | awk 'BEGIN{n=0} {if(index($0,"\f")){n++}} END{print n}'

3) pdftotext + awk (v2):

$ pdftotext sample.pdf - | awk 'BEGIN{ RS="\f" } END{ print NR }'

4) pdftotext + awk (v3):

$ pdftotext sample.pdf - | awk -v RS="\f" 'END{ print NR }'
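
5) pdftotext + tr + wc (a sketch counting the form-feed characters themselves, assuming pdftotext emits exactly one per page):

$ pdftotext file.pdf - | tr -dc '\f' | wc -c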

Hope it Helps!

Lacobus
  • WATCH OUT! These different lines might give back different numbers! 1 and 2 gave me 264 on a file, but 3 and 4 returned 286. Not sure about the exact reason. – Sliq Jan 22 '21 at 18:51
8

Here is a version for the command line directly (based on pdfinfo):

for f in *.pdf; do pdfinfo "$f" | grep Pages | awk '{print $2}'; done
Marius Hofert
  • I love this, thank you. Here the filename is printed to the right of the number of pages: for f in *.pdf; do pdfinfo "$f" | grep Pages | awk '{printf $2 }'; echo " $f"; done – user2616155 Feb 03 '21 at 08:00
  • This is what I was looking for. Thanks. – Sajil C K Mar 12 '21 at 05:28
  • This one counts recursively in each folder and prints file name and # of pages. find -name "*.pdf" $1 | while read x; do pdfinfo "$x" | grep Pages | awk '{printf $2 }'; echo " $x"; done – Sajil C K Mar 12 '21 at 05:53
4

Here is a total hack using pdftoppm, which comes preinstalled on Ubuntu (tested on Ubuntu 18.04 and 20.04 at least):

# for a pdf withOUT a password
pdftoppm mypdf.pdf -f 1000000 2>&1 | grep -o '([0-9]*)\.$' \
| grep -o '[0-9]*'

# for a pdf WITH a password which is `1234`
pdftoppm -upw 1234 mypdf.pdf -f 1000000 2>&1 | grep -o '([0-9]*)\.$' \
| grep -o '[0-9]*'

How does this work? Well, if you specify a first page which is larger than the number of pages in the PDF (I specify page number 1000000, which is too large for all known PDFs), it will print the following error to stderr:

Wrong page range given: the first page (1000000) can not be after the last page (142).

So, I pipe that stderr message to stdout with 2>&1, as explained here, then I pipe that to grep to match the (142). part with the regular expression ([0-9]*)\.$, then I pipe that to grep again with the regular expression [0-9]* to extract just the number, which is 142 in this case. That's it!

Wrapper functions and speed testing

Here are a couple wrapper functions to test these:

# get the total number of pages in a PDF; technique 1.
# See this ans here: https://stackoverflow.com/a/14736593/4561887
# Usage (works on ALL PDFs--whether password-protected or not!):
#       num_pgs="$(getNumPgsInPdf "path/to/mypdf.pdf")"
# SUPER SLOW! Putting `time` just in front of the `strings` cmd shows it takes ~0.200 sec on a 142
# pg PDF!
getNumPgsInPdf() {
    _pdf="$1"

    _num_pgs="$(strings < "$_pdf" | sed -n 's|.*/Count -\{0,1\}\([0-9]\{1,\}\).*|\1|p' \
        | sort -rn | head -n 1)"

    echo "$_num_pgs"
}

# get the total number of pages in a PDF; technique 2.
# See my ans here: https://stackoverflow.com/a/66963293/4561887
# Usage, where `pw` is some password, if the PDF is password-protected (leave this off for PDFs
# with no password):
#       num_pgs="$(getNumPgsInPdf2 "path/to/mypdf.pdf" "pw")"
# SUPER FAST! Putting `time` just in front of the `pdftoppm` cmd shows it takes ~0.020 sec OR LESS
# on a 142 pg PDF!
getNumPgsInPdf2() {
    _pdf="$1"
    _password="$2"

    if [ -n "$_password" ]; then
        _password="-upw $_password"
    fi

    _num_pgs="$(pdftoppm $_password "$_pdf" -f 1000000 2>&1 | grep -o '([0-9]*)\.$' \
        | grep -o '[0-9]*')"

    echo "$_num_pgs"
}

Testing them with the time command in front shows that the strings one is extremely slow, taking ~0.200 sec on a 142 pg PDF, whereas the pdftoppm one is very fast, taking ~0.020 sec or less on the same PDF. The pdfinfo technique in Ocaso Protal's answer above is also very fast--the same as the pdftoppm one.
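
For instance, the comparison can be reproduced by prefixing the calls with time (mypdf.pdf is a placeholder file name):

time getNumPgsInPdf "mypdf.pdf"
time getNumPgsInPdf2 "mypdf.pdf"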

See also

  1. These awesome answers by Ocaso Protal.
  2. These functions above will be used in my pdf2searchablepdf project here: https://github.com/ElectricRCAircraftGuy/PDF2SearchablePDF.
Gabriel Staples
  • 1
    This is light years faster than pdftk, important if you are calling this on a lot of PDFs on a dynamic web page. This is the best solution, IMO. – InterLinked Aug 02 '21 at 01:09
3

mupdf/mutool solution:

mutool info tmp.pdf | grep '^Pages' | cut -d ' ' -f 2
Farid Cheraghi
2

Just dug out an old script (in ksh) I found:

#!/usr/bin/env ksh
# Usage: pdfcount.sh file.pdf
#
# Optimally, this would be a mere:
#       pdfinfo file.pdf | grep Pages | sed 's/[^0-9]*//'

[[ "$#" != "1" ]] && {
   printf "ERROR: No file specified\n"
   exit 1
}

numpages=0
while read -r line; do
   num=${line/*([[:print:]])+(Count )?(-)+({1,4}(\d))*([[:print:]])/\4}
   (( num > numpages)) && numpages=$num
done < <(strings "$@" | grep "/Count")
print $numpages
ikaerom
2

If you're on macOS, you can query PDF metadata like this:

mdls -name kMDItemNumberOfPages -raw file.pdf

as seen here https://apple.stackexchange.com/questions/225175/get-number-of-pdf-pages-in-terminal

Gerrit Griebel
2

Another mutool solution making better use of the options:

mutool show file.pdf Root/Pages/Count

cotrane
1

I made a few improvements to Marius Hofert's tip to sum the returned values.

for f in *.pdf; do pdfinfo "$f" | grep Pages | awk '{print $2}'; done | awk '{s+=$1}END{print s}'
  • 1
    Not the downvoter, but I suspect that the reason your answer is receiving negative attention is that it would have been better left as a comment on the answer you reference. – Brian61354270 Feb 17 '20 at 18:14
  • Yes, I know. The problem is I am new here, and stackoverflow only allows to comment with 50 reputation score. I still don't have that. – Leonardo Sapiras Feb 18 '20 at 14:33
0

To build on Marius Hofert's answer, this command uses a bash for loop to show the number of pages, display the filename, and ignore the case of the file extension.

for f in *.[pP][dD][fF]; do pdfinfo "$f" | grep Pages | awk '{printf $2 }'; echo " $f"; done
user2616155
  • I think the use of both `grep` and `awk` in a pipeline is a bit of an overdo, it's better to use `awk` solely, which reduces the pipe count by one. Also, use `shopt -s nocaseglob` to ignore the file extension's case instead of entering every capital letter manually. – Łukasz Rajchel Jun 05 '23 at 12:39
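
Following that comment, a possible bash sketch that drops grep and uses nocaseglob instead of spelling out the bracket pattern:

shopt -s nocaseglob
for f in *.pdf; do
    pdfinfo "$f" | awk -v file="$f" '/^Pages:/ {print $2, file}'
done
shopt -u nocaseglob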