Using Python to pull the number of pages in all the pdf documents in a directory

Question

I am trying to use PyPDF2 to grab the number of pages of every pdf in a directory. I can use .getNumPages() to find the number of pages in one pdf file but I need to walk through a directory and get the number of pages for every file. Any ideas?

Here is the code I have so far:

import pandas as pd
import os
from PyPDF2 import PdfFileReader
df = pd.DataFrame(columns=['fileName', 'fileLocation', 'pageNumber'])
pdf=PdfFileReader(open('path/to/file.pdf','rb'))
for root, dirs, files in os.walk(r'Directory path'):
    for file in files:
        if file.endswith(".pdf"):
            df2 = pd.DataFrame([[file, os.path.join(root,file),pdf.getNumPages()]], columns=['fileName', 'fileLocation', 'pageNumber'])
            df = df.append(df2, ignore_index=True)

This code just adds the number of pages from the first PDF file in the directory to every row of the dataframe. If I try to pass a directory path to PdfFileReader() instead, I get:

PermissionError: [Errno 13] Permission denied

Have you attempted to do this yourself first? If so, you should post your code and then ask for help. StackOverflow isn't a place to get people to do your work for you! — mrpopo, Mar 17 '17 at 14:07
mrpopo I do appreciate that aspect of SO but he does only need two lines of code so maybe we can make an exception :) — Ben Quigley, Mar 17 '17 at 14:19
I'm new to StackOverflow! I edited my post and added my code. — Zfrieden, Mar 17 '17 at 14:22
Try replacing "file" with "f". I don't think it's causing the problem, but `file` is the name of a built-in in Python 2 (it is not a reserved word), so it is best not to shadow it. — Ben Quigley, Mar 17 '17 at 14:47

Ben Quigley · Accepted Answer · 2017-03-17T15:27:07.173

3

Yeah, use

import glob
list_of_pdf_filenames = glob.glob('*.pdf')

to return the list of all PDF filenames in the current working directory.
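If the PDFs also sit in subfolders, glob can search recursively. The following is a minimal sketch, assuming Python 3.5 or later; the helper name `find_pdfs` is made up for illustration and is not part of any library.

```python
import glob
import os


def find_pdfs(directory):
    """Return sorted paths of all .pdf files under `directory`, including subfolders."""
    # '**' combined with recursive=True lets glob descend into subdirectories;
    # it also matches zero directories, so files directly in `directory` are included.
    pattern = os.path.join(directory, "**", "*.pdf")
    return sorted(glob.glob(pattern, recursive=True))
```

Note that on case-sensitive file systems this pattern does not match names ending in `.PDF`, and glob skips hidden files and folders by default.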

**Edit:**

By placing the open() statement inside the loop, I was able to get this code to run on my computer:

import pandas as pd
import os
from PyPDF2 import PdfFileReader
df = pd.DataFrame(columns=['fileName', 'fileLocation', 'pageNumber'])
for root, dirs, files in os.walk(r'/home/benjamin/docs/'):
    for f in files:
        if f.endswith(".pdf"):
            pdf=PdfFileReader(open(os.path.join(root, f),'rb'))
            df2 = pd.DataFrame([[f, os.path.join(root,f), pdf.getNumPages()]], columns=['fileName', 'fileLocation', 'pageNumber'])
            df = df.append(df2, ignore_index=True)
print(df.head())
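On newer library versions the same idea might look like the sketch below. It rests on a few assumptions: PyPDF2 2.x or later, where `PdfReader` and `len(reader.pages)` take the place of `PdfFileReader` and `getNumPages()`, and pandas 2.0 or later, where `DataFrame.append` no longer exists, so the rows are gathered in a plain list first. The function names `pdf_page_count` and `collect_page_counts` are made up for illustration.

```python
import os


def pdf_page_count(handle):
    """Count pages in an open binary PDF file (assumes the PyPDF2 2.x+ API)."""
    # Imported lazily so the directory walk below does not itself require PyPDF2.
    from PyPDF2 import PdfReader
    return len(PdfReader(handle).pages)


def collect_page_counts(directory, count_pages=pdf_page_count):
    """Walk `directory` and return one dict per PDF found."""
    rows = []
    for root, _dirs, files in os.walk(directory):
        for name in sorted(files):
            if not name.lower().endswith(".pdf"):
                continue
            path = os.path.join(root, name)
            try:
                # Open each file inside the loop, and close it again when done.
                with open(path, "rb") as handle:
                    pages = count_pages(handle)
            except Exception:
                # An unreadable or corrupt file gets None instead of aborting the walk.
                pages = None
            rows.append({"fileName": name, "fileLocation": path, "pageNumber": pages})
    return rows
```

The result can then be turned into a dataframe in one step, for example `pd.DataFrame(collect_page_counts(r'/home/benjamin/docs/'), columns=['fileName', 'fileLocation', 'pageNumber'])`. Building the frame once from a list is also cheaper than appending to it row by row.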

edited Mar 17 '17 at 15:27

answered Mar 17 '17 at 14:18

Ben Quigley

Thank you for your help! I can find the list of all PDF filenames in the directory no problem. I'm having trouble finding the number of pages in these PDF files in the directory. – Zfrieden Mar 17 '17 at 14:26
@Dillanm That is what I have been using. I just can't seem to figure out how to use that and iterate through a directory to get the number of pages for each PDF file. – Zfrieden Mar 17 '17 at 14:48
@Zfrieden what is the actual value that you're using for 'Directory path'? And, if you comment out the last two lines for a moment and instead just write `print(root, file)`, what does it print? – Ben Quigley Mar 17 '17 at 14:49
@Benjamin the 'Directory path' is a path to a local folder on my desktop. If I print(root,file) the output is every file name in the folder with the file path. – Zfrieden Mar 17 '17 at 14:58
Why is the `open()` statement not inside the loop? I would think that you would want to individually open each PDF in order to read its page numbers, right? – Ben Quigley Mar 17 '17 at 15:09
In other words, for each PDF file in the loop, you are trying to make a dataframe with `pdf.getNumPages()`. What is `pdf`? It's defined before the loop, so it's a constant: a PDF file reader with a single, open file: `PdfFileReader(open('path/to/file.pdf','rb'))`. Whatever's in there, it's not changing for each file in the loop; so it's not anything to do with the files you're trying to have it look at. – Ben Quigley Mar 17 '17 at 15:17
@Benjamin You're right, the open() statement should be inside the loop, but I am still unable to put the directory path in the open() statement: I get a permission error when I try. If I instead put the path to a single file in the open() statement, the 'pageNumber' column of the dataframe just repeats that one PDF's page count on every row. – Zfrieden Mar 17 '17 at 15:18
@Benjamin Your edit worked perfectly. Thank you for your help! – Zfrieden Mar 17 '17 at 16:05

score 0 · Answer 2 · answered Nov 20 '19 at 13:27

step 1:-

pip install PyPDF2 requests

step 2:-

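import io
import requests
import PyPDF2

# Placeholder URL: requests.get() needs a full URL with a scheme, not a bare file name.
url = 'https://example.com/sample.pdf'
response = requests.get(url)
response.raise_for_status()  # stop early if the download failed
# Wrap the downloaded bytes in a file-like object so PyPDF2 can read them.
with io.BytesIO(response.content) as open_pdf_file:
    read_pdf = PyPDF2.PdfFileReader(open_pdf_file)
    num_pages = read_pdf.getNumPages()
    print(num_pages)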
