How to read pdf files one by one from a folder in python

Question

I am reading PDF files and trying to extract keywords from them with NLP techniques. Right now the program accepts one PDF at a time. I have a folder, say on the D drive, named 'pdf_docs', which contains many PDF documents. My goal is to read each PDF file from that folder, one by one. How can I do that in Python? The code that works so far is below.

import PyPDF2

file = open('abc.pdf','rb')


fileReader = PyPDF2.PdfFileReader(file)

count = 0

while count < 3:

    pageObj = fileReader.getPage(count)
    count +=1
    text = pageObj.extractText()

Possible duplicate of [How can I iterate over files in a given directory?](https://stackoverflow.com/questions/10377998/how-can-i-iterate-over-files-in-a-given-directory) — tripleee, Oct 28 '18 at 09:45

Raoslaw Szamszur · Accepted Answer · 2018-10-28T09:50:01.437

First, list all the files in that directory:

from os import listdir
from os.path import isfile, join
onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]

Then run your code for each file in that list:

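import PyPDF2
from os import listdir
from os.path import isfile, join

mypath = 'D:/pdf_docs'  # the folder from the question; adjust to your own path

onlyfiles = [f for f in listdir(mypath) if isfile(join(mypath, f))]
for file in onlyfiles:
    # listdir() returns bare names, so join each one with the folder before opening
    with open(join(mypath, file), 'rb') as pdf:
        fileReader = PyPDF2.PdfFileReader(pdf)

        count = 0
        # Read at most the first 3 pages; shorter documents stop at their last page
        while count < min(3, fileReader.numPages):
            pageObj = fileReader.getPage(count)
            count += 1
            text = pageObj.extractText()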

os.listdir() returns everything in a directory, both files and subdirectories. The isfile() check removes the subdirectories but not the non-PDF files, so either keep only PDF files in that folder or filter the list by extension.
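The filtering mentioned above could look like the following. This is a minimal sketch, not part of the original answer: the function name `list_pdf_paths` is made up for illustration, and it assumes that a `.pdf` extension (in any letter case) is a good enough test for "is a PDF".

```python
import os


def list_pdf_paths(folder):
    """Return full paths of the PDF files directly inside `folder`, sorted by name.

    `list_pdf_paths` is a hypothetical helper name, not a library function.
    """
    paths = []
    for name in os.listdir(folder):
        full_path = os.path.join(folder, name)
        # Skip subdirectories and anything without a .pdf extension (case-insensitive)
        if os.path.isfile(full_path) and name.lower().endswith(".pdf"):
            paths.append(full_path)
    return sorted(paths)
```

Because the function returns full paths, each item can be passed to `open(path, 'rb')` directly, without joining it with the folder again.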

Edit 1

You can also use the glob module, which does pattern matching.

>>> import glob
>>> print(glob.glob('/home/rszamszur/*.sh'))
['/home/rszamszur/work-monitors.sh', '/home/rszamszur/default-monitor.sh', '/home/rszamszur/home-monitors.sh']

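Both the os module and glob are part of the standard library and work on all platforms, including Windows. The key difference is that glob filters by a shell-style pattern, while os.listdir() returns every entry in the directory.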
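One practical difference is in what each call returns: os.listdir() gives bare file names, while glob.glob() gives paths that include the directory part of the pattern. The sketch below puts the two side by side; `compare_listing` is a made-up name used only for this illustration.

```python
import glob
import os


def compare_listing(folder):
    """Return (names from os.listdir, paths from glob.glob) for the PDFs in `folder`.

    `compare_listing` is a hypothetical helper, used only to show the difference.
    """
    # os.listdir() yields bare names such as 'a.pdf'
    names = sorted(n for n in os.listdir(folder) if n.endswith(".pdf"))
    # os.path.join() builds the pattern with the right separator for the current OS;
    # glob.glob() yields paths such as '<folder>/a.pdf'
    paths = sorted(glob.glob(os.path.join(folder, "*.pdf")))
    return names, paths
```

With glob, the results can be opened as they are; with os.listdir(), each name has to be joined with the folder first.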

score 1 · Answer 2 · answered Oct 28 '18 at 09:50

You can use glob to pattern-match and get a list of all the PDF files in your directory.

import glob

pdf_dir = "/foo/dir"

pdf_files = glob.glob("%s/*.pdf" % pdf_dir)
for file in pdf_files:
    do_your_stuff()
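As a sketch of what could stand in for `do_your_stuff()`, the loop can take the per-file work as a function argument. The names `process_pdf_folder` and `handler` are assumptions made for this example, and catching every exception is a deliberate simplification so that one unreadable file does not stop the whole batch.

```python
import glob
import os


def process_pdf_folder(pdf_dir, handler):
    """Call handler(path) for every *.pdf in pdf_dir and return {path: result}.

    `process_pdf_folder` and `handler` are hypothetical names for illustration.
    """
    results = {}
    # sorted() gives a stable order; glob.glob() itself does not promise one
    for path in sorted(glob.glob(os.path.join(pdf_dir, "*.pdf"))):
        try:
            results[path] = handler(path)
        except Exception as exc:
            # Report the problem and continue with the next file
            print(f"Skipping {path}: {exc}")
    return results
```

The handler would be whatever extracts text or keywords from a single file, for example the PyPDF2 code from the question wrapped in a function that takes a path.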

olisch


sudhagar narayanan · Answer 3 · 2019-04-07T06:42:07.440

import PyPDF2
import re
import glob

# Full path of your directory
mypath = "dir"
for file in glob.glob(mypath + "/*.pdf"):
    print(file)
    if file.endswith('.pdf'):
        with open(file, "rb") as pdf:
            fileReader = PyPDF2.PdfFileReader(pdf)
            count = 0
            # Walk the pages in order, from the first to the last
            while count < fileReader.numPages:
                pageObj = fileReader.getPage(count)
                text = pageObj.extractText()
                print(text)
                # Numbers found on this page
                num = re.findall(r'[0-9]+', text)
                print(num)
                count += 1
    else:
        print("not in format")

Let's go through the code. Python cannot read PDF files out of the box, so install the PyPDF2 package and import it. glob.glob() lists the files in the directory that match the pattern, and the for loop visits them one at a time. The if condition checks that the file name ends in .pdf. PdfFileReader then opens each file, numPages gives its page count, and the while loop extracts and prints the text of every page.
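The page loop described above can also be written as a small function that returns the text instead of printing it. This is a sketch under one assumption: the reader object exposes `numPages` and `getPage(i).extractText()`, which is the PdfFileReader interface used in this thread. Newer PyPDF2 releases renamed these members, so check the installed version. The name `extract_all_text` is made up for this example.

```python
def extract_all_text(reader):
    """Return the text of every page, in page order, joined by newlines.

    `reader` is assumed to behave like PyPDF2.PdfFileReader as used above:
    it has a `numPages` attribute and a `getPage(index)` method whose result
    has an `extractText()` method. `extract_all_text` is a hypothetical name.
    """
    page_texts = []
    for page_number in range(reader.numPages):
        page = reader.getPage(page_number)
        page_texts.append(page.extractText())
    return "\n".join(page_texts)
```

Returning one string per document makes it easier to hand the result to the keyword extraction step than printing page by page.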

You should explain your answer in more detail. – Azzabi Haythem, Apr 06 '19 at 21:26
