How to retrieve ALL pages from PDF as a single string in Python 3 using PyPDF2

Question

In order to get a single string from a multi-paged PDF I'm doing this:

import PyPDF2
pdfFileObject = open('sample.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
count = pdfReader.numPages
for i in range(count):
    page = pdfReader.getPage(i)
    output = page.extractText()
output

The result is a string from a single page (the last page in the document), just as it should be according to the PyPDF2 documentation. I used this method because I've seen some people suggest it for reading a whole PDF, which does not work in my case.

Obviously, this is a basic operation, and I apologize in advance for my lack of experience. I tried other solutions like Tika, PDFMiner and Textract, but PyPDF2 seems to be the only one that has let me get this far.

Any help would be appreciated.

Update:

As suggested, I defined output as a list and then appended all pages to it (or so I thought) in a loop like this:

for i in range(count):
    page = pdfReader.getPage(i)
    output = []
    output.append(page.extractText())

The result, though, is a list containing a single string, like ['sample content from the last page of PDF']

@AMC I guess... But it's impossible to `concat str to bytes` — Gavrk, Feb 13 '20 at 02:05
I'm not sure I understand how that relates to my question, sorry. — AMC, Feb 13 '20 at 02:07
@AMC If I use `output += page.extractText()` to avoid overwriting, as suggested below, I get `TypeError: can't concat str to bytes` — Gavrk, Feb 13 '20 at 02:12
How do you define `output`? In any case, what I had in mind was using something like a list. — AMC, Feb 13 '20 at 02:15
@AMC As a string. Sorry, I don't quite understand. You mean to get an output as a list of strings retrieved from each page? How to get such a list if `getPage` takes a single page number as an argument? — Gavrk, Feb 13 '20 at 02:23
_As a string._ Then that explains the error, right? _Sorry, I don't quite understand. You mean to get an output as a list of strings retrieved from each page? How to get such a list if getPage takes a single page number as an argument?_ All I meant is that you could define `output` as a list and then append the result of `page.extractText()` where you're currently assigning it to `output`. — AMC, Feb 13 '20 at 02:25
@AMC Thank you, but it creates list with a single string like `['sample content from the last page of PDF']`. How can I loop over the whole range of pages? I posted that piece of code in the question update. — Gavrk, Feb 13 '20 at 02:41
Look at where you defined the list, it’s a similar issue to the first one. — AMC, Feb 13 '20 at 02:50
@AMC Sure! Certainly I'm not the only beginner who does not know how to loop properly =) — Gavrk, Feb 13 '20 at 05:48

Thaer A · Accepted Answer · 2020-02-13T02:47:10.573

6

Could it be because of this line:

output = page.extractText()

Try this instead:

output += page.extractText()

Because in your code, you're overwriting the value of the `output` variable on every iteration instead of appending to it. Don't forget to declare `output` before the for loop, that is, put `output = ''` before `for i in range(count):`.
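The fix above could be sketched as follows. This is only a minimal sketch, assuming the PyPDF2 1.x API used in the question (`PdfFileReader`, `numPages`, `getPage`, `extractText`); the function names `pages_to_string` and `pdf_to_string` are made up for illustration. Note that `output` starts as a `str`: the `TypeError: can't concat str to bytes` mentioned in the comments is what Python raises when the left-hand side of `+=` is a bytes object.

```python
def pages_to_string(reader):
    """Concatenate the text of every page of a PyPDF2 1.x style reader."""
    output = ''  # a str, created once, before the loop
    for i in range(reader.numPages):
        page = reader.getPage(i)
        output += page.extractText()  # append instead of overwriting
    return output


def pdf_to_string(path):
    """Hypothetical helper: return the text of all pages in `path` as one str."""
    import PyPDF2  # assumes the PyPDF2 1.x API from the question

    # The file itself is opened in binary mode; extractText() still returns str.
    with open(path, 'rb') as pdf_file:
        return pages_to_string(PyPDF2.PdfFileReader(pdf_file))
```

With the file from the question, `pdf_to_string('sample.pdf')` would then return the text of all pages as one string.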

edited Feb 13 '20 at 02:47

answered Feb 13 '20 at 01:14

Thaer A

Thank you! Apparently, yes. But now I get `TypeError: can't concat str to bytes`. As I understand it, this is because I pass 'rb' as an argument to 'open'. Without it, though, I get `PdfFileReader stream/file object is not in binary mode`. Is there some other way to convert bytes to a string? – Gavrk Feb 13 '20 at 02:04

What are you trying to do? To write the output to a text file: `with open('sample.txt', 'w') as f: f.writelines(output)`. Don't forget to declare the `output` variable before the for loop, so `output = ''` before `for i in range(count):`. – Thaer A Feb 13 '20 at 02:44

izzleee · Answer 2 · 2020-02-15T07:54:57.210

This code works:

import os, glob, PyPDF2, sys

file_path = 'C:/Users/ipeter/Desktop/Webdriverdownloads'
read_files = glob.glob(os.path.join(file_path,'*.pdf'))

for files in read_files:
    pdfReader = PyPDF2.PdfFileReader(files)
    count = pdfReader.numPages
    output = []
    for i in range(count):
        page = pdfReader.getPage(i)
        output.append(page.extractText())
    print(output)

The first loop reads all files in a folder. The second loop reads all pages in the pdf.

output[0] = pdfpage1
output[1] = pdfpage2
output[2] = pdfpage3

... etc

If you need the entire PDF in one string, you can build newoutput with the join function:

separator = ','
newoutput = separator.join(output)

or simplify:

newoutput = ','.join(output)

score 3 · Answer 3 · answered Feb 13 '20 at 20:44

You're overwriting the output variable each time.

While you could concatenate the strings together using output +=, it's probably safer to use a list instead, in which case you would have output = [] defined outside the loop, and replace output = page.extractText() with output.append(page.extractText()).
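A rough sketch of the list-based variant, again assuming the PyPDF2 1.x reader interface from the question. The names `collect_page_texts` and `join_page_texts` and the newline separator are illustrative choices, not part of PyPDF2.

```python
def collect_page_texts(reader):
    """Return a list with one extracted string per page."""
    texts = []  # created once, outside the loop, so earlier pages are kept
    for i in range(reader.numPages):
        texts.append(reader.getPage(i).extractText())
    return texts


def join_page_texts(texts, separator='\n'):
    """Join the per-page strings into a single string."""
    return separator.join(texts)
```

Keeping the per-page list around makes it easy to look at a single page, for example `texts[0]`, and to choose the separator only when the final string is built.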

score 1 · Answer 4 · answered Sep 10 '21 at 08:30

Try creating output as an empty string first:

output = ""
for i in range(pdfReader.numPages):
    pageObj = pdfReader.getPage(i)
    output += pageObj.extractText()

bitbang