
In order to get a single string from a multi-paged PDF I'm doing this:

import PyPDF2
pdfFileObject = open('sample.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
count = pdfReader.numPages
for i in range(count):
    page = pdfReader.getPage(i)
    output = page.extractText()
output

The result is a string from a single page (the last page in the document), just as it should be according to the PyPDF2 documentation. I used this method because I've read some people suggesting it for reading a whole PDF, which does not work in my case.

Obviously, this is a basic operation, and I apologize in advance for my lack of experience. I tried other solutions like Tika, PDFMiner and Textract, but PyPDF2 seems to be the only one that has let me get this far.

Any help would be appreciated.

Update:

As suggested, I defined output as a list and then (so I thought) appended every page to it in a loop, like this:

for i in range(count):
    page = pdfReader.getPage(i)
    output = []
    output.append(page.extractText())

The result, though, is a list containing a single string, like ['sample content from the last page of PDF']

Gavrk
  • Aren't you overwriting `output` every time? – AMC Feb 13 '20 at 01:17
  • @AMC I guess... But it's impossible to `concat str to bytes` – Gavrk Feb 13 '20 at 02:05
  • I'm not sure I understand how that relates to my question, sorry. – AMC Feb 13 '20 at 02:07
  • @AMC If I use `output += page.extractText()` to avoid overwriting, as suggested below, I get `TypeError: can't concat str to bytes` – Gavrk Feb 13 '20 at 02:12
  • How do you define `output`? In any case, what I had in mind was using something like a list. – AMC Feb 13 '20 at 02:15
  • @AMC As a string. Sorry, I don't quite understand. You mean to get an output as a list of strings retrieved from each page? How to get such a list if `getPage` takes a single page number as an argument? – Gavrk Feb 13 '20 at 02:23
  • _As a string._ Then that explains the error, right? _Sorry, I don't quite understand. You mean to get an output as a list of strings retrieved from each page? How to get such a list if getPage takes a single page number as an argument?_ All I meant is that you could define `output` as a list and then append the result of `page.extractText()` where you're currently assigning it to `output`. – AMC Feb 13 '20 at 02:25
  • @AMC Thank you, but it creates a list with a single string like `['sample content from the last page of PDF']`. How can I loop over the whole range of pages? I posted that piece of code in the question update. – Gavrk Feb 13 '20 at 02:41
  • Look at where you defined the list, it’s a similar issue to the first one. – AMC Feb 13 '20 at 02:50
  • Do you want me to post an answer? – AMC Feb 13 '20 at 03:44
  • @AMC Sure! Certainly I'm not the only beginner who does not know how to loop properly =) – Gavrk Feb 13 '20 at 05:48
  • Done! Let me know if you want me to expand on any area. – AMC Feb 14 '20 at 01:15

4 Answers


Could it be because of this line:

output = page.extractText()

Try this instead:

output += page.extractText()

In your code, you're overwriting the value of the "output" variable on every iteration instead of appending to it. Don't forget to declare "output" before the for loop, i.e. put output = '' before for i in range(count):
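A minimal sketch of that fix. To keep it runnable without a PDF on disk, FakePage and FakeReader are hypothetical stand-ins that mimic only the numPages, getPage and extractText calls used in the question; with a real file you would pass the PyPDF2.PdfFileReader object from the question instead.

```python
class FakePage:
    """Hypothetical stand-in for a PyPDF2 page object."""

    def __init__(self, text):
        self._text = text

    def extractText(self):
        return self._text


class FakeReader:
    """Hypothetical stand-in for PyPDF2.PdfFileReader."""

    def __init__(self, texts):
        self._pages = [FakePage(t) for t in texts]
        self.numPages = len(self._pages)

    def getPage(self, index):
        return self._pages[index]


def read_all_text(reader):
    output = ''  # declared once, before the loop, as a str
    for i in range(reader.numPages):
        # += appends to what is already there; plain = would overwrite it
        output += reader.getPage(i).extractText()
    return output


reader = FakeReader(['first page. ', 'second page. ', 'last page.'])
text = read_all_text(reader)
print(text)
```

Note that the accumulator has to start as a str. Starting from a bytes value such as b'' and then using += with a str is one way to end up with TypeError: can't concat str to bytes, since extractText() returns a str as far as I know.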

Thaer A
  • Thank you! Apparently, yes. This is the error I get: `TypeError: can't concat str to bytes`. As I understand it, this is because I pass 'rb' as an argument to 'open'. But then I get `PdfFileReader stream/file object is not in binary mode`. Is there some other way to convert bytes to a string? – Gavrk Feb 13 '20 at 02:04
  • What are you trying to do? To write the output to a text file: `with open('sample.txt', 'w') as f: f.writelines(output)`. Don't forget to declare the `output` variable before the for loop, i.e. `output = ''` before `for i in range(count):` – Thaer A Feb 13 '20 at 02:44

This code works:

import glob
import os

import PyPDF2

file_path = 'C:/Users/ipeter/Desktop/Webdriverdownloads'
read_files = glob.glob(os.path.join(file_path, '*.pdf'))

for pdf_path in read_files:
    pdfReader = PyPDF2.PdfFileReader(pdf_path)
    count = pdfReader.numPages
    output = []  # created once per file, before the page loop
    for i in range(count):
        page = pdfReader.getPage(i)
        output.append(page.extractText())  # one string per page
    print(output)

The outer loop goes through every PDF file in the folder. The inner loop goes through every page of the current PDF.

output[0] = pdfpage1
output[1] = pdfpage2
output[2] = pdfpage3

... etc

If you need the entire PDF in one string, you can build newoutput with the join function:

separator = ','
newoutput = separator.join(output)

or simplify:

newoutput = ','.join(output)
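
Whatever string join is called on ends up between the pages in the result, so ',' puts a literal comma at every page boundary. A small sketch of the difference, using made-up page strings in place of the real output list (the names with_commas and with_newlines are only for illustration):

```python
# Hypothetical per-page strings standing in for the real `output` list.
output = ['text of page 1', 'text of page 2', 'text of page 3']

with_commas = ','.join(output)     # a comma at every page boundary
with_newlines = '\n'.join(output)  # a line break at every page boundary

print(with_commas)
print(with_newlines)
```

Joining on '\n' or on an empty string may be closer to what you want if the commas are not meant to be part of the text.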
izzleee

You're overwriting the output variable each time.

While you could concatenate the text together using output +=, it's probably safer to use a list instead, in which case you would have output = [] defined outside the loop, and replace output = page.extractText() with output.append(page.extractText()).
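
To illustrate why the position of output = [] matters, here is a small sketch that needs no PDF at all. The pages list is a made-up stand-in for the strings returned by page.extractText(), and the names inside and outside are only for illustration.

```python
# Hypothetical stand-in for the text extracted from each page.
pages = ['page one', 'page two', 'page three']

# List created inside the loop: it is reset on every pass,
# so only the last page survives.
for text in pages:
    inside = []
    inside.append(text)

# List created once, before the loop: every page is kept.
outside = []
for text in pages:
    outside.append(text)

print(inside)   # ['page three']
print(outside)  # ['page one', 'page two', 'page three']
```

The first loop reproduces the behaviour described in the question update; the second is the arrangement suggested here.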

AMC

Try creating output as an empty string first:

output = ""
for i in range(pdfReader.numPages):
    pageObj = pdfReader.getPage(i)
    output += pageObj.extractText()
bitbang