Unable to read a PDF file using PyPDF2. It's showing the output as bytes

Question

Can anyone help me out?

Thanks in advance.

Code :

from PyPDF2 import PdfFileReader

def text_extractor(path):
    with open(path, 'rb') as f:
        pdf = PdfFileReader(f)
        page = pdf.getPage(2)
        print(page)
        text = page.extractText().encode('utf-8')
        print(text)

if __name__ == '__main__':
    path = '/home/ubuntu/Desktop/hi.pdf'
    text_extractor(path)

Output :

{'/Parent': IndirectObject(137, 0), '/CropBox': [0, 0, 960, 540], '/Rotate': 0, '/Resources': {'/ColorSpace': {'/CS0': IndirectObject(155, 0)}, '/XObject': {'/Im0': IndirectObject(6, 0), '/Im1': IndirectObject(8, 0)}, '/Font': {'/TT1': IndirectObject(132, 0), '/TT0': IndirectObject(157, 0), '/TT2': IndirectObject(159, 0)}, '/ProcSet': ['/PDF', '/Text', '/ImageC']}, '/Contents': IndirectObject(5, 0), '/MediaBox': [0, 0, 960, 540], '/Type': '/Page'}

b'65#-\'\n!C,%03D\n!9$*0&30%30\n!E$34&,%&$AA(#6$/#,%\n!F0?860?&3$-A(#%:\n!G+$/&2$"#$H(0I($40"&@#((&4,8&830\n!G+$/&#3&4,8&(#-#/#%:&2$"#$H(0\n!J,@&/,&+$%?(0K&E20"4/+#%:&0(30\n'

Does this answer your question? [How to extract text from a PDF file?](https://stackoverflow.com/questions/34837707/how-to-extract-text-from-a-pdf-file) — Martin Thoma, May 08 '22 at 11:24

NacMacFeegle · Answer 1 · 2018-08-08T13:12:00.930

1

Your problem statement threw me off a bit, but the issue is more basic than it suggests: by calling encode() you are explicitly requesting a sequence of bytes. Please see the official documentation on encodings.

From the documentation:

The rules for translating a Unicode string into a sequence of bytes are called an encoding.

If you ever need to go the other way, decode() is the inverse operation: it converts bytes back into a string and assumes UTF-8 by default. Neither call should be necessary in your case, as the docs state that extractText() returns a Unicode string.
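A minimal sketch of the difference, using only built-in types (the sample string is made up for illustration and is not taken from the PDF):

```python
text = "Variable scope"        # a str, which is what text extraction should return
data = text.encode("utf-8")    # str -> bytes

print(repr(text))   # 'Variable scope'
print(repr(data))   # b'Variable scope'

# decode() is the inverse: bytes -> str (UTF-8 is the default codec)
round_trip = data.decode()
print(round_trip == text)   # True
```

The b'...' prefix in the question's output is simply how Python prints a bytes object, so dropping the encode() call should make that prefix go away.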

Edit: Clarified the further information on decoding.

edited Aug 08 '18 at 13:12

answered Aug 03 '18 at 13:37

Thanks. But it still gives the same output after replacing encode with decode. – sridhar er Aug 06 '18 at 04:43

Sorry for the confusion. I have edited my answer. You should not need either encode or decode as extractText() should return a unicode string object. If the string object you get looks like that (and I assume that it does not correspond to the actual text in the document) then I suggest you try it on a different document. See this SO question for a very similar situation: https://stackoverflow.com/questions/34837707/extracting-text-from-a-pdf-file-using-python – NacMacFeegle Aug 08 '18 at 13:22
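As a quick check, assuming the first two lines of the question's output are representative: decoding those bytes gives the same garbled characters, only as a str. This suggests the text is already garbled when extractText() returns it, and that encode() only changes how it is printed. A possible cause (an assumption, not verified against this PDF) is a font in the document that lacks a usable character mapping.

```python
# First two lines of the bytes printed in the question
raw = b'65#-\'\n!C,%03D\n'

decoded = raw.decode("utf-8")   # bytes -> str
print(type(decoded).__name__)   # str
print(decoded)                  # still unreadable, so encoding is not the culprit
```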

score -1 · Answer 2 · edited Dec 29 '21 at 16:09

import PyPDF2

pdf_text = []  # extracted text of each page goes here
with open('full file path', 'rb') as f:
    pdf_reader = PyPDF2.PdfFileReader(f)
    for num in range(pdf_reader.numPages):
        page = pdf_reader.getPage(num)
        # Note the '()': extractText must be called; without the parentheses
        # you append the method object itself rather than the page text.
        pdf_text.append(page.extractText())

print(pdf_text)

Maybe this will help you.
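Note that PyPDF2 3.0 removed the PdfFileReader, getPage and extractText names, and the project continues as pypdf. A rough sketch of the same loop with the newer API, assuming pypdf is installed; the helper names extract_all_text and read_pdf_text are made up for illustration:

```python
def extract_all_text(reader):
    """Return a list with the extracted text of every page.

    `reader` is expected to behave like a pypdf PdfReader: an object with a
    `.pages` sequence whose items have an `extract_text()` method.
    """
    return [page.extract_text() or "" for page in reader.pages]


def read_pdf_text(path):
    # Imported lazily so extract_all_text stays usable without pypdf installed.
    from pypdf import PdfReader  # assumes `pip install pypdf`
    return extract_all_text(PdfReader(path))
```

Calling read_pdf_text('full file path') should return one str per page, with no encode() or decode() step. Whether the text is readable still depends on the PDF itself.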

edited Dec 29 '21 at 16:09

Ruli

answered Dec 29 '21 at 15:16

Abhishek Farande

Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Dec 29 '21 at 16:09
