Can anyone help me out?

Thanks in advance.

Code:

from PyPDF2 import PdfFileReader

def text_extractor(path):
    with open(path, 'rb') as f:
        pdf = PdfFileReader(f)
        page = pdf.getPage(2)
        print(page)
        text = page.extractText().encode('utf-8')
        print(text)

if __name__ == '__main__':
    path = '/home/ubuntu/Desktop/hi.pdf'
    text_extractor(path)

Output:

{'/Parent': IndirectObject(137, 0), '/CropBox': [0, 0, 960, 540], '/Rotate': 0, '/Resources': {'/ColorSpace': {'/CS0': IndirectObject(155, 0)}, '/XObject': {'/Im0': IndirectObject(6, 0), '/Im1': IndirectObject(8, 0)}, '/Font': {'/TT1': IndirectObject(132, 0), '/TT0': IndirectObject(157, 0), '/TT2': IndirectObject(159, 0)}, '/ProcSet': ['/PDF', '/Text', '/ImageC']}, '/Contents': IndirectObject(5, 0), '/MediaBox': [0, 0, 960, 540], '/Type': '/Page'}

b'65#-\'\n!C,%03D\n!9$*0&30%30\n!E$34&,%&$AA(#6$/#,%\n!F0?860?&3$-A(#%:\n!G+$/&2$"#$H(0I($40"&@#((&4,8&830\n!G+$/&#3&4,8&(#-#/#%:&2$"#$H(0\n!J,@&/,&+$%?(0K&E20"4/+#%:&0(30\n'

Moshe Slavin
sridhar er
  • Does this answer your question? [How to extract text from a PDF file?](https://stackoverflow.com/questions/34837707/how-to-extract-text-from-a-pdf-file) – Martin Thoma May 08 '22 at 11:24

2 Answers

The problem statement threw me off a bit, but the issue is more basic than you indicated: by calling encode('utf-8') you are explicitly requesting a sequence of bytes, which is why the output is printed with a b'...' prefix. Please look at the official documentation on encodings.

From the documentation:

The rules for translating a Unicode string into a sequence of bytes are called an encoding.

If you do need a sequence of bytes for some reason, encode is what produces it; the opposite operation is decode, which turns bytes back into a Unicode string and assumes UTF-8 by default. Neither should be necessary in your case, as the docs state that extractText() already returns a Unicode string.
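To make the encode/decode distinction concrete, here is a minimal sketch in plain Python. It does not touch PyPDF2 at all: the `text` value is a hand-written stand-in for whatever a page might return, so treat it as an assumption rather than real extractor output.

```python
# Illustrative only: a hand-written string stands in for extracted page text.
text = "Variables in Python\nna\u00efve caf\u00e9"

encoded = text.encode("utf-8")     # str -> bytes
print(type(encoded).__name__)      # bytes
print(encoded)                     # printed with a b'...' prefix and escaped newlines

decoded = encoded.decode("utf-8")  # bytes -> str
print(type(decoded).__name__)      # str
print(decoded == text)             # True: the round trip restores the original string
```

Printing the bytes object is what produces output of the b'...\n...' form; printing the string itself shows the characters and real line breaks.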

Edit: Clarified the further information on decoding.

NacMacFeegle
  • Thanks. But it still gives the same output after replacing encode with decode. – sridhar er Aug 06 '18 at 04:43
  • Sorry for the confusion. I have edited my answer. You should not need either encode or decode, as extractText() should return a Unicode string. If the string you get still looks like that (and I assume it does not correspond to the actual text in the document), then I suggest you try it on a different document. See this SO question for a very similar situation: https://stackoverflow.com/questions/34837707/extracting-text-from-a-pdf-file-using-python – NacMacFeegle Aug 08 '18 at 13:22
-1
import PyPDF2

pdf_text = []  # extracted text of each page goes here

# The context manager closes the file once every page has been read.
with open('full file path', 'rb') as f:
    pdf_reader = PyPDF2.PdfFileReader(f)
    for num in range(pdf_reader.numPages):
        page = pdf_reader.getPage(num)
        # Note the '()': without it you append the method object, not the text.
        pdf_text.append(page.extractText())

print(pdf_text)

Maybe this will help you.

Ruli