I am using this code:
from PyPDF2 import PdfFileReader

def text_extractor(path):
    with open(path, 'rb') as f:
        pdf = PdfFileReader(f)
        # get the first page
        page = pdf.getPage(0)
        print(page)
        print('Page type: {}'.format(str(type(page))))
        text = page.extractText()
        print(text)

if __name__ == '__main__':
    path = 'XEROX.pdf'
    text_extractor(path)
But this returns:
{'/Type': '/Page', '/MediaBox': [0, 0, 612, 792], '/Parent': IndirectObject(3, 0),
'/Resources': {'/ProcSet': ['/PDF', '/ImageB', '/Text'],
'/ExtGState': IndirectObject(47, 0), '/Font': IndirectObject(48, 0)},
'/Contents': IndirectObject(5, 0)}
Page type: <class 'PyPDF2.pdf.PageObject'>
!ˆ"#$
[Finished in 0.9s]
Where is the data?
I think this PDF stores its text as binary symbols instead of ASCII. How can I read this information as ASCII or as a normal Python string?
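To check that guess, a small sketch like the following could list the code point and Unicode name of every character in the extracted string. It uses only the standard library; `describe_chars` and `is_ascii` are hypothetical helper names made up for this illustration, and the sample string is simply the text printed above.

```python
import unicodedata


def describe_chars(text):
    """Return (character, code point, Unicode name) for each character in text."""
    rows = []
    for ch in text:
        # Some characters (e.g. control characters) have no Unicode name.
        name = unicodedata.name(ch, "<unnamed>")
        rows.append((ch, "U+{:04X}".format(ord(ch)), name))
    return rows


def is_ascii(text):
    """Return True if every character in text is in the 7-bit ASCII range."""
    return all(ord(ch) < 128 for ch in text)


if __name__ == "__main__":
    # Sample copied from the output above; replace it with the real extractText() result.
    sample = '!ˆ"#$'
    for ch, code_point, name in describe_chars(sample):
        print(repr(ch), code_point, name)
    print("ASCII only:", is_ascii(sample))
```

If most of the characters turn out to be ordinary printable symbols rather than arbitrary bytes, that may suggest the PDF's font uses a custom character mapping instead of the text being binary, though that is only a guess on my part.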
This is what I get when I copy and paste the text from the PDF: