I am trying to extract the contents of a table from a PDF using PyPDF2, but I get the error below when opening the file, and I am not sure why. How can I fix this? Here is the code:

#PDF Table testing
pdf_file = open(r"PDFs/murrumbidgee/Murrumbidgee Unregulated River Water Sources 2012_20200815.pdf")
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
page = read_pdf.getPage(50)
page_content = page.extractText()
print(page_content.encode('utf-8'))

table_list = page_content.split('\n')
l = numpy.array_split(table_list, len(table_list)/7)
for i in range(0, 5):
    print(l[i])

This is the error:

PdfReadWarning: PdfFileReader stream/file object is not in binary mode. It may not be read correctly. [pdf.py:1079]
Traceback (most recent call last):
  File "C:/Users/benjh/Desktop/project/testing_regex.py", line 103, in <module>
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
  File "C:\Users\benjh\anaconda3\envs\project\lib\site-packages\PyPDF2\pdf.py", line 1084, in __init__
    self.read(stream)
  File "C:\Users\benjh\anaconda3\envs\project\lib\site-packages\PyPDF2\pdf.py", line 1689, in read
    stream.seek(-1, 2)
io.UnsupportedOperation: can't do nonzero end-relative seeks

What does "nonzero end-relative seeks" mean?

Dhar_

1 Answer

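Opening the PDF in binary mode ('rb') fixes the error. A file opened in text mode only allows seeking relative to the end of the file with an offset of 0, and PyPDF2 calls stream.seek(-1, 2), which asks for a position one byte before the end.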
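
To see what the message means without involving PyPDF2 at all, here is a small sketch that reproduces it with the standard library only. The temporary file and its contents are made up for the demo; it is not a real PDF.

```python
import io
import os
import tempfile

# Write a few bytes to a throwaway file so the demo needs no real PDF.
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as tmp:
    tmp.write(b"%PDF-1.4 demo")
    path = tmp.name

# Text mode (the default): with whence=2 (end of file) only offset 0 is allowed.
text_mode_error = None
with open(path) as f:
    try:
        f.seek(-1, 2)
    except io.UnsupportedOperation as exc:
        text_mode_error = str(exc)

# Binary mode: seeking one byte back from the end is fine.
with open(path, "rb") as f:
    f.seek(-1, 2)
    last_byte = f.read(1)

os.remove(path)
print(text_mode_error)
print(last_byte)
```

The text-mode seek fails because decoded characters do not map one-to-one onto byte positions, so Python refuses to guess where "one character before the end" is. In binary mode the offset is a plain byte count.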

Dhar_
  • To be clear, it's almost certainly necessary for the code to work even beyond the seeking issue; a PDF reader needs to get the original bytes, not text data, so it should be in binary mode no matter what. – ShadowRanger Sep 17 '20 at 01:06
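
Putting the answer and the comment together, a minimal sketch of the corrected read could look like this. It assumes the same older PyPDF2 API used in the question (PdfFileReader, getNumPages, getPage, extractText); newer releases rename these calls. The helper name extract_page_text is made up for illustration.

```python
def extract_page_text(pdf_path, page_index):
    """Return the extracted text of one page, reading the file as raw bytes."""
    # Imported here so the sketch can be loaded even without PyPDF2 installed.
    import PyPDF2

    # "rb" hands PyPDF2 the original bytes and a stream that supports seek(-1, 2).
    with open(pdf_path, "rb") as pdf_file:
        reader = PyPDF2.PdfFileReader(pdf_file)
        if not 0 <= page_index < reader.getNumPages():
            raise IndexError("page_index is outside the document")
        # Extract while the file is still open; the reader loads pages lazily.
        return reader.getPage(page_index).extractText()
```

With the path from the question, the call would be extract_page_text(r"PDFs/murrumbidgee/Murrumbidgee Unregulated River Water Sources 2012_20200815.pdf", 50), and the returned string can then be split on '\n' as before.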