I am trying to extract the contents of a table in a PDF using PyPDF2, but I get the error below when opening the PDF, and I am not sure why. How can I fix this? Here is the code:
# PDF table testing
import numpy
import PyPDF2

pdf_file = open(r"PDFs/murrumbidgee/Murrumbidgee Unregulated River Water Sources 2012_20200815.pdf")
read_pdf = PyPDF2.PdfFileReader(pdf_file)
number_of_pages = read_pdf.getNumPages()
page = read_pdf.getPage(50)
page_content = page.extractText()
print(page_content.encode('utf-8'))
table_list = page_content.split('\n')
l = numpy.array_split(table_list, len(table_list)/7)
for i in range(0, 5):
    print(l[i])
This is the error:
PdfReadWarning: PdfFileReader stream/file object is not in binary mode. It may not be read correctly. [pdf.py:1079]
Traceback (most recent call last):
  File "C:/Users/benjh/Desktop/project/testing_regex.py", line 103, in <module>
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
  File "C:\Users\benjh\anaconda3\envs\project\lib\site-packages\PyPDF2\pdf.py", line 1084, in __init__
    self.read(stream)
  File "C:\Users\benjh\anaconda3\envs\project\lib\site-packages\PyPDF2\pdf.py", line 1689, in read
    stream.seek(-1, 2)
io.UnsupportedOperation: can't do nonzero end-relative seeks
What does "nonzero end-relative seeks" mean?
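
For reference, the seek call from the last traceback frame, stream.seek(-1, 2), can be reproduced without PyPDF2. The sketch below is only an illustration under assumptions: the helper name end_relative_seek and the small stand-in file are hypothetical and not part of my project. It tries the same seek (offset -1, relative to the end of the file) on a file opened in text mode and again in binary mode.

```python
import io
import os
import tempfile


def end_relative_seek(path, mode):
    """Try seek(-1, 2), i.e. one position before the end of the file.

    Returns "ok" if the seek succeeds, otherwise the error message.
    (Hypothetical helper, written only for this illustration.)
    """
    with open(path, mode) as stream:
        try:
            stream.seek(-1, os.SEEK_END)  # os.SEEK_END == 2
            return "ok"
        except io.UnsupportedOperation as exc:
            return str(exc)


# Small stand-in file; its content is a placeholder, not a valid PDF.
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
    tmp.write(b"%PDF-1.4\n%%EOF")
    sample_path = tmp.name

text_result = end_relative_seek(sample_path, "r")     # text mode, as in open(path)
binary_result = end_relative_seek(sample_path, "rb")  # binary mode

os.remove(sample_path)

print("text mode:  ", text_result)
print("binary mode:", binary_result)
```

Text-mode file objects only allow an offset of 0 when seeking relative to the end of the file, while binary-mode file objects accept a negative offset there. That matches the PdfReadWarning above about the stream not being in binary mode, so the corresponding change in my code would presumably be passing "rb" as the second argument to open().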