I am working on extracting data from a number of PDF documents in Python, testing in Colab. A solution that works in Colab would be ideal, but a local one is fine if that is not possible. There are many entries of interest on each page, so I chose tabula.
The code works well for most of the files but crashes for others.
Can I somehow import the missing .jar (etc.) in Colab, or, if not, how do I install it locally so that the code runs?
Thanks in advance!
Got stderr: Oct 26, 2021 5:54:00 AM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider loadDiskCache
WARNING: New fonts found, font cache will be re-built
Oct 26, 2021 5:54:00 AM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider <init>
WARNING: Building on-disk font cache, this may take a while
Oct 26, 2021 5:54:00 AM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider <init>
WARNING: Finished building on-disk font cache, found 17 fonts
Oct 26, 2021 5:54:00 AM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>
WARNING: Using fallback font 'LiberationSerif' for 'TimesNewRomanPSMT'
Oct 26, 2021 5:54:00 AM org.apache.pdfbox.contentstream.PDFStreamEngine operatorException
SEVERE: Cannot read JPEG2000 image: Java Advanced Imaging (JAI) Image I/O Tools are not installed
Oct 26, 2021 5:54:00 AM org.apache.pdfbox.contentstream.PDFStreamEngine operatorException
SEVERE: Cannot read JPEG2000 image: Java Advanced Imaging (JAI) Image I/O Tools are not installed
... (multiple lines)
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-10-987da78e7e88> in <module>()
      2 regions = []
      3 for i in range(0,len(regions_raw)):
----> 4     regions.append(regions_raw[i]['data'][0][0]['text'])
      5
IndexError: list index out of range
Code (only one region is printed; mostly adapted from https://towardsdatascience.com/how-to-extract-tables-from-pdf-using-python-pandas-and-tabula-py-c65e43bd754):
import tabula as tb
import PyPDF2  # only used for the page count

# Area passed to tabula as [top, left, bottom, right], scaled by the factor fc
box = [2, 0, 4, 13]
fc = 28.28
for i in range(0, len(box)):
    box[i] *= fc

for filename in files:
    # Open the file only long enough to read the page count
    with open(filename, 'rb') as pdftemp:
        pdfReader = PyPDF2.PdfFileReader(pdftemp)
        pagestmp = pdfReader.getNumPages()
    pages = [i + 3 for i in range(pagestmp - 2)]  # leave out first 2 pages
    regions_raw = tb.read_pdf(filename, pages=pages, area=[box], output_format="json")
    regions = []
    for i in range(0, len(regions_raw)):
        # This is the line that raises the IndexError for some files
        regions.append(regions_raw[i]['data'][0][0]['text'])
    print(regions)
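
The IndexError itself is raised when `regions_raw[i]['data']` (or its first row) is an empty list. That may be the case on pages where nothing was extracted from the given area, though I have not confirmed that this is linked to the JPEG2000 messages. Below is a minimal sketch of a guard for that indexing, assuming the same JSON structure as in the loop above. The helper name `first_cell_text` and the sample data are hypothetical and only for illustration.

```python
def first_cell_text(region):
    """Return the text of the first cell of a region, or None if it is empty."""
    rows = region.get("data") or []
    # Guard against a region with no rows, or with an empty first row
    if not rows or not rows[0]:
        return None
    return rows[0][0].get("text")


# Hypothetical sample mimicking the structure indexed in the loop above
regions_raw = [
    {"data": [[{"text": "Region A"}]]},
    {"data": []},  # nothing extracted for this page
]

regions = [first_cell_text(r) for r in regions_raw]
print(regions)
```

With this guard, pages without extracted text show up as `None` in `regions` instead of stopping the loop, which at least makes it easy to see which pages are affected.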