I am extracting data from a number of PDF documents in Python, testing in Colab. A solution that works in Colab would be ideal, but a local one is fine if that is not possible. There are many entries of interest per page, so I chose tabula.

The code works well for most of the files but crashes for others.

Can I somehow import the missing .jar (etc.) in Colab, or, if not, how do I install it locally?

Thanks in advance!

Got stderr: Oct 26, 2021 5:54:00 AM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider loadDiskCache
WARNING: New fonts found, font cache will be re-built
Oct 26, 2021 5:54:00 AM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider <init>
WARNING: Building on-disk font cache, this may take a while
Oct 26, 2021 5:54:00 AM org.apache.pdfbox.pdmodel.font.FileSystemFontProvider <init>
WARNING: Finished building on-disk font cache, found 17 fonts
Oct 26, 2021 5:54:00 AM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont <init>
WARNING: Using fallback font 'LiberationSerif' for 'TimesNewRomanPSMT'
Oct 26, 2021 5:54:00 AM org.apache.pdfbox.contentstream.PDFStreamEngine operatorException
SEVERE: Cannot read JPEG2000 image: Java Advanced Imaging (JAI) Image I/O Tools are not installed
Oct 26, 2021 5:54:00 AM org.apache.pdfbox.contentstream.PDFStreamEngine operatorException
SEVERE: Cannot read JPEG2000 image: Java Advanced Imaging (JAI) Image I/O Tools are not installed
... (multiple lines)

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-10-987da78e7e88> in <module>()
      2 regions = []
      3 for i in range(0,len(regions_raw)):
----> 4     regions.append(regions_raw[i]['data'][0][0]['text'])
      5 

IndexError: list index out of range

Code (only one region is printed; mostly adapted from https://towardsdatascience.com/how-to-extract-tables-from-pdf-using-python-pandas-and-tabula-py-c65e43bd754):

import tabula as tb
from tabula import read_pdf
import PyPDF2 # just for pagecount
from PyPDF2 import PdfFileReader

box = [2,0,4,13]
fc = 28.28       
for i in range(0, len(box)):
    box[i] *= fc

for filename in (files):
  pdftemp=open(filename,'rb')
  pdfReader = PyPDF2.PdfFileReader(pdftemp)
  pagestmp=pdfReader.getNumPages()
  pages=[i+3 for i in range(pagestmp-2)] #leave out first 2 pages

  regions_raw = tb.read_pdf(filename, pages=pages,area=[box],output_format="json")
  regions = []
  for i in range(0,len(regions_raw)):
      regions.append(regions_raw[i]['data'][0][0]['text'])

  print(regions)
  • Yes, I saw the same problem described here: https://stackoverflow.com/questions/63073663/java-error-while-reading-pdf-with-python-using-tabula – Andreas Theil Oct 26 '21 at 06:50

1 Answer

Oh, I've got it. The code works; in some files the data just starts one page later (on page 4). For the page before that, "data" is empty, so indexing it with [0][0] raises the IndexError.
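A minimal sketch of how the empty "data" case could be skipped instead of crashing. The helper name first_cell_texts is hypothetical, and the sample input only mimics the shape of tabula's JSON output as it is indexed in the question (a list of regions, each with a "data" list of rows of cells that have a "text" key).

```python
def first_cell_texts(regions_raw):
    """Return the text of the first cell of each region, skipping empty regions."""
    texts = []
    for region in regions_raw:
        rows = region.get("data", [])
        # An empty "data" list (or an empty first row) is what made
        # regions_raw[i]['data'][0][0] raise IndexError, so skip it here.
        if rows and rows[0]:
            texts.append(rows[0][0].get("text", ""))
    return texts


# Hand-made sample in the assumed shape: one filled region, two empty ones.
sample = [
    {"data": [[{"text": "Region A"}]]},
    {"data": []},
    {"data": [[]]},
]
print(first_cell_texts(sample))  # ['Region A']
```

In the loop from the question, this would stand in for the inner for loop: regions = first_cell_texts(regions_raw).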

  • However, the error still occurs, but it can be passed silently: try: regions_raw = tb.read_pdf(filename, pages=pages,area=[box],output_format="json") except OSError: pass – Andreas Theil Oct 26 '21 at 07:12
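One caveat with a bare pass: if the read fails, regions_raw keeps whatever it held from the previous file. Below is a rough sketch of the same try/except that falls back to an empty list instead. The wrapper name read_regions is hypothetical, and the reader function is passed in as an argument so the sketch does not depend on tabula being installed. That OSError is the exception to catch is taken from the comment above, not verified.

```python
def read_regions(read_pdf, filename, **kwargs):
    """Call the given reader and return [] instead of raising on OSError."""
    try:
        return read_pdf(filename, **kwargs)
    except OSError as exc:
        # Report the failure rather than hiding it completely, and return
        # an empty result so stale data from an earlier file is not reused.
        print(f"Skipping {filename}: {exc}")
        return []
```

With tabula imported as tb, as in the question, the call would look like regions_raw = read_regions(tb.read_pdf, filename, pages=pages, area=[box], output_format="json").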