I want to parse a PDF file with pdfminer and tabula

I read this question and am using this code:

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument

import magic
from pyPdf import PdfFileWriter, PdfFileReader
import tabula
import numpy as np
filename = '/home/parser/test.pdf'
magic.from_file(filename,mime=True)

ifpdf = PdfFileReader(file(filename, "rb"))

pdf_info = ifpdf.getDocumentInfo()

nm = [ 'Info_1', 'Info_2','Info_3','Info_4']
df = tabula.read_pdf(filename,pages="all",lattice="all",pandas_options={'header': None,'names':nm,'encoding':'utf-8'})

df.refenseigne.replace(to_replace=r"(M|C)\r",value="",regex=True,inplace=True)
df.to_csv("test.csv",encoding="utf-8")

When I execute my code, I get this error:

Traceback (most recent call last):
  File "parse_pdf.py", line 16, in <module>
    df = tabula.read_pdf(filename,pages="all",lattice="all",pandas_options={'header': None,'names':nm,'encoding':'utf-8'})
  File "/usr/local/lib/python2.7/dist-packages/tabula/wrapper.py", line 87, in read_pdf
    output = subprocess.check_output(args)
  File "/usr/lib/python2.7/subprocess.py", line 567, in check_output
    process = Popen(stdout=PIPE, *popenargs, **kwargs)
  File "/usr/lib/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/usr/lib/python2.7/subprocess.py", line 1343, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory

What's weird is that on lines 9 and 11 the file is found, but on line 16 I get this error.

Am I wrong or is it a tabula problem?

parik
  • Does it work with only tabula-py? To narrow down the root cause, you can write minimal code and then add the other stuff back. – chezou Sep 11 '18 at 11:51
  • @chezou no, it doesn't work with tabula-py alone. I already wrote the minimal code! – parik Sep 11 '18 at 12:04
  • I tried to run it without magic and it works almost fine until read_pdf. Just to confirm, do you mean that even the simplest code, like `import tabula; tabula.read_pdf(filename)`, doesn't work? Or do you mean tabula-py doesn't work together with pdfminer? – chezou Sep 11 '18 at 12:10
  • FYI, the latest code on the master branch adds handling of file-like objects and pathlib paths. I hope it works fine for your case. – chezou Sep 11 '18 at 12:12
  • @chezou the part that doesn't work is tabula.read_pdf: it can't find the PDF file. It was the same for my co-workers. – parik Sep 11 '18 at 12:16
  • I can't reproduce your issue. https://gist.github.com/chezou/15b8a7a408808b3e9386f724c36653d9 If I add the pdfminer import, it also works. I guess your file path might include a space or a special character. Could you try the latest version? It can handle file-like objects like this: https://github.com/chezou/tabula-py/blob/master/tests/test_read_pdf_table.py#L48-L49 – chezou Sep 11 '18 at 13:36
  • I will try the latest version and let you know; the last time I tried was a month ago. As for the path, I'm not stupid :) – parik Sep 11 '18 at 13:42
  • Note that the latest version is not published on PyPI yet. Please install it via `pip install git+https://github.com/chezou/tabula-py` – chezou Sep 11 '18 at 13:46

1 Answer

I faced the same issue on Ubuntu.

First, check the versions of the JRE and JDK installed on your machine by running `java -version` and `javac -version` (the `--version` spelling is only accepted by Java 9 and later). Each should report a version greater than 7.
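The traceback in the question is raised inside `subprocess`, which suggests that the missing file may be the `java` executable that tabula launches rather than the PDF itself. Here is a minimal sketch for checking that from Python 3, assuming `java` is expected to be on your `PATH`. The helper names `find_java` and `java_version_text` are made up for this illustration and are not part of tabula.

```python
import shutil
import subprocess


def find_java(executable="java"):
    """Return the full path of the Java launcher, or None if it is not on PATH."""
    return shutil.which(executable)


def java_version_text(java_path):
    """Run `<java_path> -version` and return what it prints.

    The JVM writes its version banner to stderr, so stderr is merged
    into stdout before the output is captured.
    """
    result = subprocess.run(
        [java_path, "-version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    java_path = find_java()
    if java_path is None:
        # Assumption: this is the situation behind the OSError in the question.
        print("java was not found on PATH; install a JRE/JDK first")
    else:
        print("java found at", java_path)
        print(java_version_text(java_path))
```

If this reports that `java` was not found, installing a JRE (or fixing `PATH`) is probably the first thing to try before looking at the PDF path.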

Then use pip3 to install tabula-py (`pip3 install tabula-py`). Note that the PyPI package named `tabula` is a different project.

After that, it started reading the file but showed the following warning:

WARNING: Using fallback font 'LiberationSerif' for 'TimesNewRomanPSMT'
May 10, 2019 12:36:29 PM org.apache.pdfbox.pdmodel.font.PDTrueTypeFont 
Ganesh Kharad