
I'm trying to extract tables from some PDF files with tabula (Python).

With some PDF files I get the error below.

tables = read_pdf(file_path, pages = 'all')
Error from tabula-java:
Error: File does not exist


Traceback (most recent call last):

  Input In [71] in <cell line: 1>
    tables = read_pdf(file_path, pages = 'all')

  File ~\anaconda3\lib\site-packages\tabula\io.py:322 in read_pdf
    output = _run(java_options, kwargs, path, encoding)

  File ~\anaconda3\lib\site-packages\tabula\io.py:80 in _run
    result = subprocess.run(

  File ~\anaconda3\lib\subprocess.py:516 in run
    raise CalledProcessError(retcode, process.args,

CalledProcessError: Command '['java', '-Dfile.encoding=UTF8', '-jar', 'C:\\Users\\xxx\\anaconda3\\lib\\site-packages\\tabula\\tabula-1.0.5-jar-with-dependencies.jar', '--pages', 'all', '--guess', '--format', 'JSON', 'C:/Users/xx/yyy/Invoice/75211-INV-1180235.PDF']' returned non-zero exit status 1.

It seems to be an error from Java, but I can still extract DataFrames from other PDF files without any problem.

I also tried extracting the tables with tabula.exe (which runs in the browser at http://127.0.0.1:8080). It works fine with all PDF files, including the file that fails when run from code.

Update: print log

print(file_path)  # 1. print the file-path before using tabula on it
# 2a. the try-except block can catch error output
try:
    tables = read_pdf(file_path, pages = 'all')
except Exception as e:
    print(e)  # 2b. print the error-output or exception
C:/Users/quock/tapetco/Kinh Doanh - Documents/Chứng Từ/Foreign Airports/AEG/Invoice/error/75211-INV-1180235.PDF
Error from tabula-java:
Error: File does not exist


Command '['java', '-Dfile.encoding=UTF8', '-jar', 'C:\\Users\\xxx\\anaconda3\\lib\\site-packages\\tabula\\tabula-1.0.5-jar-with-dependencies.jar', '--pages', 'all', '--guess', '--format', 'JSON', 'C:/Users/xx/yyy/Invoice/75211-INV-1180235.PDF']' returned non-zero exit status 1.

I also added the PDF files: 75211-INV-1180235.pdf produces the error, APAG_20170615.pdf works fine.

PDF file that produced the error

regulus
  • What is the output of the equivalent java command when run on your command-line - without python? – hc_dev Sep 11 '22 at 14:07
  • Did you try using Tabula python-wrapper [tabula-py](https://pypi.org/project/tabula-py/) like in [this example](https://stackoverflow.com/questions/59746275/reading-pdf-file-using-tabula-in-python) ? – hc_dev Sep 11 '22 at 14:11
  • If there is any error, please post the full error-output (including stacktrace) from your console or where your python code was executed. – hc_dev Sep 11 '22 at 14:28
  • where is the code that produced the error ? https://stackoverflow.com/help/minimal-reproducible-example – D.L Sep 11 '22 at 14:30
  • @D.L that is the problem. The code is very simple: tables = read_pdf(file_path, pages = 'all'). But some PDF files work and some do not. – regulus Sep 11 '22 at 14:38
  • You mean some, but not all? So are you able to provide a working example and a failing example? – D.L Sep 11 '22 at 14:42
  • @hc_dev thank you. But could you advise what the java-command i will put in the terminal ? So sorry but, i don't know anything in java. :( – regulus Sep 11 '22 at 14:42
  • @D.L updated files (one works fine and one raised error). Thank you – regulus Sep 11 '22 at 15:11

1 Answer


Try debugging with print

Try debugging your script:

  1. print-log the files that you pass as argument to tabula
  2. print-log the output from tabula
print(file_path)  # 1. print the file-path before using tabula on it
# 2a. the try-except block can catch error output
try:
    tables = read_pdf(file_path, pages = 'all')
except Exception as e:
    print(e)  # 2b. print the error-output or exception

The error mentioned suggests that the file passed as argument to tabula does not exist:

Error from tabula-java: Error: File does not exist
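Before calling tabula it can help to record what Python itself sees for the path. The following is a minimal sketch, not part of tabula-py: `describe_path` is a hypothetical helper name, and the file name in the usage part is only a placeholder. It reports whether the path exists, whether it is a regular file, and which non-ASCII characters it contains, so that a working path and a failing path can be compared.

```python
from pathlib import Path


def describe_path(file_path):
    """Summarize what Python sees for a path before it is handed to tabula."""
    path = Path(file_path)
    return {
        "exists": path.exists(),    # True if the path is visible from Python
        "is_file": path.is_file(),  # False for directories or missing paths
        # Characters outside ASCII, useful when comparing working and failing paths
        "non_ascii": sorted({ch for ch in str(path) if ord(ch) > 127}),
    }


if __name__ == "__main__":
    # Placeholder path: replace it with the file that fails.
    print(describe_path("75211-INV-1180235.PDF"))
```

If `exists` is True here while tabula-java still reports "File does not exist", the path is probably being changed somewhere between Python and Java, which narrows down where to look.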


Reproduced with 2 given files

I installed tabula-py using pip3 install tabula-py and prepared this script to run each of the given files against tabula:

Script SO_tabula.py:

import sys
import tabula

# Expect the path of the PDF as the first command-line argument
if len(sys.argv) < 2:
    print('Missing required argument. Usage: python3 SO_tabula.py <PDF>')
    sys.exit(1)

pdf = sys.argv[1]
print(f"Extracting tables from '{pdf}' using tabula-py with option 'pages=all'..")

try:
    # Read pdf into list of DataFrame
    dfs = tabula.read_pdf(pdf, pages='all')
    print(f"Result:\n{dfs}")
except Exception as e:
    # Print whatever tabula-py raised and signal failure to the shell
    print(f"Error from tabula-py: {e}")
    sys.exit(1)

For the 2 given files it worked without errors:

❯ python3 SO_tabula.py 75211-INV-1180235.PDF
Extracting tables from '75211-INV-1180235.PDF' using tabula-py with option 'pages=all'..
Result:
[                     Unnamed: 0     Invoice #     1180235
0                           NaN  Invoice Date  11/29/2021
1                           NaN         Terms       NET15
2                           NaN      Due Date  12/14/2021
3                           NaN      Currency         USD
4            SERVICE LOCATION :    Customer #       75211
5  Airport: VOMM  - CHENNAI, IN          Page           1, Empty DataFrame
Columns: [No, Trans.Date, Item Desc, Ref. #, Equip. ID, Flight #, Qty, UOM, Unit Price, Extended Price]
Index: []]

Output for second PDF truncated:

❯ python3 SO_tabula.py APAG_20170615.pdf
Extracting tables from 'APAG_20170615.pdf' using tabula-py with option 'pages=all'..
Result:
[   Unnamed: 0   ...

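For a made-up (non-existent) file it showed a similar file-not-found error. Note that this is a Python-style `[Errno 2]` message, which differs from the tabula-java message reported in the question: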

❯ python3 SO_tabula.py APAG_20170615.pdf_
Extracting tables from 'APAG_20170615.pdf_' using tabula-py with option 'pages=all'..
Error from tabula-py: [Errno 2] No such file or directory: 'APAG_20170615.pdf_'

Further analysis

Assuming all the given files exist and can be accessed from your script, there seems to be an issue within tabula itself or its Python wrapper.
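One difference between the two test setups is the path: the failing path printed in the question contains non-ASCII characters (`Chứng Từ`), while the reproduction above used plain file names. Whether this causes the error is not confirmed here, it is only an assumption worth testing. A minimal sketch of such a test copies the PDF to a temporary directory under a plain ASCII name and reads the copy instead. `copy_to_ascii_temp` is a hypothetical helper, not part of tabula-py, and the temporary directory is only ASCII if the system's temp location is.

```python
import shutil
import tempfile
from pathlib import Path


def copy_to_ascii_temp(file_path):
    """Copy a PDF into a fresh temporary directory under the ASCII name 'input.pdf'."""
    source = Path(file_path)
    temp_dir = Path(tempfile.mkdtemp(prefix="tabula_"))
    target = temp_dir / "input.pdf"
    shutil.copyfile(source, target)  # copies the content only, not the original name
    return target


# Usage (assumes tabula-py is installed and file_path points to the failing PDF):
#   import tabula
#   tables = tabula.read_pdf(str(copy_to_ascii_temp(file_path)), pages="all")
```

If the copy can be read, the original path is the likely culprit. If it fails in the same way, the path can be ruled out. The temporary directory is not cleaned up by this sketch and should be deleted afterwards.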

To analyze this further, I would usually look into tabula's logs or search for a command-line option (either in tabula.jar or in tabula-py) that shows verbose debugging output. But I didn't find any such option.
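Without a verbose option, one way to see what tabula-java itself prints is to run the same command that appears in the `CalledProcessError` message, but outside of tabula-py. The sketch below rebuilds that argument list and captures the exit code and both output streams. The function names are hypothetical and both paths are placeholders that have to be copied from your own traceback.

```python
import subprocess


def build_tabula_command(jar_path, pdf_path):
    """Rebuild the argument list shown in the CalledProcessError message."""
    return [
        "java", "-Dfile.encoding=UTF8", "-jar", jar_path,
        "--pages", "all", "--guess", "--format", "JSON", pdf_path,
    ]


def run_tabula_java(jar_path, pdf_path):
    """Run tabula-java directly and return (exit code, stdout, stderr)."""
    result = subprocess.run(
        build_tabula_command(jar_path, pdf_path),
        capture_output=True,  # keep stdout and stderr instead of printing them
        text=True,            # decode the output to str
    )
    return result.returncode, result.stdout, result.stderr


# Usage (placeholders, requires java on the PATH):
#   code, out, err = run_tabula_java(
#       "tabula-1.0.5-jar-with-dependencies.jar", "75211-INV-1180235.PDF")
#   print(code, err)
```

The same argument list, joined with spaces and with the paths quoted, can also be pasted into a terminal, which answers the question from the comments about what the equivalent java command looks like.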

hc_dev
  • Updated the print log in the question. I also updated the two PDF files (one works, one raises the error) in the Dropbox link. Could you please check them for me? Thank you so much – regulus Sep 11 '22 at 15:09