
I have a folder containing four PDFs. Each PDF contains exactly one table; in three of the PDFs the table spans multiple pages, while the fourth is a single page consisting of just its table. I'm using Tabula to extract the tables from the PDFs into DataFrames that are eventually written out as JSON files.

For two of the PDFs, extraction and JSON output work fine. The other two, however, produce this message:

Reading PDF File: Zeke Allocations (April 2020)
The output file is empty
Reading PDF File: USG Allocations (December 2020)
The output file is empty

This is my code:

import os.path
import glob
import json

import tabula

file_path = "this is the file path"
filename = os.path.splitext(os.path.basename(file_path))[0]
print(f"Processing folder: {filename}")
pdf_files = glob.glob(file_path + '/*.pdf')
print(f"Found {len(pdf_files)} PDF file/s in directory: {file_path}")


for i, files in enumerate(pdf_files):
    pdf_name = os.path.splitext(os.path.basename(files))[0]
    print(f"Reading PDF File: {pdf_name}")
    dfs = tabula.read_pdf(files, pages='all', multiple_tables=False,
                          stream=True, pandas_options={'header': None})
    dfs_list = []
    for df in dfs:
        df = df.dropna()
        print(f"Found {len(dfs)} table/s in: {pdf_name}")
        column_names = ['Index', 'Account', 'Allocation($mm)']
        df.columns = column_names
        print(df)

        df_json = df.to_dict('records')
        print(df_json)

        with open(f"{pdf_name}.JSON", 'w') as f:
            json.dump(df_json, f, indent=4)

What could I do to extract the content of the other two PDFs? All the PDFs essentially follow this format and they are text-based, not scanned:

Account Allocation
Alice 7.8
Bob 2.5
Charlie 4.0
Rabiya
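One thing worth trying when `stream=True` comes back empty is retrying the same file in Tabula's lattice mode, which detects cells from ruling lines instead of whitespace. This is only a sketch: `extract_tables` and its `reader` hook are names I made up so the fallback logic can be exercised without Java or the actual PDFs; `lattice=True` is a real `tabula.read_pdf` option.

```python
def extract_tables(path, reader=None):
    """Try Tabula's stream mode first; if no table comes back, retry in lattice mode.

    `reader` is a hypothetical injection point for testing; by default it is
    tabula.read_pdf (imported lazily so this helper runs without tabula installed).
    """
    if reader is None:
        import tabula  # deferred import: only needed when reading real PDFs
        reader = tabula.read_pdf
    dfs = reader(path, pages='all', multiple_tables=False,
                 stream=True, pandas_options={'header': None})
    if not dfs or all(df.empty for df in dfs):
        # lattice=True finds cells via ruling lines rather than whitespace,
        # which sometimes works where stream mode returns nothing
        dfs = reader(path, pages='all', multiple_tables=False,
                     lattice=True, pandas_options={'header': None})
    return dfs
```

If lattice mode also finds nothing, the next things to check would be `guess=False` or an explicit `area=` region, since Tabula's auto-detection can simply miss a table.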
  • Have you looked [at this](https://stackoverflow.com/questions/64996114)? They have the same error (although with different parameters). Anyway, I don't think we can replicate this without the files you're using (you did say it happens for some file but not others), but why are you dumping `df_json` inside the *`for df in dfs...`* loop? Then only the last `df` gets saved... would it not make more sense to use [`df=pd.concat(dfs)`](https://pandas.pydata.org/docs/reference/api/pandas.concat.html) (instead of looping through `dfs`) so that rows from all pages are included instead of just the last? – Driftr95 Mar 13 '23 at 01:12
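The commenter's `pd.concat` suggestion would look roughly like this. `dump_tables_as_json` is a name I've invented for illustration; the column names are the ones from the question.

```python
import json

import pandas as pd


def dump_tables_as_json(dfs, out_path):
    """Combine the per-page DataFrames into one table and write it as JSON.

    Calling json.dump inside the `for df in dfs` loop reopens the file in 'w'
    mode each iteration, so only the last page survives; concatenating first
    keeps the rows from every page.
    """
    combined = pd.concat(dfs, ignore_index=True).dropna()
    combined.columns = ['Index', 'Account', 'Allocation($mm)']
    records = combined.to_dict('records')
    with open(out_path, 'w') as f:
        json.dump(records, f, indent=4)
    return records
```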

0 Answers