I have a folder containing four PDFs. Each PDF contains exactly one table. In three of the PDFs the table spans multiple pages; the fourth PDF is a single page. I'm using Tabula to extract the tables into DataFrames, which are then written out as JSON files.
I can successfully extract two of the PDFs and write them out as JSON. However, the other two produce this message:

    Reading PDF File: Zeke Allocations (April 2020)
    The output file is empty
    Reading PDF File: USG Allocations (December 2020)
    The output file is empty
This is my code:

    import glob
    import json
    import os.path

    import tabula

    file_path = "this is the file path"
    filename = os.path.splitext(os.path.basename(file_path))[0]
    print(f"Processing folder: {filename}")

    # Collect every PDF in the folder
    pdf_files = glob.glob(file_path + '/*.pdf')
    print(f"Found {len(pdf_files)} PDF file/s in directory: {file_path}")

    column_names = ['Index', 'Account', 'Allocation($mm)']

    for pdf_path in pdf_files:
        pdf_name = os.path.splitext(os.path.basename(pdf_path))[0]
        print(f"Reading PDF File: {pdf_name}")
        # One combined table per PDF, stream-mode extraction, no header row
        dfs = tabula.read_pdf(pdf_path, pages='all', multiple_tables=False,
                              stream=True, pandas_options={'header': None})
        for df in dfs:
            df = df.dropna()
            print(f"Found {len(dfs)} table/s in: {pdf_name}")
            df.columns = column_names
            print(df)
            df_json = df.to_dict('records')
            print(df_json)
            # Name the output after the PDF so each file gets its own JSON
            # instead of every PDF overwriting the same folder-named file
            with open(f"{pdf_name}.JSON", 'w') as f:
                json.dump(df_json, f, indent=4)
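One thing I'm considering is retrying the failing PDFs with different extraction settings instead of only `stream=True`. Below is a minimal sketch of that idea, not something I have confirmed fixes the problem. `stream`, `lattice` and `guess` are real `tabula.read_pdf` keyword arguments, but the helper name `read_with_fallback`, the `ATTEMPTS` list and the order of the attempts are my own assumptions.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

# Option sets to try in order. The ordering is a guess, not a recommendation.
ATTEMPTS: Sequence[Dict[str, bool]] = (
    {"stream": True},                  # what my code does today
    {"lattice": True},                 # for tables drawn with ruling lines
    {"stream": True, "guess": False},  # skip automatic table-area detection
)


def _tabula_reader(pdf_path: str, **options) -> list:
    # Imported lazily so the helper can be exercised without tabula/Java.
    import tabula

    return tabula.read_pdf(
        pdf_path, pages="all", pandas_options={"header": None}, **options
    )


def read_with_fallback(
    pdf_path: str,
    reader: Callable[..., list] = _tabula_reader,
    attempts: Sequence[Dict[str, bool]] = ATTEMPTS,
) -> Tuple[list, Optional[Dict[str, bool]]]:
    """Return (tables, options) for the first attempt that finds a table."""
    for options in attempts:
        tables = reader(pdf_path, **options)
        if len(tables) > 0:  # an empty result means nothing was extracted
            return tables, options
    return [], None
```

In the loop, `dfs, used = read_with_fallback(pdf_path)` would replace the direct `tabula.read_pdf` call, and printing `used` would show which settings (if any) produced a table for each PDF.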
What can I do to extract the content of the other two PDFs? All four PDFs follow essentially this format, and they are text-based, not scanned:

| Account | Allocation |
|---|---|
| Alice | 7.8 |
| Bob | 2.5 |
| Charlie | 4.0 |
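
To make the goal concrete, here is a hand-built sketch of the records I expect for the sample rows above. It does not use Tabula at all: the `rows` list and the two key names are assumptions taken from the sample table, not from the real PDFs, and my actual output uses the three column names from my code instead.

```python
import json

# Hypothetical rows copied from the sample table above (not read from a PDF).
rows = [("Alice", 7.8), ("Bob", 2.5), ("Charlie", 4.0)]

# Same shape as DataFrame.to_dict('records'): one dict per table row.
records = [
    {"Account": account, "Allocation": allocation}
    for account, allocation in rows
]

# Serialized the same way my script writes each file.
print(json.dumps(records, indent=4))
```

For the two failing PDFs, no records are produced at all, so the problem is in the extraction step rather than in the JSON conversion.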