How to extract Table from PDF in Python?

Question

I have thousands of PDF files consisting only of tables, with this structure:

However, despite being fairly structured, I cannot read the tables without losing the structure.

I tried PyPDF2, but the data comes out completely messed up.

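import PyPDF2

# Open the PDF in binary mode and take its first page
pdfFileObj = open("pdf_file.pdf", "rb")
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)

# The text comes out as a single stream, without the table structure
print(pageObj.extractText())
print(pageObj.extractText().split('\n')[0])
print(pageObj.extractText().split('/')[0])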

I also tried Tabula, but it only reads the headers, not the content of the tables.

from tabula import read_pdf

# Option 1: reads all the headers
pdfFile1 = read_pdf("pdf_file.pdf", output_format="json")
# Option 2: reads only the first header and a few lines of content
pdfFile2 = read_pdf("pdf_file.pdf", multiple_tables=True)

Any thoughts?

Try `tabula-py`: https://pypi.org/project/tabula-py/ – ilja May 07 '19 at 07:50

score 8 · Accepted Answer · edited Feb 11 '23 at 10:35

After struggling a little bit, I found a way.

For each page of the file, I had to pass the area of the table and the column boundaries to tabula's read_pdf function.

Here is the working code:

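import pypdf
from tabula import read_pdf

# Path to the PDF (placeholder name, as in the question)
pdf_file = "pdf_file.pdf"

# Get the number of pages in the file
pdf_reader = pypdf.PdfReader(pdf_file)
n_pages = len(pdf_reader.pages)

# Each page can be read with the following call (change `pages` as needed)
table_pdf = read_pdf(
    pdf_file,
    guess=False,  # do not let tabula guess the table area
    pages=1,
    stream=True,
    encoding="utf-8",
    area=(96, 24, 558, 750),  # (top, left, bottom, right) in points
    columns=(24, 127, 220, 274, 298, 325, 343, 364, 459, 545, 591, 748),  # x coordinates of column boundaries
)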
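
To cover "each page of the file", the call above can be wrapped in a loop over n_pages. This is only a minimal sketch: the function name read_tables_per_page and its reader parameter are hypothetical (not part of tabula-py), and it assumes every page shares the same area and column layout.

```python
def read_tables_per_page(pdf_file, n_pages, area, columns, reader=None):
    """Apply the same fixed area/column layout to every page.

    Returns a dict mapping the 1-based page number to whatever `reader`
    returns for that page (tabula-py returns a list of DataFrames).
    `reader` is a hypothetical hook for passing a stand-in function;
    by default tabula's read_pdf is used (needs tabula-py and Java).
    """
    if reader is None:
        from tabula import read_pdf as reader
    tables = {}
    for page in range(1, n_pages + 1):  # tabula page numbers start at 1
        tables[page] = reader(
            pdf_file,
            guess=False,
            pages=page,
            stream=True,
            encoding="utf-8",
            area=area,
            columns=columns,
        )
    return tables
```

With the variables from the code above, this would be called as read_tables_per_page(pdf_file, n_pages, area=(96, 24, 558, 750), columns=(24, 127, ...)). If the layout differs between pages, a separate area and columns tuple per page would be needed.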

But at this stage, you're hardcoding the area & column limits, aren't you? Which doesn't allow you to find tables dynamically — Amrou, Nov 06 '22 at 10:35
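
One possible way to reduce the hardcoding, sketched here under assumptions and not taken from the answer: derive the column separators from the empty horizontal gaps between words on a page. The function infer_column_separators is a hypothetical helper. It expects (x0, x1) horizontal extents of words, which some libraries can supply (for example, pdfplumber's page.extract_words() returns dicts with "x0" and "x1" keys); whether those coordinates match the ones tabula expects in columns is an assumption to verify on your own files.

```python
def infer_column_separators(word_spans, min_gap=5.0):
    """Return x positions that separate columns on one page.

    word_spans: iterable of (x0, x1) horizontal extents of words.
    min_gap: smallest empty horizontal gap (same units as the spans)
             that is treated as a column break.
    Each separator is the midpoint of an empty gap between merged spans.
    """
    spans = sorted(word_spans)
    if not spans:
        return []
    separators = []
    current_right = spans[0][1]  # right edge of the merged run so far
    for x0, x1 in spans[1:]:
        if x0 - current_right >= min_gap:
            separators.append((current_right + x0) / 2)
        current_right = max(current_right, x1)
    return separators


# Three clusters of overlapping words give two separators
spans = [(24, 60), (30, 70), (127, 150), (130, 160), (220, 240)]
print(infer_column_separators(spans))  # [98.5, 190.0]
```

This heuristic breaks down when cells are empty in most rows or when text in one column overlaps the next, so the result should be checked before passing it to read_pdf as columns.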

score 2 · Answer 2 · ashishmishra · 2019-05-07T08:14:07.237

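Try tabula-py. Install it with pip install tabula-py, then:

from tabula import read_pdf

# Recent tabula-py versions return a list of DataFrames, one per detected table
df = read_pdf("file_name.pdf")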

edited May 07 '19 at 08:14

answered May 07 '19 at 07:56


This is the second snippet that I posted in the question. Tabula is only reading the header of the tables, not the content. When it does read the content, it only reads a few lines – fmarques May 07 '19 at 23:36

score 2 · Answer 3 · zzhapar · 2021-12-08T16:07:08.970

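Use the tabula-py library (the package name on PyPI is tabula-py, not tabula):

pip install tabula-py

Then extract the tables:

import tabula

# Placeholder: path or URL of your PDF
url = "file_name.pdf"

# This reads page 63
dfs = tabula.read_pdf(url, pages=63, stream=True)

# If you want to read all pages
dfs = tabula.read_pdf(url, pages="all")

# read_pdf returns a list of DataFrames; this selects the second table
dfs[1]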

By the way, I also tried reading PDF files another way, and it works better than tabula. I will post it soon.

edited Dec 08 '21 at 16:07

answered Sep 22 '21 at 11:24


`pip install tabula` actually installs https://github.com/ronniedada/tabula which is not what you want, try `tabula-py` – Noxeus Nov 23 '21 at 18:52
okay. thank you very much! – zzhapar Dec 08 '21 at 16:05
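
Because tabula-py is imported as tabula, the import name alone does not show which distribution pip installed. The following is a small sketch for checking this with the standard library; installed_version is a hypothetical helper name, and the two distribution names are the ones mentioned in the comment above.

```python
from importlib import metadata


def installed_version(distribution):
    """Return the installed version of a PyPI distribution, or None if absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


# "tabula-py" is the wrapper that provides read_pdf; "tabula" is a different project
for name in ("tabula-py", "tabula"):
    print(name, "->", installed_version(name))
```

If "tabula" shows a version and "tabula-py" shows None, uninstalling the former and installing tabula-py is the likely fix.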

score 1 · Answer 4 · answered May 28 '22 at 09:03

@fmarques

You could also try a new Python package, SLICEmyPDF, developed by StatCan specifically for extracting tabular data from PDFs: https://github.com/StatCan/SLICEmyPDF

In my experience, SLICEmyPDF outperforms other free Python or R packages. The catch is that it requires installing a few additional free software packages. Installation instructions can be found at:

https://dataworldofredhairedgirl.blogspot.com/2022/04/how-to-install-statcan-slicemypdf-on.html
