Extracting PDF table, Python3, tabula-py

Question

I am trying to extract a table from a PDF using Python 3.6. [pyPDF2][1] seems to be failing, and [pdfminer][2] is not compatible with Python 3.x. I found a Python wrapper for [tabula][3].

import tabula
file_list = get_pdf_list()

text = tabula.read_pdf(file_list[0])
print(text)

tabula.convert_into(file_list[0], "test.json", output_format="json")

Both read_pdf and convert_into return empty results. PyPDF2 had the same issue. No errors are raised when the code runs.

I'm starting to think it has to do with the format of my PDF. Does anyone have more experience with this? I'm trying to extract a value from a table in a PDF.

It seems you deleted the link definitions at the bottom of the question text for the references you wanted to specify: `[pyPDF2][1] ... [pdfminer][2]`. You can restore them if you like. — Claudio, Apr 20 '17 at 06:51
You might try [PDFminer/PDFminer-six for Python 3.6](https://stackoverflow.com/questions/39854841/pdfminer-python-3-5/40877143#40877143). It is not perfect, but it is worth a try. — pyano, Nov 21 '17 at 13:41
@Tadace: Try SLICEmyPDF in 1 of the answers at https://stackoverflow.com/questions/56017702/how-to-extract-table-from-pdf-in-python/72414309#72414309 — 123456, May 29 '22 at 10:23

score 1 · Answer 1 · answered Mar 16 '19 at 21:21

I hope you have already found an answer, but here is my code anyway. I also want to say that tabula is one of the better PDF table extractors, whereas I have run into a lot of issues with camelot.

Install the latest tabula-py package:

pip install tabula-py

And here is the code:

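import os
import tabula

# Work from the folder that contains the PDF (os.path.abspath alone has no effect).
os.chdir("E:/Documents/myPy/")
# pages='all' scans every page; multiple_tables=True returns a list of DataFrames.
tables = tabula.read_pdf("MyPDF.pdf", multiple_tables=True, pages='all')

# Write each table to its own Excel file: output1.xlsx, output2.xlsx, ...
for i, table in enumerate(tables, start=1):
    table.to_excel('output' + str(i) + '.xlsx', index=False)
    print(i)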

Try this out!

dataninsight · Answer 2 · 2021-11-28T01:08:38.057

0

Using tabula-py:

from tabula import convert_into
table_file = r"pdf_path"
o1_csv = r"file12.csv"
o2_csv = r"file13.csv"
df = convert_into(table_file, o1_csv, output_format='csv', lattice=False, stream=True, pages=1)
df1 = convert_into(table_file, o2_csv, output_format='csv', lattice=True, stream=False, pages=1)
print(df)
print(df1)
Output: print(df) : None
        print(df1): None

But the CSV files were not empty: convert_into writes its result to the output file and returns None.
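Since the return value says nothing about whether extraction worked, one way to check is to inspect the output file itself. This is a minimal sketch using only the standard library; the helper name count_csv_rows is hypothetical and not part of tabula-py.

```python
import csv
from pathlib import Path


def count_csv_rows(csv_path):
    """Return the number of non-empty rows in a CSV file, or 0 if the file is missing."""
    path = Path(csv_path)
    if not path.exists():
        return 0
    with path.open(newline="", encoding="utf-8") as handle:
        # A row counts only if at least one cell holds something other than whitespace.
        return sum(
            1 for row in csv.reader(handle) if any(cell.strip() for cell in row)
        )
```

Calling count_csv_rows("file12.csv") after convert_into should then give a non-zero count whenever tabula found any table rows.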

Maybe the table has no drawn boundaries to distinguish it from normal text. That is the case tabula-py's stream and lattice options are meant for:

stream: if True, tabula-py looks for the rows and columns of a table based on how the text is arranged.
lattice: if True, tabula-py looks for the ruling lines that bound the rows and columns of a table.
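If it is not known in advance which mode suits a PDF, both can be tried in turn. The following is a rough sketch rather than a definitive recipe: the function name read_tables_any_mode and its reader parameter are hypothetical, and reader exists only so the logic can be exercised without Java or a real PDF. By default it calls tabula.read_pdf with the lattice, stream, pages and multiple_tables options.

```python
def read_tables_any_mode(pdf_path, pages="all", reader=None):
    """Try lattice mode, then stream mode; return (mode, tables) for the first hit.

    Returns (None, []) if neither mode finds a table with rows.
    """
    if reader is None:
        import tabula  # needs tabula-py and a Java runtime installed

        reader = tabula.read_pdf

    for mode in ("lattice", "stream"):
        tables = reader(
            pdf_path,
            pages=pages,
            multiple_tables=True,
            lattice=(mode == "lattice"),
            stream=(mode == "stream"),
        )
        # Keep only tables that actually contain rows.
        tables = [table for table in tables if len(table) > 0]
        if tables:
            return mode, tables
    return None, []
```

For example, mode, tables = read_tables_any_mode("MyPDF.pdf") reports which mode produced output. Lattice is tried first here only as an arbitrary choice; for PDFs whose tables have no ruling lines, the stream result is the one to expect.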

edited Nov 28 '21 at 01:08

answered Nov 27 '21 at 11:20

@K J If no one specifies the stream or lattice method, the library falls back to its default. This worked for me, and it can help others. It might be a dead question, but I searched for the same thing while working, so once I found the solution I posted it to help others. And that was not an app, it is a library: that is the difference. – dataninsight Nov 28 '21 at 00:38
Well, it gives the same result if you don't specify whether to use lattice or stream, @K J. Just because we never used the library properly with its options doesn't mean the library won't work. It is all about reading each library's documentation before stating that it is not working. – dataninsight Nov 28 '21 at 00:43
@K J It is the possible fault. For your information, read the comments first. – dataninsight Nov 28 '21 at 00:46
@KJ I have provided images so you will understand better. – dataninsight Nov 28 '21 at 01:09
