4

I'm trying to extract a table from a PDF using Python 3.6. [pyPDF2][1] seems to be failing, and [pdfminer][2] is not compatible with 3.x. I found a Python wrapper for [tabula][3].

import tabula

# get_pdf_list() (not shown) returns the paths of the PDFs to process
file_list = get_pdf_list()

text = tabula.read_pdf(file_list[0])
print(text)

tabula.convert_into(file_list[0], "test.json", output_format="json")

Both read_pdf and convert_into return empty results. PyPDF2 had the same issue. No errors are raised when the code runs.

I'm starting to think it has to do with the format of my PDF. Does anyone have more experience with this? I'm trying to extract a value from a table in a PDF.

Tadace
  • 41
  • 1
  • 6
  • Where can I get Python 3.7? Or do you mean 2.7? – Claudio Apr 19 '17 at 18:49
  • 3.6.. my fault. Edited. – Tadace Apr 19 '17 at 20:38
  • It seems you deleted the link definitions at the bottom of the question text for the references you wanted to include: `[pyPDF2][1] ... [pdfminer][2]`. You can fix that too if you like. – Claudio Apr 20 '17 at 06:51
  • You might try [PDFminer/PDFminer-six for Python 3.6](https://stackoverflow.com/questions/39854841/pdfminer-python-3-5/40877143#40877143). It's not perfect, but it's worth a try. – pyano Nov 21 '17 at 13:41
  • You can try https://camelot-py.readthedocs.io. – Vinayak Mehta Nov 09 '18 at 19:38
  • @Tadace: Try SLICEmyPDF, in one of the answers at https://stackoverflow.com/questions/56017702/how-to-extract-table-from-pdf-in-python/72414309#72414309 – 123456 May 29 '22 at 10:23

2 Answers

1

I hope you have already found an answer, but here is my code anyway. In my experience tabula is one of the better PDF table extractors, whereas I ran into a lot of issues with camelot.

Install the latest tabula-py package:

pip install tabula-py

And here is the code:

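import os
import tabula

# Switch to the folder that holds the PDF (os.path.abspath alone only returns a string)
os.chdir("E:/Documents/myPy/")
# With multiple_tables=True, read_pdf returns a list of DataFrames, one per table
tables = tabula.read_pdf("MyPDF.pdf", multiple_tables=True, pages='all')

# Write each table to its own Excel file: output1.xlsx, output2.xlsx, ...
for i, table in enumerate(tables, start=1):
    table.to_excel('output' + str(i) + '.xlsx', index=False)
    print(i)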

Try this out!

0

Extracting a PDF table with Python 3 using tabula-py:

from tabula import convert_into

table_file = r"pdf_path"
o1_csv = r"file12.csv"
o2_csv = r"file13.csv"
# convert_into writes the extracted table to the output file and returns None
df = convert_into(table_file, o1_csv, output_format='csv', lattice=False, stream=True, pages=1)
df1 = convert_into(table_file, o2_csv, output_format='csv', lattice=True, stream=False, pages=1)
print(df)
print(df1)

Output:
print(df) : None
print(df1): None

But the CSV files were not empty.

Running with stream=True produced file12.csv; running with lattice=True and stream=False produced file13.csv.
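Since the return value is None, the result has to be checked by reading the written file back. Here is a minimal sketch of that using only the standard csv module. The helper names `read_rows` and `find_value` are my own, and it assumes the value you want sits in the second column of a row whose first cell is a known label, which may not match your table.

```python
import csv


def read_rows(csv_path):
    """Return the rows of a CSV file as lists of strings, skipping blank rows."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return [row for row in csv.reader(handle)
                if any(cell.strip() for cell in row)]


def find_value(rows, label, value_column=1):
    """Return the cell in value_column of the first row whose first cell equals label.

    Returns None when no row matches (hypothetical layout: label in column 0).
    """
    for row in rows:
        if row and row[0].strip() == label and len(row) > value_column:
            return row[value_column].strip()
    return None
```

For example, `find_value(read_rows("file12.csv"), "Total")` would return the cell next to a row labelled "Total", if the PDF table has such a row. An empty list from `read_rows` means tabula wrote nothing useful for that mode.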

Maybe the table has no boundaries separating it from normal text. That's where tabula-py's extraction options come in:

  1. stream: if true, detects the rows and columns of a table from the arrangement of the text
  2. lattice: if true, looks for explicit boundary lines defining the rows and columns of a table
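If you don't know in advance which mode suits a PDF, one option is to try both and keep the first that returns something. This is only a rough sketch: `extract_tables` and its `reader` parameter are names I made up (the parameter just lets you swap in a stand-in for `tabula.read_pdf`), and "first non-empty result wins" is a simple heuristic, not something tabula-py does itself.

```python
def extract_tables(pdf_path, pages="all", reader=None):
    """Try stream mode, then lattice mode; return (mode, tables) for the first
    mode that yields at least one non-empty table, or (None, []) otherwise."""
    if reader is None:
        # Imported lazily so the helper can be exercised with a stand-in reader.
        import tabula
        reader = tabula.read_pdf

    for mode in ("stream", "lattice"):
        # Enable exactly one of the two extraction modes per attempt.
        options = {"stream": mode == "stream", "lattice": mode == "lattice"}
        tables = reader(pdf_path, pages=pages, multiple_tables=True, **options)
        # Keep only tables that actually contain rows.
        non_empty = [t for t in (tables or []) if len(t) > 0]
        if non_empty:
            return mode, non_empty
    return None, []
```

Calling `mode, tables = extract_tables("MyPDF.pdf")` then tells you which mode produced output. If both come back empty, the problem is probably the PDF itself rather than the options.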
dataninsight
  • 1,069
  • 6
  • 13
  • @K J no one specified the stream or lattice method explicitly, so the library fell back to its default. This worked for me and it may help others. It might be a dead question, but I searched for the same thing while working on this, so once I had the solution I posted it to help others. And that was not the app but its library; that's the difference. – dataninsight Nov 28 '21 at 00:38
  • Well, it gives the same result if you don't specify whether to use lattice or stream, @K J. Just because the library was never used properly with its options doesn't mean it won't work. It's all about reading each library's documentation before stating that it's not working. – dataninsight Nov 28 '21 at 00:43
  • @K J that is the possible fault. For your information, read the comments first. – dataninsight Nov 28 '21 at 00:46
  • @KJ I've provided images so you can understand better. – dataninsight Nov 28 '21 at 01:09