
I'm currently experimenting with tabula-py, but every documentation sample I tried for extracting PDF data failed with the following error: returned non-zero exit status 1.

So I'm curious whether there are other ways to convert tables in a PDF to a CSV file using Python.

shoedogodo
  • Does this answer your question? [How to convert PDF to CSV with tabula-py?](https://stackoverflow.com/questions/49560486/how-to-convert-pdf-to-csv-with-tabula-py) – Red May 24 '20 at 04:36
  • @shoedogodo please provide a code snippet to inspect further. – BPDESILVA May 24 '20 at 05:22

2 Answers


An answer for tabula-py is already available on Stack Overflow and in other resources. As an alternative, try Camelot:

pip install "camelot-py[cv]"


import camelot
tables = camelot.read_pdf('X.pdf')
tables.export('X.csv', f='csv', compress=True) # compress=True also zips the output; f can also be 'json', 'excel' or 'html'

Refer to this link for more.

BPDESILVA

If you want to export tables from a PDF to CSV files using Python, the best way is to use libraries like Tabula and Camelot.

First, extract the tables from the individual pages; then use a library like pandas to export them to CSV or other required formats.
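
The two steps above could look roughly like the sketch below. It assumes Camelot is used for extraction and that each extracted table exposes a pandas DataFrame as `.df`. The helper names `export_tables` and `pdf_to_csvs`, the `stem` argument, and the output file naming are made up for illustration; they are not part of either library.

```python
from pathlib import Path


def export_tables(tables, out_dir, stem="table"):
    """Write each extracted table to its own CSV file and return the paths.

    `tables` is any iterable of objects that expose a pandas DataFrame
    as `.df` (Camelot's table objects do).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, table in enumerate(tables, start=1):
        path = out_dir / f"{stem}_{i}.csv"
        # pandas does the export. index=False drops the row index, and
        # header=False drops the numeric column labels (0, 1, 2, ...),
        # since the real header is usually the first data row.
        table.df.to_csv(path, index=False, header=False)
        paths.append(path)
    return paths


def pdf_to_csvs(pdf_path, out_dir, pages="1"):
    """Extract tables from the given pages with Camelot, then export them."""
    import camelot  # imported here so export_tables works without Camelot

    # pages accepts strings such as "1", "1,3,4" or "all"
    tables = camelot.read_pdf(str(pdf_path), pages=pages)
    return export_tables(tables, out_dir, stem=Path(pdf_path).stem)
```

Calling `pdf_to_csvs('X.pdf', 'out', pages='all')` would then write `out/X_1.csv`, `out/X_2.csv`, and so on, one file per detected table.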

However, if the documents are non-electronic (for example, scanned images), we'll have to use OCR or ML techniques to extract the tables.

Here's a blog post which has a few examples: https://nanonets.com/blog/pdf-table-to-csv/#pdf-table-extraction-to-csv-with-python

  • While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. - [From Review](/review/late-answers/29943447) – Trenton McKinney Sep 28 '21 at 17:41