Questions tagged [tabula-py]

tabula-py is a Python wrapper around tabula-java that extracts tables from PDF files into a pandas DataFrame or JSON. It can also convert the tables in a PDF into a CSV, TSV, or JSON file.

Install tabula-py using pip:

pip install tabula-py
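
A minimal sketch of the two uses described above, assuming tabula-py is installed and a Java runtime is available (tabula-java needs one). The helper names `extract_tables` and `pdf_to_csv` are hypothetical and exist only for illustration; `tabula.read_pdf` and `tabula.convert_into` are the library calls being wrapped.

```python
from pathlib import Path


def extract_tables(pdf_path, pages="all"):
    """Return the tables found in a PDF as a list of pandas DataFrames.

    Hypothetical helper: a thin wrapper around tabula.read_pdf.
    """
    # Imported lazily so this module still loads on machines where
    # tabula-py is not installed.
    import tabula

    # read_pdf returns a list of DataFrames, one per detected table.
    return tabula.read_pdf(str(pdf_path), pages=pages)


def pdf_to_csv(pdf_path, csv_path, pages="all"):
    """Write the tables found in a PDF to one CSV file and return its path.

    Hypothetical helper: a thin wrapper around tabula.convert_into.
    """
    import tabula

    tabula.convert_into(
        str(pdf_path), str(csv_path), output_format="csv", pages=pages
    )
    return Path(csv_path)
```

Calling `extract_tables("report.pdf")` (a placeholder path) would give one DataFrame per table that tabula detects. Passing `output_format="tsv"` or `"json"` to `tabula.convert_into` selects the other file formats mentioned above.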
132 questions
18
votes
8 answers

Python3 : module 'tabula' has no attribute 'read_pdf'

A .py program works, but the exact same code doesn't work when exposed as an API. The code reads the PDF with Tabula and returns the table content as output. I've tried: import tabula df = tabula.read_pdf("my_pdf") print(df) and from tabula…
Sukhi
  • 13,261
  • 7
  • 36
  • 53
4
votes
1 answer

Tabula-py read_pdf_with_template() method

I am trying to read a particular portion of a document as a table. It is structured as a table, but there are no dividing lines between cells, rows, or columns. I had success using the read_pdf() method with the area and column arguments. I…
Kunal Gehlot
  • 137
  • 1
  • 12
4
votes
2 answers

How to read tables in a PDF when there are line breaks in table cells, using Python tabula-py?

I tried to use the Python package tabula-py to read a table in a PDF. It seems that line breaks in PDF table cells split the contents of the original cell into multiple cells. I tried to search for all kinds of Python packages to solve this…
Kevin Huang
  • 51
  • 1
  • 4
3
votes
0 answers

tabula_py issue How to extract pdf table data spread in multiple pages

I am trying to extract all table data from a PDF using tabula_py as: df = tabula.read_pdf(test_pdf, stream=True, multiple_tables=True, pages="all") The PDF has 3 tables. The second table spans 2 pages. When I try len(df), it returns 4…
Sharon
  • 51
  • 3
3
votes
2 answers

Stream mode or lattice mode, which one is set as the default in the tabula-py module?

I'm wondering if anyone who is familiar with the tabula-py module for Python can help me with this question. It is not clear in any of the tabula-py documentation whether the tabula.read_pdf() function uses lattice or stream mode extraction as its…
brandwja
  • 41
  • 1
  • 3
3
votes
3 answers

Unable to execute my script when converting it to exe

I created a script to extract data from a pdf using tabula-py and PyPDF2. When I run my program through Jupyter-notebook and from the cmd, it works perfectly. After converting it to executable with pyinstaller, I get this error: Error: Unable to…
paul
  • 121
  • 1
  • 10
3
votes
2 answers

How to fix this error on tabula.read_pdf() function in Python

I am trying to extract tables from a PDF file using Python (PyCharm). I tried the following code: from tabula import wrapper object = wrapper.read_pdf("C:/Users/Ojasvi/Desktop/sample.pdf") However, the error I got…
Ojasvi Jain
  • 79
  • 1
  • 2
  • 5
2
votes
0 answers

Tabula-py not extracting tables correctly

I was building an API that uses tabula to extract a table from a PDF. I built the API on a Windows machine and deployed it on Ubuntu 20. On the Windows machine the extraction was flawless, and I was able to perform all the necessary steps. However,…
abhi
  • 337
  • 1
  • 3
  • 12
2
votes
0 answers

Data Wrangling of text extracted from PDF using PyMuPDF possible? (alternating colors for each row) - text positioned in the middle for each row

I extracted data from PDF file. I am sharing a sample of the page here. I extracted data from the PDF using Tabula-py. These are the arguments I used to extract the text from PDF page. import numpy as np import pandas as pd from tabula.io import…
Joe
  • 91
  • 6
2
votes
2 answers

Tables not detected with tabula and camelot

I tried to extract tables from PDFs that, I think, are not in a proper format. The tables in these PDFs have a table layout but are not properly enclosed with vertical borders. I'll attach the sample PDF and the output from both libraries. When I tried to…
Anshul Joshi
  • 55
  • 1
  • 7
2
votes
0 answers

List object to DataFrame | Tabula | read_pdf_with_template

Problem statement: I'm using the Tabula app's user interface to select the dimensions of a table in a PDF file and save them as a tabula template in JSON format. The DataFrame shown in the Tabula app after selecting the table dimensions and extracting the table is…
2
votes
1 answer

Extract tables from multi-column pdf using Python

I have a pdf in the following format Lorem ipsum dolor sit amet, consectetur |Table 2 | adipiscing elit. Praesent in tortor consequat, |+---------------------------------------------+| rutrum dolor…
Eagle
  • 318
  • 4
  • 16
2
votes
1 answer

How to read table spread across multiple pages, using tabula_py or camelot

I am using tabula_py to read tables in a PDF. Some are big, and in many cases a table spans more than one page. The issue is that tabula_py treats each page as a new table instead of reading it as one large table. I have the same issue with Camelot.
Sharon
  • 51
  • 3
2
votes
0 answers

Extract complete table from PDF using tabula in python

I have a PDF with the table in the below format, column names and data are separated by "--------" col1 col2 col3 col4 col5 col6 col7 col8 col9 col10 col11 col12 col13 ---------------------------------------------------------------------- B ABC1 …
2
votes
1 answer

How to read pdf table in Flutter

In Python, tabula-py can be used to extract tables from a PDF file. Is there a way to do the same within a Flutter app?
user730376
  • 33
  • 4