No tables found and merged column text when extracting data from this PDF using Camelot

Question

I get a UserWarning: No tables found on page-1 when I try to extract tables from the attached PDF. However, when I looked at the extracted data, some of the column text was merged into a single column.

I am using Camelot to parse these PDFs.

Steps to reproduce: camelot --output m27.csv --format csv stream m27.pdf

Here is a link to PDF that I am trying to parse: https://github.com/tabulapdf/tabula-java/blob/master/src/test/resources/technology/tabula/m27.pdf

Try SLICEmyPDF in 1 of the answers at https://stackoverflow.com/questions/56017702/how-to-extract-table-from-pdf-in-python/72414309#72414309 — 123456, Jul 02 '22 at 07:46

Vinayak Mehta · Accepted Answer · 2018-11-09T19:21:24.413

A PDF just contains instructions to place a character at an x,y coordinate on a 2-D plane, retaining no knowledge of words, sentences or tables.

Camelot uses PDFMiner under the hood to group characters into words and words into sentences. When characters are placed too close together, PDFMiner can group characters that belong to different words into a single word.
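
To make the merging concrete, here is a toy sketch of gap-based grouping. It is not PDFMiner's actual algorithm, which is more involved; the function name `group_chars`, the `max_gap` threshold and the character coordinates are all made up for illustration. Characters are joined into one word whenever the horizontal gap between neighbours is at most the threshold.

```python
def group_chars(chars, max_gap=2.0):
    """Group (text, x0, x1) characters into words by horizontal gap.

    Toy illustration only; this is NOT how PDFMiner is implemented.
    """
    words = []
    current = ""
    prev_x1 = None
    for text, x0, x1 in sorted(chars, key=lambda c: c[1]):
        # Start a new word when the gap to the previous character is large.
        if prev_x1 is not None and x0 - prev_x1 > max_gap:
            words.append(current)
            current = ""
        current += text
        prev_x1 = x1
    if current:
        words.append(current)
    return words


# Two cells, "AB" and "CD", with a wide gap between the columns.
well_spaced = [("A", 0, 5), ("B", 6, 11), ("C", 30, 35), ("D", 36, 41)]
# The same cells, but the column gap is as small as the letter gap.
too_close = [("A", 0, 5), ("B", 6, 11), ("C", 12, 17), ("D", 18, 23)]

print(group_chars(well_spaced))
print(group_chars(too_close))
```

In the second case nothing in the spacing distinguishes the column boundary from an ordinary letter gap, so the two cells come out as one word.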

Since the characters in your PDF table are placed very close together, they are being merged into a single word, and hence Camelot isn't able to detect the columns correctly. You can specify column separators to get the table out in this case. To get the x-coordinates of the column separators, you can check out the visual debugging guide. Additionally, you can specify split_text=True to cut a word along the column separators you've specified. Here's the code (I got the x-coordinates by creating a matplotlib plot of the text in the PDF using $ camelot stream -plot text m27.pdf):

Using CLI:

$ camelot --output m27.csv --format csv -split stream -C 72,95,209,327,442,529,566,606,683 m27.pdf

Using API:

>>> import camelot
>>> tables = camelot.read_pdf('m27.pdf', flavor='stream', columns=['72,95,209,327,442,529,566,606,683'], split_text=True)
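
As a rough picture of what cutting a word along column separators means, the sketch below assigns each character of a merged word to a column based on its x midpoint. This is only an illustration under made-up coordinates; `split_at_columns` is a hypothetical helper, not part of Camelot, and Camelot's own implementation may differ.

```python
from bisect import bisect_right


def split_at_columns(chars, separators):
    """Assign (text, x0, x1) characters to columns by their x midpoint.

    Hypothetical helper for illustration; not Camelot's implementation.
    """
    seps = sorted(separators)
    # n separators define n + 1 columns.
    cells = [""] * (len(seps) + 1)
    for text, x0, x1 in sorted(chars, key=lambda c: c[1]):
        mid = (x0 + x1) / 2
        # bisect_right counts how many separators lie at or left of the midpoint.
        cells[bisect_right(seps, mid)] += text
    return cells


# A merged word "ABCD" that really spans two columns.
merged = [("A", 0, 5), ("B", 6, 11), ("C", 12, 17), ("D", 18, 23)]

print(split_at_columns(merged, [11.5]))
```

With a separator at x = 11.5, the merged word is cut back into the two cells it came from.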

I have a similar problem, but my PDF is an image of a table, and each line in this large table has a different size. I am also getting the warning `UserWarning: No tables found on page-1`. Any idea how to solve it? Do you think it's connected with those line sizes? — sygneto, Mar 16 '20 at 17:00
