4

As mentioned in the Camelot documentation, we can extract a table from a particular region like this:

tables = camelot.read_pdf('table_regions.pdf', table_regions=['170,370,560,270'])

But how can I find these regions for my own PDF?

caner

3 Answers

2

I know it's a late reply - but I just came across a possible solution.

If you're looking for an automated extraction method, you could use the lattice flavor in a first pass, read the table boundaries from tables[0]._bbox, and pass those numbers as the table_areas argument in a second call to camelot.read_pdf().

Be aware that the values come in an unusual order for a bounding box, so check them against the order table_areas expects before passing them on.
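The two-pass idea could be sketched roughly as follows. This is only an illustration, not Camelot's own recipe: bbox_to_table_area and reextract_detected_tables are hypothetical helper names, _bbox is a private attribute that may change between versions, and the sketch assumes _bbox holds four numbers (x, y, x, y) for two opposite corners. The choice of "stream" for the second pass is likewise just an assumption. The helper sorts the values instead of trusting their order, so it yields the top-left, bottom-right form that table_areas expects either way.

```python
def bbox_to_table_area(bbox):
    """Turn (x, y, x, y) for two opposite corners into an 'x1,y1,x2,y2' string."""
    xa, ya, xb, yb = bbox
    left, right = min(xa, xb), max(xa, xb)
    bottom, top = min(ya, yb), max(ya, yb)
    # PDF space has its origin at the bottom-left, so the top edge has the larger y.
    # table_areas wants top-left first, then bottom-right.
    return f"{left},{top},{right},{bottom}"


def reextract_detected_tables(pdf_path, page="1", second_flavor="stream"):
    """Detect tables with lattice, then re-read the same areas (hypothetical helper)."""
    import camelot  # imported lazily so the helper above works without Camelot

    # First pass: let lattice find the tables on a single page.
    detected = camelot.read_pdf(pdf_path, pages=page, flavor="lattice")

    # Assumption: each table exposes its boundaries through the private _bbox.
    areas = [bbox_to_table_area(table._bbox) for table in detected]
    if not areas:
        return detected  # nothing detected, so there is nothing to re-read

    # Second pass: restrict extraction to the detected areas.
    return camelot.read_pdf(
        pdf_path, pages=page, flavor=second_flavor, table_areas=areas
    )
```

For example, bbox_to_table_area((170, 270, 560, 370)) gives '170,370,560,270', the same string used in the question.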

1

You can detect these regions with some visual debugging.

https://camelot-py.readthedocs.io/en/master/user/advanced.html#visual-debugging

0

If you just want to detect the table region you are reading, try this in a Jupyter Notebook:

  1. Define the table region in the .read_pdf call: tables = camelot.read_pdf('table_regions.pdf', table_regions=['170,370,560,270'], flavor='lattice'). Pay attention to the flavor, because it depends on whether the table has border lines or not (lattice for tables with borders, stream for tables separated by whitespace).
  2. Plot the result with camelot-py and matplotlib: camelot.plot(tables[index], kind='contour'). (You can see how many tables the object holds by evaluating its name, e.g. tables, in an .ipynb cell. contour is one of the visual debugging plot kinds.)
  3. The plot shows an image of your table with a red rectangular contour. Adjust the coordinates in step 1 and repeat step 2 until the contour covers the table region you want to extract.
  4. To check that the data is correct, inspect tables[index].df.
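The steps above can be wrapped in two small helpers so that each attempt is a single call. This is a rough sketch: format_region and preview_region are hypothetical names, not part of Camelot, and the ordering check assumes the region string follows the same top-left, bottom-right convention that the documentation describes for table_areas.

```python
def format_region(x1, y1, x2, y2):
    """Build an 'x1,y1,x2,y2' string, with (x1, y1) top-left and (x2, y2) bottom-right."""
    # In PDF space the origin is the bottom-left corner of the page,
    # so the top-left corner has the smaller x and the larger y.
    if not (x1 < x2 and y1 > y2):
        raise ValueError("expected (x1, y1) top-left and (x2, y2) bottom-right")
    return f"{x1},{y1},{x2},{y2}"


def preview_region(pdf_path, region, flavor="lattice"):
    """Read one region and draw a contour plot for every table found in it."""
    import camelot  # imported lazily; plotting also needs matplotlib installed

    tables = camelot.read_pdf(pdf_path, table_regions=[region], flavor=flavor)
    # camelot.plot returns a matplotlib figure; a notebook cell displays it.
    figures = [camelot.plot(table, kind="contour") for table in tables]
    return tables, figures
```

In a notebook cell, something like tables, figures = preview_region('table_regions.pdf', format_region(170, 370, 560, 270)) runs steps 1 and 2 together. Change the numbers and rerun until the contour matches the table, then check tables[0].df.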
studTon
  • I can't figure out what the coordinates refer to. Are they corner points? edges of the box? There is no mention in the documentation of this. – skytwosea Feb 16 '23 at 22:23
  • There is a note on the Advanced Usage page of the documentation that explains this: table_areas accepts strings of the form x1,y1,x2,y2 where (x1, y1) -> top-left and (x2, y2) -> bottom-right in PDF coordinate space. In PDF coordinate space, the bottom-left corner of the page is the origin, with coordinates (0, 0). – studTon Feb 27 '23 at 17:25