I'm hoping someone familiar with the tabula-py module for Python can help me with this question. The tabula-py documentation does not make clear whether tabula.read_pdf() defaults to lattice or stream mode extraction when neither the lattice nor the stream argument is passed. Does the code somehow guess which of the two modes is preferable for each "table" it encounters in the PDF? If not, which of the two extraction modes is the default? (If there is a fixed default, one of the two arguments would seem redundant, since setting lattice to False would by definition mean setting stream to True, and vice versa.) Thanks in advance.

It's easy to set the tabula.read_pdf() mode to either lattice or stream mode extraction, so that's not my issue. I just want to know which of the two is used as the default extraction mode if I don't specify which one I want to use.

Dipesh Bajgain
brandwja
  • And sorry, just to add another part to this question, can both lattice and stream be set to True at the same time? In other words would the following expression be valid: tabula.read_pdf('test.pdf', stream=True, lattice=True) ? And, if so, how does the tabula code go about "choosing" which of the two extraction modes it should use when it encounters text in a pdf that it recognises as a "table"? – brandwja Jul 19 '19 at 11:27
  • One reason I’m asking the question is that I am also using the newer camelot module that, at least on paper, has similar functionality to tabula-py but claims to deliver superior tabular data extraction from pdf files. However with camelot, more tinkering is required to achieve optimal results; for example while the module also uses similar lattice and stream extraction modes, its camelot.read_pdf() function is set to lattice by default, so in order to do a proper comparison of the two on both quality of output and ease of use I need to know what the default extraction mode is for tabula-py. – brandwja Jul 19 '19 at 17:13

2 Answers

If I understand correctly, when neither option is given tabula-java uses its DECIDE method, which dynamically chooses between spreadsheet (lattice) and basic (stream) extraction page by page. https://github.com/tabulapdf/tabula-java/blob/21b124660a90127d2867a48db04d6412d9c4f438/src/main/java/technology/tabula/CommandLineApp.java#L258

Note that up to tabula-java 1.0.2, the guess option forced lattice mode by default. tabula-py 1.4.0 uses tabula-java 1.0.3, so you can now set guess and stream/lattice independently.
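
If you would rather not depend on the automatic choice, you can pass the mode explicitly. Below is a minimal sketch, not an official recipe: mode_kwargs and read_tables are hypothetical helper names, and 'test.pdf' is the placeholder file name from the question. It assumes a tabula-py version whose read_pdf() accepts the lattice, stream and pages keyword arguments.

```python
def mode_kwargs(mode=None):
    """Translate a mode name into explicit read_pdf keyword arguments.

    mode=None returns no mode arguments at all, which leaves the
    choice to tabula-java (the DECIDE behaviour described above,
    as far as I understand it).
    """
    if mode is None:
        return {}
    if mode not in ("lattice", "stream"):
        raise ValueError("mode must be 'lattice', 'stream' or None")
    # Set exactly one of the two flags so the call is unambiguous.
    return {"lattice": mode == "lattice", "stream": mode == "stream"}


def read_tables(path, mode=None, pages="all"):
    """Hypothetical wrapper around tabula.read_pdf with an explicit mode."""
    import tabula  # imported lazily; requires tabula-py and a Java runtime

    return tabula.read_pdf(path, pages=pages, **mode_kwargs(mode))


if __name__ == "__main__":
    # Show the keyword arguments each setting would pass to read_pdf.
    for mode in (None, "lattice", "stream"):
        print(mode, mode_kwargs(mode))
    # Example call (needs a real PDF on disk):
    # tables = read_tables("test.pdf", mode="stream")
```

Running the same file through read_tables() with mode=None, "lattice" and "stream" and comparing the results is one way to see which mode the automatic choice picked for a given page.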

chezou

The naming for parsing methods inside Camelot (i.e. Lattice and Stream) was inspired from Tabula. Lattice is used to parse tables that have demarcated lines between cells, while Stream is used to parse tables that have whitespaces between cells to simulate a table structure.


https://github.com/camelot-dev/camelot/wiki/Comparison-with-other-PDF-Table-Extraction-libraries-and-tools

You will get a better understanding from this repository.

  • While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. – Tyler2P Aug 18 '21 at 10:26
  • This doesn't answer the question at all. – skytwosea Feb 24 '23 at 03:21