Recently I've been working on table extraction, specifically with stream tables. In this post I saw that Tabula handles this kind of extraction very well. For example, when comparing Tabula vs Camelot on "budget.pdf", Tabula merges the last two columns in its output. This can be fixed with .str.split(' ', expand=True), followed by combine, join or merge to rebuild the original PDF table.
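A minimal sketch of the split-and-rejoin workaround described above, using pandas only. The dataframe contents and the column names (`item`, `merged`, `budget_2019`, `budget_2020`) are made up for illustration and are not taken from "budget.pdf"; it also assumes every merged cell holds exactly two values separated by a single space.

```python
import pandas as pd

# Hypothetical extraction result: the last two PDF columns came out as one.
df = pd.DataFrame({
    "item": ["Salaries", "Supplies"],
    "merged": ["1200 1350", "300 280"],
})

# Split each merged cell on the first space into two separate columns.
parts = df["merged"].str.split(" ", n=1, expand=True)
parts.columns = ["budget_2019", "budget_2020"]

# Drop the merged column and attach the two recovered columns.
fixed = pd.concat([df.drop(columns="merged"), parts], axis=1)
print(fixed)
```

The recovered values are still strings, so a numeric conversion such as `pd.to_numeric` would likely be needed afterwards.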

I noticed that when the gap between columns is very small, they get merged into one. In the task I'm trying to accomplish, that is very common. I don't know how robust my solution is, because in some of the examples I work on the columns are merged only in the middle of the dataframe, and I have to re-sort the columns of the whole dataframe.

I would like to know if Tabula has a tunable parameter to deal with that, like PDFMiner, where you can control the distances between values...

Chacho Fuva

1 Answer

Maintainer of Tabula here.

You can try specifying the horizontal coordinates of the column boundaries. This parameter is exposed in tabula-py in the columns= keyword argument of the read_pdf method.
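As a rough sketch of what that call might look like: the helper names `build_read_options` and `read_with_columns`, the file name, and the coordinate values below are all hypothetical. Passing `stream=True` and `guess=False` alongside `columns=` is an assumption about a sensible combination, not something stated in the answer. The coordinates are x positions in PDF points, which you would measure on your own document.

```python
def build_read_options(column_edges, pages="1"):
    """Build keyword arguments for tabula.read_pdf (hypothetical helper)."""
    # Column boundaries are x coordinates; keep them sorted left to right.
    edges = sorted(float(x) for x in column_edges)
    return {
        "pages": pages,
        "stream": True,    # stream mode: columns inferred from whitespace
        "guess": False,    # assumption: skip automatic area detection
        "columns": edges,  # explicit column boundaries
    }


def read_with_columns(pdf_path, column_edges, pages="1"):
    """Read tables using explicit column boundaries (hypothetical helper)."""
    import tabula  # needs tabula-py and a Java runtime installed
    return tabula.read_pdf(pdf_path, **build_read_options(column_edges, pages))


# Made-up boundary values; replace them with positions measured on your PDF.
options = build_read_options([300, 120.5, 410])
print(options)
# tables = read_with_columns("budget.pdf", [120.5, 300, 410])
```

With the default settings `read_pdf` returns a list of dataframes, one per detected table.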

  • If I don't know the exact location of the columns, does ```tabula-py``` have an option to plot the PDF page, like ```camelot.plot(tables[0], kind='grid').show()```? – Chacho Fuva Aug 18 '20 at 20:37
  • You can find out the position of elements on the page with the Measure tool in Acrobat Reader. Here's more information: https://stackoverflow.com/questions/45457054/tabula-extract-tables-by-area-coordinates/45516398#45516398 – Manuel Aristarán Aug 18 '20 at 23:00