Recently I've been working on table extraction, specifically with stream tables. In this post I saw that Tabula handles this kind of extraction very well.
For example, in the comparison of tabula vs camelot on "budget.pdf", Tabula merges the last two columns in its output. This can be fixed with .str.split(' ', expand=True) on the merged column, and then combine, join or merge can be used to rebuild the original PDF table.
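A minimal sketch of that split-and-rejoin fix, assuming pandas. The frame and its column names are made up to imitate an extraction where the last two columns were fused; it is not the real "budget.pdf" output.

```python
import pandas as pd

# Hypothetical frame imitating an extraction in which the last two
# columns were fused into a single string column.
df = pd.DataFrame(
    {
        "item": ["Salaries", "Travel", "Equipment"],
        "2019 2020": ["100 120", "30 25", "45 60"],
    }
)

# Split the fused column on the first space; expand=True returns a
# DataFrame with one column per piece instead of a Series of lists.
parts = df["2019 2020"].str.split(" ", n=1, expand=True)
parts.columns = ["2019", "2020"]

# Drop the fused column and attach the recovered columns by index.
fixed = df.drop(columns="2019 2020").join(parts)
print(fixed)
```

The recovered values are still strings, so something like pd.to_numeric would be needed before doing arithmetic on them. This only works when every row is fused in the same way, which is the limitation I describe next.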
I noticed that when the gap between two columns is too small, they get merged into one. In the task I'm trying to accomplish, that is very common. I don't know how robust my solution is, because in some of the examples I work with the columns are merged only in the middle of the dataframe, and then I have to reorder the columns of the whole dataframe.
I would like to know whether Tabula has parameters that can be tuned to deal with this, like PDFMiner, in which you can control the distances between values...