Why does extracting tables from a converted DOCX work better than from the original PDF?

Question

I'm trying to extract tables from PDFs automatically. I know there are several libraries and methods for Java and Python, but to my surprise, the method that has worked best for me is to convert my PDF to a DOCX document and extract the tables from there (thanks to: How to get pictures and tables from .docx document using apache poi?).
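To illustrate why the DOCX side of this workflow is comparatively easy, here is a minimal sketch of table extraction from a DOCX file. It is not the Apache POI code from the linked answer; it is an assumed, standard-library-only Python equivalent that reads `word/document.xml` directly. The function name `extract_tables` is hypothetical. The point it shows is that a DOCX stores every table explicitly as `w:tbl`, `w:tr` and `w:tc` elements, so the extractor does not have to infer rows and columns from positioned text.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_tables(docx_file):
    """Return each table as a list of rows, each row a list of cell strings.

    docx_file may be a file path or a binary file-like object.
    Hypothetical helper for illustration; nested tables are returned as
    separate entries and their text is not repeated in the outer cell.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    tables = []
    for tbl in root.iter(W + "tbl"):
        rows = []
        for tr in tbl.findall(W + "tr"):
            cells = []
            for tc in tr.findall(W + "tc"):
                # A cell holds one or more paragraphs; join their text runs
                # and separate paragraphs with a newline.
                paragraphs = [
                    "".join(t.text or "" for t in p.iter(W + "t"))
                    for p in tc.findall(W + "p")
                ]
                cells.append("\n".join(paragraphs))
            rows.append(cells)
        tables.append(rows)
    return tables
```

Calling `extract_tables("converted.docx")` (the file name is a placeholder) would return one nested list per table. All of the hard work, deciding which pieces of text form a table, has already been done by whichever converter produced the DOCX.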

My question is this: given that the format conversion may lose information, why are my results better this way? Tabula hasn't been able to do better automatically. To understand this, I have looked for information (e.g. Extracting table contents from a collection of PDF files), but I'm still very confused.

P.S.: So far I have used https://github.com/thoqbk/traprange (a method based on PDFBox), How to extract table as text from the PDF using Python? (PyPDF2) and Tabula. When I get home I'm going to post code and test cases; I'm writing from my smartphone.

Well, obviously the PDF-to-DOCX conversion you used is better at recognizing the specific types of tables in your test documents than the direct PDF table extractors you tried. This may be due to the information the extractors in question use, e.g. whether they make use of tagging information. That said, you have mentioned only one PDF table extractor (tabula, which AFAIK is focused on extracting tables from PDFs without tagging), you have not named your PDF-to-DOCX converter at all, and you have not shared any test files. Thus, all we can do is guess wildly, which isn't helpful. — mkl, Apr 19 '18 at 10:12
It's true. I'm going to upload my code and my methods as soon as possible. Sorry for the inconvenience. — Jorge Galán, Apr 19 '18 at 11:34
You still don't mention which PDF-to-DOCX converter you use... — mkl, Apr 19 '18 at 15:27
@mkl For the moment, I do this manually with https://smallpdf.com/es/pdf-a-word , a website based on Solid Documents, a framework based on C++ and .NET class libraries. Perhaps this framework is simply good at recognizing PDF structure, in which case my question would be answered and I would only need an open-source equivalent. Do you know of one? Maybe iText? Thanks for your help — Jorge Galán, Apr 20 '18 at 08:26
@mkl iText doesn't work for me. I want structured data, like what https://solidframework.net/ produces (it's the official internal converter used in Adobe products). I don't know, maybe I need to give up; I can't find anything open source ... — Jorge Galán, Apr 20 '18 at 09:23

