Is there any consistent way to extract tables from PDF files? Any tools?
What I have done so far:
- I have tried out the `pdftotext` tool. It has an option to convert to HTML layout.
What is the problem with this:
- The table information is not preserved in the HTML output
- I expected `<table>` tags, but everything was under `<p>` tags.
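
For anyone who wants to reproduce the check, here is a minimal sketch of how the tags in a converter's HTML output could be counted. It assumes the HTML is available as a string; `TagCounter`, `count_tags`, and the sample markup are hypothetical illustrations, not the actual output of the tool.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how often each opening tag occurs in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        self.counts[tag] += 1


def count_tags(html_text):
    """Return a Counter mapping tag name -> number of opening tags."""
    parser = TagCounter()
    parser.feed(html_text)
    parser.close()
    return parser.counts


# Hypothetical sample resembling the situation described above:
# table rows end up as plain paragraphs, with no table markup at all.
sample = "<html><body><p>Name Qty</p><p>Apple 3</p><p>Pear 5</p></body></html>"
counts = count_tags(sample)
print("table:", counts["table"], "p:", counts["p"])
```

If the `table` count is zero while the `p` count is high, the converter has flattened the table into paragraphs, which is the behaviour I am seeing.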
Are there any markers in a PDF document that indicate table structure, like `<table>`, `<tr>` and `<td>` in HTML?
If "yes", any pointers would be helpful. If "no", a definitive statement to that effect would also be helpful.