I am looking for ways to convert PDF documents into bare, non-WYSIWYG HTML. I have examined some solutions on SO, such as pdf2htmlEX and other online converters, but they are either too old or produce HTML that is mostly geared towards WYSIWYG. I am looking for HTML that retains the document's formatting structure, such as paragraphs and lists.

For example, suppose the PDF document contains the following paragraph: [screenshot of the PDF paragraph]

The HTML output should be something like this:

<p>The maximum travel distance for ...</p>
<ol style="list-style-type: lower-alpha;">
    <li>In the case of a floor ...</li>
    <li>In a large floor area without...</li>
</ol>

Using pdf2htmlEX gives:

<div class="t m0 x3 h6 yf ff3 fs0 fc4 sc0 ls0 ws0">The <span class="_ _4"> </span>maximum <span class="_ _b"> </span>travel <span class="_ _4"> </span>distance <span class="_ _b"> </span>for <span class="_ _4"> </span>the <span class="_ _b"> </span>respective <span class="_ _4"> </span>types <span class="_ _b"> </span>of <span class="_ _4"> </span>occupancies <span class="_ _b"> </span>shall <span class="_ _4"> </span>be <span class="_ _b"> </span>not </div>
<div class="t m0 x2 h6 y10 ff3 fs0 fc4 sc0 ls0 ws0">greater than as laid down in <span class="ff4 fc5">T<span class="_ _8"></span>able 2.2A<span class="ff3 fc4"> and read in conjunction with all of the following:</span></span></div> <div class="t m0 x3 h7 y11 ff3 fs0 fc4 sc0 ls0 ws0">a. <span class="_ _3"> </span><span class="ff5">In <span class="_ _4"> </span>the <span class="_ _4"> </span>case <span class="_ _b"> </span>of <span class="_ _4"> </span>a <span class="_ _4"> </span>oor <span class="_ _b"> </span>area <span class="_ _4"> </span>designed <span class="_ _4"> </span>with <span class="_ _b"> </span>minimum <span class="_ _4"></span>two <span class="_ _4"> </span>exits, <span class="_ _b"> </span>the <span class="_ _4"></span>maximum </span></div>
...
<div class="t m0 x4 h7 y14 ff5 fs0 fc4 sc0 ls0 ws0">nearest exit, shall not exceed the limits specied in <span class="ff4">T<span class="_ _8"></span>able 2.2A<span class="ff3">.</span></span></div>
<div class="t m0 x3 h7 y15 ff3 fs0 fc4 sc0 ls0 ws0">b. <span class="_ _a"> </span><span class="ff5">In <span class="_ _7"></span>a <span class="_ _7"></span>large <span class="_ _7"></span>oor <span class="_ _7"></span>area <span class="_ _d"></span>without <span class="_ _7"></span>sub-division <span class="_ _7"></span>of <span class="_ _d"></span>rooms, <span class="_ _7"></span>corridors <span class="_ _7"></span>and <span class="_ _d"></span>so <span class="_ _7"></span>forth, <span class="_ _7"></span>the </span></div>

Using this online tool gives:

<div style="position:absolute;left:125.28px;top:278.16px" class="cls_006"><span class="cls_006">The maximum travel distance for the respective types of occupancies shall be not</span></div>
<div style="position:absolute;left:85.68px;top:291.74px" class="cls_006"><span class="cls_006">greater than as laid down in </span><span class="cls_012">Table 2.2A</span><span class="cls_006"> and read in conjunction with all of the following:</span></div>
<div style="position:absolute;left:125.28px;top:314.75px" class="cls_006"><span class="cls_006">a.</span></div>
<div style="position:absolute;left:153.57px;top:314.75px" class="cls_006"><span class="cls_006">In the case of a floor area designed with minimum two exits, the maximum</span></div>
<div style="position:absolute;left:153.57px;top:328.33px" class="cls_006"><span class="cls_006">travel distance as given in </span><span class="cls_013">Table 2.2A</span><span class="cls_006"> shall be applicable.  The maximum</span></div>
<div style="position:absolute;left:153.57px;top:341.91px" class="cls_006"><span class="cls_006">travel distance starting from the most remote point in any occupied space to the</span></div>
<div style="position:absolute;left:153.57px;top:355.48px" class="cls_006"><span class="cls_006">nearest exit, shall not exceed the limits specified in </span><span class="cls_013">Table 2.2A</span><span class="cls_006">.</span></div>
<div style="position:absolute;left:125.28px;top:378.49px" class="cls_006"><span class="cls_006">b.</span></div>
<div style="position:absolute;left:153.57px;top:378.49px" class="cls_006"><span class="cls_006">In a large floor area without sub-division of rooms, corridors and so forth, the</span></div>

As you can see, every line is a single div or span tag, which does not retain the original formatting, such as appropriate p or ol tags. The best result came from a conversion using Adobe Acrobat Pro DC, which yielded this:

<p style="padding-top: 6pt;padding-left: 28pt;text-indent: 39pt;line-height: 107%;text-align: left;">
    The maximum travel distance for the respective types ...</p>
<ol id="l5">
    <li style="padding-top: 9pt;padding-left: 96pt;text-indent: -28pt;line-height: 107%;text-align: justify;">
        <p style="display: inline;">In the case of a floor area ...</p>
    </li>
    <li style="padding-top: 9pt;padding-left: 96pt;text-indent: -28pt;line-height: 107%;text-align: justify;">
        <p style="display: inline;">In a large floor area without sub-division o...</p>
    </li>
</ol>

Is there an API that I can use to achieve exactly the same result as Adobe's? I have searched Adobe's website, but they do not offer any API that performs such a conversion.

Koh
  • Is your example PDF tagged? In that case I would assume Adobe makes use of tagging information while the other tools don't. Can you share it? – mkl Sep 21 '19 at 10:00
  • @mkl What do you mean by tagged? Sure, you can download the PDF [here](https://drive.google.com/open?id=1Rp5HN1Cgye9kL2cM7Qu3lQMAaduklYvJ). – Koh Sep 21 '19 at 10:38
  • Is it possible that you have extracted the HTML from a complete PDF but shared a partial PDF with only the page in question? When I try to export HTML from that shared PDF, Adobe Acrobat complains about a broken structure. And indeed, the tags in the PDF are utterly broken, as if this single page was extracted from a larger PDF by a program that does not support tagging. – mkl Sep 21 '19 at 11:04
  • @mkl No, the HTML I shared in the code above was extracted from this 1-page PDF. Yes, this 1-page PDF was extracted from a larger, 500-page PDF. I extracted just 1 page from those 500 pages and then opened it in Acrobat to do the export. The result of the export is what I shared above. – Koh Sep 21 '19 at 12:36
  • Hi there, any help pls? – Koh Sep 23 '19 at 06:16
  • Hi there, any help here pls? – Koh Nov 08 '19 at 03:58
  • I mostly know free PDF libraries; with them you get the result you want only for properly tagged PDFs, unless you invest a lot of time (some months) in a structural analysis of the drawn text. As mentioned above, the tagging structure hierarchy is broken in your example PDF. – mkl Nov 08 '19 at 08:11
  • @mkl Would you mind recommending some of those PDF libraries that work for properly tagged PDFs? I can give them a shot on the PDFs that are properly tagged. – Koh Nov 08 '19 at 08:43
  • Have a look at [this answer](https://stackoverflow.com/a/54983991/1729265). At the bottom there is proof-of-concept code for tagged text extraction using PDFBox; but that code does not use all information from the structure node, consider using the style information from there. Similarly you can look into the `TaggedPdfReaderTool` of iText 5. – mkl Nov 14 '19 at 12:11
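
For untagged PDFs, the "structural analysis of the drawn text" mentioned in the comments can be approximated, very roughly, by post-processing the absolutely positioned output shown in the question. The following is only a minimal sketch under stated assumptions, not what Adobe Acrobat does: it assumes each positioned div has already been reduced to a `(left, top, text)` tuple, that list markers arrive as standalone fragments such as "a." or "b.", and that a vertical gap clearly larger than the normal line spacing starts a new block. The function name `rebuild_html` and the parameter `line_gap` are hypothetical.

```python
import re
from html import escape

# Assumption: list markers are standalone fragments such as "a." or "b.".
MARKER = re.compile(r"^[a-z]\.$")


def rebuild_html(fragments, line_gap=14.0):
    """Group (left, top, text) fragments into <p> and <ol>/<li> blocks.

    line_gap is the assumed normal distance in px between two text lines;
    a gap above 1.5 * line_gap is treated as the start of a new block.
    """
    blocks = []             # [kind, text] pairs, kind is "p" or "li"
    marker_pending = False  # True right after a list marker was seen
    prev_top = None         # top coordinate of the last text fragment
    # Read top to bottom, then left to right within one line.
    for left, top, text in sorted(fragments, key=lambda f: (f[1], f[0])):
        text = text.strip()
        if not text:
            continue
        if MARKER.match(text):
            marker_pending = True   # the next fragment opens a list item
            continue
        if marker_pending:
            blocks.append(["li", text])
            marker_pending = False
        elif blocks and top - prev_top <= 1.5 * line_gap:
            blocks[-1][1] += " " + text   # continuation of the current block
        else:
            blocks.append(["p", text])
        prev_top = top

    out, in_list = [], False
    for kind, text in blocks:
        if kind == "li" and not in_list:
            out.append('<ol style="list-style-type: lower-alpha;">')
            in_list = True
        elif kind == "p" and in_list:
            out.append("</ol>")
            in_list = False
        if kind == "li":
            out.append(f"    <li>{escape(text)}</li>")
        else:
            out.append(f"<p>{escape(text)}</p>")
    if in_list:
        out.append("</ol>")
    return "\n".join(out)


if __name__ == "__main__":
    # Coordinates taken from the online tool's output in the question,
    # with the text shortened for readability.
    sample = [
        (125.28, 278.16, "The maximum travel distance ..."),
        (85.68, 291.74, "greater than as laid down in ..."),
        (125.28, 314.75, "a."),
        (153.57, 314.75, "In the case of a floor area ..."),
        (153.57, 328.33, "travel distance as given in ..."),
        (125.28, 378.49, "b."),
        (153.57, 378.49, "In a large floor area ..."),
    ]
    print(rebuild_html(sample))
```

A heuristic like this depends entirely on consistent line spacing and marker placement, so it will misfire on multi-column pages, tables, and nested lists; for properly tagged PDFs, reading the structure tree as suggested in the last comment is likely the more reliable route.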
