I am looking for a way to convert PDF documents into bare, non-WYSIWYG HTML. I have looked at solutions mentioned on SO, such as pdf2htmlEX and various online converters, but they are either too old or produce HTML geared towards WYSIWYG reproduction of the page. I am looking for HTML that retains the structure of the document (paragraphs, lists) rather than its visual layout.
For example, if the PDF document contains the following paragraph:
The HTML output should be something like this:
<p>The maximum travel distance for ...</p>
<ol style="list-style-type: lower-alpha;">
<li>In the case of a floor ...</li>
<li>In a large floor area without...</li>
</ol>
Using pdf2htmlEX gives:
<div class="t m0 x3 h6 yf ff3 fs0 fc4 sc0 ls0 ws0">The <span class="_ _4"> </span>maximum <span class="_ _b"> </span>travel <span class="_ _4"> </span>distance <span class="_ _b"> </span>for <span class="_ _4"> </span>the <span class="_ _b"> </span>respective <span class="_ _4"> </span>types <span class="_ _b"> </span>of <span class="_ _4"> </span>occupancies <span class="_ _b"> </span>shall <span class="_ _4"> </span>be <span class="_ _b"> </span>not </div>
<div class="t m0 x2 h6 y10 ff3 fs0 fc4 sc0 ls0 ws0">greater than as laid down in <span class="ff4 fc5">T<span class="_ _8"></span>able 2.2A<span class="ff3 fc4"> and read in conjunction with all of the following:</span></span></div> <div class="t m0 x3 h7 y11 ff3 fs0 fc4 sc0 ls0 ws0">a. <span class="_ _3"> </span><span class="ff5">In <span class="_ _4"> </span>the <span class="_ _4"> </span>case <span class="_ _b"> </span>of <span class="_ _4"> </span>a <span class="_ _4"> </span>oor <span class="_ _b"> </span>area <span class="_ _4"> </span>designed <span class="_ _4"> </span>with <span class="_ _b"> </span>minimum <span class="_ _4"></span>two <span class="_ _4"> </span>exits, <span class="_ _b"> </span>the <span class="_ _4"></span>maximum </span></div>
...
<div class="t m0 x4 h7 y14 ff5 fs0 fc4 sc0 ls0 ws0">nearest exit, shall not exceed the limits specied in <span class="ff4">T<span class="_ _8"></span>able 2.2A<span class="ff3">.</span></span></div>
<div class="t m0 x3 h7 y15 ff3 fs0 fc4 sc0 ls0 ws0">b. <span class="_ _a"> </span><span class="ff5">In <span class="_ _7"></span>a <span class="_ _7"></span>large <span class="_ _7"></span>oor <span class="_ _7"></span>area <span class="_ _d"></span>without <span class="_ _7"></span>sub-division <span class="_ _7"></span>of <span class="_ _d"></span>rooms, <span class="_ _7"></span>corridors <span class="_ _7"></span>and <span class="_ _d"></span>so <span class="_ _7"></span>forth, <span class="_ _7"></span>the </span></div>
Using this online tool gives:
<div style="position:absolute;left:125.28px;top:278.16px" class="cls_006"><span class="cls_006">The maximum travel distance for the respective types of occupancies shall be not</span></div>
<div style="position:absolute;left:85.68px;top:291.74px" class="cls_006"><span class="cls_006">greater than as laid down in </span><span class="cls_012">Table 2.2A</span><span class="cls_006"> and read in conjunction with all of the following:</span></div>
<div style="position:absolute;left:125.28px;top:314.75px" class="cls_006"><span class="cls_006">a.</span></div>
<div style="position:absolute;left:153.57px;top:314.75px" class="cls_006"><span class="cls_006">In the case of a floor area designed with minimum two exits, the maximum</span></div>
<div style="position:absolute;left:153.57px;top:328.33px" class="cls_006"><span class="cls_006">travel distance as given in </span><span class="cls_013">Table 2.2A</span><span class="cls_006"> shall be applicable. The maximum</span></div>
<div style="position:absolute;left:153.57px;top:341.91px" class="cls_006"><span class="cls_006">travel distance starting from the most remote point in any occupied space to the</span></div>
<div style="position:absolute;left:153.57px;top:355.48px" class="cls_006"><span class="cls_006">nearest exit, shall not exceed the limits specified in </span><span class="cls_013">Table 2.2A</span><span class="cls_006">.</span></div>
<div style="position:absolute;left:125.28px;top:378.49px" class="cls_006"><span class="cls_006">b.</span></div>
<div style="position:absolute;left:153.57px;top:378.49px" class="cls_006"><span class="cls_006">In a large floor area without sub-division of rooms, corridors and so forth, the</span></div>
As you can see, every line becomes its own div or span tag, so the original formatting, such as the appropriate p or ol tags, is not retained. The best result so far came from a conversion using Adobe Acrobat Pro DC, which yielded this:
<p style="padding-top: 6pt;padding-left: 28pt;text-indent: 39pt;line-height: 107%;text-align: left;">
The maximum travel distance for the respective types ...</p>
<ol id="l5">
<li style="padding-top: 9pt;padding-left: 96pt;text-indent: -28pt;line-height: 107%;text-align: justify;">
<p style="display: inline;">In the case of a floor area ...</p>
</li>
<li style="padding-top: 9pt;padding-left: 96pt;text-indent: -28pt;line-height: 107%;text-align: justify;">
<p style="display: inline;">In a large floor area without sub-division o...</p>
</li>
</ol>
Is there an API that I can use to achieve the same result as Adobe's? I have searched Adobe's website, and they do not offer any API that performs this conversion.
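To make the goal concrete, here is a minimal sketch of the kind of regrouping I am after, written against the absolutely positioned div output of the online tool shown above. It uses only the Python standard library. The names `PositionedLineParser` and `to_semantic_html` are my own, and the heuristic is an assumption: a line consisting only of "a.", "b." and so on starts a list item, and every line after the first marker is treated as part of the list. It also drops inline styling such as the highlighted table reference. It is not how Adobe or any existing tool does it.

```python
import re
from html import escape
from html.parser import HTMLParser


class PositionedLineParser(HTMLParser):
    """Collect (left, top, text) for each absolutely positioned <div>."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._current = None  # [left, top, text] of the div being read

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            style = dict(attrs).get("style", "") or ""
            m = re.search(r"left:([\d.]+)px;top:([\d.]+)px", style)
            if m:
                self._current = [float(m.group(1)), float(m.group(2)), ""]

    def handle_data(self, data):
        # Text inside nested <span> tags is appended to the enclosing div.
        if self._current is not None:
            self._current[2] += data

    def handle_endtag(self, tag):
        if tag == "div" and self._current is not None:
            self.lines.append(tuple(self._current))
            self._current = None


# Assumed marker format: a single lowercase letter followed by a period.
MARKER = re.compile(r"^[a-z]\.$")


def to_semantic_html(positioned_html):
    parser = PositionedLineParser()
    parser.feed(positioned_html)
    paragraph, items = [], []
    # Reading order: top to bottom, then left to right.
    for left, top, text in sorted(parser.lines, key=lambda l: (l[1], l[0])):
        text = text.strip()
        if MARKER.match(text):
            items.append([])          # a bare "a." / "b." line opens a new item
        elif items:
            items[-1].append(text)    # continuation of the current item
        else:
            paragraph.append(text)    # text before the first marker
    out = []
    if paragraph:
        out.append("<p>" + escape(" ".join(paragraph)) + "</p>")
    if items:
        out.append('<ol style="list-style-type: lower-alpha;">')
        out.extend("<li>" + escape(" ".join(item)) + "</li>" for item in items)
        out.append("</ol>")
    return "\n".join(out)


# A shortened excerpt of the online tool's output from above.
sample = (
    '<div style="position:absolute;left:125.28px;top:278.16px" class="cls_006"><span class="cls_006">The maximum travel distance for the respective types of occupancies shall be not</span></div>\n'
    '<div style="position:absolute;left:85.68px;top:291.74px" class="cls_006"><span class="cls_006">greater than as laid down in </span><span class="cls_012">Table 2.2A</span><span class="cls_006"> and read in conjunction with all of the following:</span></div>\n'
    '<div style="position:absolute;left:125.28px;top:314.75px" class="cls_006"><span class="cls_006">a.</span></div>\n'
    '<div style="position:absolute;left:153.57px;top:314.75px" class="cls_006"><span class="cls_006">In the case of a floor area designed with minimum two exits, the maximum</span></div>\n'
    '<div style="position:absolute;left:125.28px;top:378.49px" class="cls_006"><span class="cls_006">b.</span></div>\n'
    '<div style="position:absolute;left:153.57px;top:378.49px" class="cls_006"><span class="cls_006">In a large floor area without sub-division of rooms, corridors and so forth, the</span></div>\n'
)

print(to_semantic_html(sample))
```

This works for the excerpt, but it is exactly the sort of fragile, document-specific heuristic I would like to avoid writing and maintaining, which is why I am asking for an API that already produces structured output.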