I have a PDF that contains many underlines and strikethroughs in the text. I would like to be able to convert this PDF to HTML. I have tried many different tools, and all of them will sometimes catch the underlines and strikethroughs as text formatting, and at other times will convert the underlines and strikethroughs to graphics, which is (as far as I can tell) useless to me.

I would really like to know how these programs differentiate between underlines that format text and underlines that are converted to graphics, and how I might be able to access the document and capture everything as text formatting.

I may be taking the wrong approach with this and am open to any possible solutions. I think I just need to be pointed in the right direction.

Thank you in advance for any assistance.

  • You might also be interested in the ideas presented in [PDF find out if text is underlined or a table cell](http://stackoverflow.com/questions/13948853/pdf-find-out-if-text-is-underlined-or-a-table-cell). – mkl Mar 23 '13 at 22:45

1 Answer

There are no underlines and strikethroughs in PDF; there are just lines being drawn on top of text. PDF tools that detect underlines and strikethroughs will usually look for a line drawing that is close enough to the text, or apply some similar heuristic, and then add the corresponding style to the text output when converting to another format. That is why the same tool recognizes some of them as formatting and leaves others as graphics. Because it is guesswork based on geometry, this kind of approach will never work in 100% of cases.
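To give a rough idea of what such a heuristic can look like, here is a minimal Python sketch. It is not what any particular converter does. It assumes you have already extracted text bounding boxes and horizontal marks (stroked lines or thin filled rectangles) from the page with some PDF library, with normalized coordinates (`x0 <= x1`, `y0 <= y1`) in PDF user space where y grows upward. The names `TextBox`, `Mark`, and `classify`, and all the thresholds, are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextBox:
    """Bounding box of a text run (hypothetical type); y grows upward."""
    x0: float
    y0: float  # bottom edge
    x1: float
    y1: float  # top edge


@dataclass
class Mark:
    """A horizontal mark: a stroked line or a thin filled rectangle."""
    x0: float
    y0: float
    x1: float
    y1: float


def classify(text: TextBox, mark: Mark,
             max_thickness: float = 2.0,
             min_overlap: float = 0.8) -> Optional[str]:
    """Guess whether `mark` decorates `text`.

    Returns "underline", "strikethrough", or None.
    All thresholds are arbitrary assumptions and need tuning per document.
    """
    width = text.x1 - text.x0
    height = text.y1 - text.y0
    if width <= 0 or height <= 0:
        return None

    # Anything thicker than a rule is probably a box or a highlight.
    if mark.y1 - mark.y0 > max_thickness:
        return None

    # The mark must cover most of the text horizontally.
    overlap = min(text.x1, mark.x1) - max(text.x0, mark.x0)
    if overlap / width < min_overlap:
        return None

    # Vertical position of the mark relative to the text box:
    # 0.0 is the bottom edge of the text, 1.0 is the top edge.
    mid_y = (mark.y0 + mark.y1) / 2
    rel = (mid_y - text.y0) / height

    if -0.25 <= rel <= 0.15:
        return "underline"
    if 0.3 <= rel <= 0.7:
        return "strikethrough"
    return None


if __name__ == "__main__":
    word = TextBox(100, 700, 200, 712)
    print(classify(word, Mark(100, 698.5, 200, 699.5)))
    print(classify(word, Mark(100, 705.5, 200, 706.5)))
    print(classify(word, Mark(50, 690, 400, 690.5)))
```

The third call shows where this kind of test gets shaky: a table border that runs a little closer to the text would be indistinguishable from an underline, so expect to tune the thresholds and still get some misses. Wavy lines built from curves are not handled at all here.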

yms
  • That is a very good piece of information that I was lacking. Thank you. I understand that part of a PDF is something called the text stream. Does the text stream contain no formatting metadata? – Michael Blaustein Mar 22 '13 at 19:25
  • It's actually the page stream, and it just contains drawing operations, where showing text is one of these operations. There is some limited (and optional) formatting data, like font name and font size; however, underlines and strikethroughs are not part of this. – yms Mar 22 '13 at 19:57
  • Furthermore, the "lines" are not always drawn as lines. Depending on the intended visual style, they are sometimes drawn as filled rectangles or (in the case of wavy lines) as a multitude of Bézier curves. – mkl Mar 22 '13 at 22:49