6

An open-source implementation is preferred.

broundee
    I would like to know a solution for this too. PDFBox is able to do so (http://java.dzone.com/articles/converting-pdf-html-using?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+javalobby%2Ffrontpage+%28Javalobby+%2F+Java+Zone%29), but in a very limited way. – Alp May 02 '11 at 11:15

3 Answers

2

Obviously, this isn't an easy task: PDF formatting is much richer than HTML's (plus you must extract images and link them, etc.).
Plain text extraction is much simpler (although not trivial...).
In the sidebar of your question I see a similar one, "Converting PDF to HTML with Python", which points to a library (poppler, which is apparently written in C++ and could perhaps be accessed through JNI/JNA) and to a related question that offers even more answers.
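If JNI/JNA bindings are more than you need, a different route (not the one suggested above) is to call poppler's `pdftohtml` command-line tool as a subprocess. Here is a minimal sketch, assuming `pdftohtml` is installed and on the PATH; the class and method names are made up for illustration, and the exact output file naming may vary with the poppler version and the flags used.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: converts a PDF to HTML by shelling out to poppler's pdftohtml.
class PdfToHtmlCommand {

    // Builds the argument list; kept separate from execution so it can be inspected.
    static List<String> buildCommand(Path pdf, Path html) {
        List<String> cmd = new ArrayList<>();
        cmd.add("pdftohtml");      // assumption: the poppler tool is on the PATH
        cmd.add("-noframes");      // write a plain HTML page instead of a frameset
        cmd.add(pdf.toString());   // input PDF
        cmd.add(html.toString());  // requested output file
        return cmd;
    }

    // Runs the tool and returns its exit code (0 normally means success).
    static int convert(Path pdf, Path html) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdf, html))
                .inheritIO()       // forward the tool's messages to our console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: PdfToHtmlCommand <input.pdf> <output.html>");
            return;
        }
        int exitCode = convert(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("pdftohtml exited with code " + exitCode);
    }
}
```

This keeps the Java side free of native bindings, at the cost of depending on an external binary being installed on the machine.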

PhiLho
1

Try using PDFBox from the Apache Software Foundation.

dacracot
1

The only ones I know of have to be paid for:

BFO
JPedal

Kablam