Open source implementation will be preferred.
I would like to know a solution for this too. PDFBox is able to do so (http://java.dzone.com/articles/converting-pdf-html-using?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+javalobby%2Ffrontpage+%28Javalobby+%2F+Java+Zone%29), but in a very limited way. – Alp May 02 '11 at 11:15
3 Answers
Obviously, it isn't an easy task: PDF formatting is much richer than HTML's (plus you must extract the images and link them, etc.).
Simple text extraction is much simpler (although not trivial...).
In the sidebar of your question I see a similar one, "Converting PDF to HTML with Python", which points to a library (poppler, which is apparently written in C++ and can perhaps be accessed with JNI/JNA) and to a related question which offers even more answers.
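If binding to poppler through JNI/JNA turns out to be too much work, one alternative (not what the linked questions describe, just a simpler swap-in) is to launch poppler's `pdftohtml` command-line tool from Java. The sketch below assumes `pdftohtml` from poppler-utils is installed and on the `PATH`; the class name `PdfToHtmlSketch` and its methods are made up for illustration, and the tool itself decides the final output file names based on the base name you give it.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper: shells out to poppler's pdftohtml instead of using JNI/JNA.
class PdfToHtmlSketch {

    // Builds the command line. "-noframes" asks pdftohtml for a single HTML
    // document rather than a frameset.
    static List<String> buildCommand(Path pdf, Path outputBase) {
        return List.of("pdftohtml", "-noframes", pdf.toString(), outputBase.toString());
    }

    // Runs the tool and returns its exit code (0 normally means success).
    // Assumption: the pdftohtml executable can be found on the PATH.
    static int convert(Path pdf, Path outputBase) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(pdf, outputBase))
                .inheritIO() // forward the tool's messages to our console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: PdfToHtmlSketch <input.pdf> <output-base-name>");
            System.exit(2);
        }
        System.exit(convert(Paths.get(args[0]), Paths.get(args[1])));
    }
}
```

The obvious trade-off is a dependency on an external binary and a process launch per document, but it avoids writing any native glue code.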