1

I want to convert an HTML page into an MS Word document. I would like to know which APIs would be helpful, and whether there is any other way to do this. The entire page must be converted to .doc (e.g. if there is a table in the HTML page, a similar table must be created in the Word document). Apache POI does not provide an option to format the Word document to match the HTML page. I need something that can give me a completely formatted Word document.

Some of the options I am looking at are jsoup, docx4j, JasperReports, and JODConverter.

I tried parsing the HTML page using jsoup, and I get the contents of the page in my Java program. Now I need to pass these contents to a doc/docx file. Can docx4j help me produce a formatted docx file?

Please help. Thank you.

Sunmit Girme

2 Answers

1

I would go with Ashwini Raman's suggestion. It won't work in every scenario: for a complex HTML document with many images and other embedded content, Word will not do a good job. But for most cases it should be fine. Otherwise, there is a complex task ahead of you: you will have to parse your HTML document with a library such as jsoup and then use the docx4j library to create your Word document. Links to both are here:

http://www.docx4java.org/trac/docx4j

http://jsoup.org/

Even if you do it that way, the formatting might be iffy.

To answer your original question: no, there is no ready-made library that does what you are expecting. At least, I haven't come across any.
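
For the simpler route, here is a minimal sketch. It assumes the suggestion means saving the HTML itself under a .doc extension (the renaming discussed in the comments) and letting Word interpret it on open. The class name `HtmlAsDoc`, the method `saveHtmlAsDoc`, and the file name are placeholders of mine, not part of any library.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class HtmlAsDoc {

    // Writes the HTML text unchanged to the target path. No conversion happens
    // here: the file is still HTML, only the extension suggests a Word document.
    static Path saveHtmlAsDoc(String html, Path target) throws IOException {
        return Files.write(target, html.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Illustrative page with a table; in practice this would be your own HTML.
        String html = "<html><head><meta charset=\"UTF-8\"></head><body>"
                + "<table border=\"1\"><tr><td>A</td><td>B</td></tr></table>"
                + "</body></html>";
        Path out = saveHtmlAsDoc(html, Paths.get("mySample.doc"));
        System.out.println("Wrote " + out.toAbsolutePath());
    }
}
```

Because the file is still HTML, how well tables and images render depends entirely on Word, which is why this works best for simple pages.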

sethu
  • Are there any backward compatibility issues with converting docx to doc by just changing the extension? – Sunmit Girme Mar 13 '12 at 06:03
  • I just tried renaming an HTML file to .docx and it seems to work too, so you could rename it to .docx instead of .doc. But what if someone is using Office 97-2003? That might be an issue for those users, right? If you rename it to .doc, everyone can use it. If you don't have 97-2003 users, it shouldn't be a problem. – sethu Mar 13 '12 at 08:46
  • I tried renaming the file. I get these errors when I try to open the docx file: 1) The file cannot be opened because there are problems with the content. Details: The file is corrupt and cannot be opened. 2) Word found unreadable content in mySample.docx – Sunmit Girme Mar 13 '12 at 12:07
  • I came across a tool called [aspose.Words](http://www.aspose.com/community/files/72/java-components/aspose.words-for-java/default.aspx). I want something that will provide me with the same functionality. The only thing I want is that it should be open source. – Sunmit Girme Mar 14 '12 at 05:40
  • docx4j includes XHTMLImporter functionality: https://github.com/plutext/docx4j/blob/master/src/main/java/org/docx4j/convert/in/xhtml/XHTMLImporter.java which takes well-formed XML as input – JasonPlutext Jul 06 '13 at 00:59
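
On the "file is corrupt" error reported in the comments above: a .docx file is a ZIP package of XML parts, so an HTML file that has only been renamed to .docx is not a valid package. The sketch below is one way to check this from Java by looking for the ZIP signature at the start of the file. The class and method names are hypothetical, and passing the check shows only that the file is a ZIP archive, not that it is a valid Word document.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class DocxCheck {

    // Returns true if the file starts with the ZIP local-file-header
    // signature: the bytes 'P', 'K', 0x03, 0x04.
    static boolean looksLikeZip(Path file) throws IOException {
        byte[] header = new byte[4];
        int filled = 0;
        try (InputStream in = Files.newInputStream(file)) {
            // read() may return fewer bytes than requested, so loop until
            // four bytes are read or the stream ends.
            while (filled < header.length) {
                int n = in.read(header, filled, header.length - filled);
                if (n < 0) {
                    break;
                }
                filled += n;
            }
        }
        return filled == 4
                && header[0] == 'P' && header[1] == 'K'
                && header[2] == 3 && header[3] == 4;
    }

    public static void main(String[] args) throws IOException {
        // Pass the path of the file to inspect as the first argument.
        Path file = Paths.get(args[0]);
        System.out.println(file + " is a ZIP package: " + looksLikeZip(file));
    }
}
```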
-3

I found a workaround. First I get the parsed objects using jsoup and pass them to a document template. I am now looking for options that make it easy to create templates and generate the document dynamically. I have asked another question about this.
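
To illustrate the template idea only, here is a minimal sketch that assumes the template is plain text with `${name}` placeholders and that the values have already been extracted from the HTML. The `${...}` syntax, the class `TemplateFill`, and the method `fill` are assumptions made for this example. They do not belong to jsoup, docx4j, or any particular templating library.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class TemplateFill {

    // Replaces every ${key} placeholder in the template with its value.
    // Placeholders that have no entry in the map are left untouched.
    static String fill(String template, Map<String, String> values) {
        String result = template;
        for (Map.Entry<String, String> entry : values.entrySet()) {
            result = result.replace("${" + entry.getKey() + "}", entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical values, standing in for text pulled out of the parsed page.
        Map<String, String> values = new LinkedHashMap<>();
        values.put("title", "Quarterly Report");
        values.put("author", "Sunmit");
        System.out.println(fill("<h1>${title}</h1><p>By ${author}</p>", values));
    }
}
```

If the filled template ends up inside an XML-based format such as docx, the inserted values would also need XML escaping, which this sketch does not do.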

Sunmit Girme