
Possible Duplicate:
Convert Word doc to HTML programmatically in Java

I have a program that takes a .docx file and opens it as an .html file, but all I get is unreadable strings. I need the HTML of this file because I have to parse it later. When I use the method below to open the file, I get unreadable text such as: úL]iN?#tBd!?^ý ?e"0©?®??AäúsIp?¸ü?D?ÂÓâ¨\Dâ>½??Eâcr&Æl\Fâÿ2qJ?U ??IúK&þIb

    FileInputStream fileInput = null;
    BufferedInputStream myBuffer = null;
    DataInputStream dataInput = null;
    fileInput = new FileInputStream(selectedFile);
    myBuffer = new BufferedInputStream(fileInput);
    dataInput = new DataInputStream(myBuffer);
    StringBuilder nHtmlText = new StringBuilder();
    while (dataInput.available() != 0) {
        System.out.println(dataInput.readLine());
        nHtmlText.append(dataInput.readLine());
    }
    htmlText = nHtmlText.toString();

Is there some way to get a clean, readable HTML file out of this for parsing and saving?

yams
  • you can't read a `.docx` file like this. – kaysush Oct 28 '12 at 16:25
  • Where/how are you actually *converting* to HTML? All I see here is an attempt to read the binary content of a file. – jensgram Oct 28 '12 at 16:26
  • docx-files are compressed with the ZIP algorithm – Werner Kvalem Vesterås Oct 28 '12 at 16:27
  • Well, that's what I am asking, I guess; sorry for the poor description. Is there a simple and quick way to convert the .docx document to html? Are there examples? I have looked, and unfortunately there isn't much support or documentation out there. – yams Oct 28 '12 at 16:29
  • Makoto, unfortunately when I follow the link you just posted and all of the applicable links there, none of them lead to a solution. All of the links are now dead or moved and not documented. – yams Oct 28 '12 at 16:37

3 Answers

1

No.

You are reading the raw contents of a .docx file, which is not HTML but zipped XML (see here). You would need something to translate the .docx to HTML; the two formats are very different.
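To make the "zipped XML" point concrete, here is a minimal sketch that uses only the JDK's `java.util.zip` classes to open a .docx as a ZIP archive, list its parts, and read the main document part. The class name `DocxPeek` and its method names are made up for illustration. The sketch assumes Java 9+ (for `readAllBytes`), that the main part lives at `word/document.xml` as it does in files saved by Word, and that the part is UTF-8 encoded.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper class, for illustration only.
final class DocxPeek {

    // Assumption: the main document part sits at this path inside the archive.
    static final String MAIN_PART = "word/document.xml";

    /** Returns the names of all entries stored in the .docx ZIP container. */
    static List<String> listParts(Path docx) throws IOException {
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            List<String> names = new ArrayList<>();
            for (ZipEntry entry : Collections.list(zip.entries())) {
                names.add(entry.getName());
            }
            return names;
        }
    }

    /** Reads the main document part as a string (assumes UTF-8 encoding). */
    static String readMainDocumentXml(Path docx) throws IOException {
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            ZipEntry entry = zip.getEntry(MAIN_PART);
            if (entry == null) {
                throw new IOException("No " + MAIN_PART + " in " + docx);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java DocxPeek <file.docx>");
            return;
        }
        Path docx = Paths.get(args[0]);
        listParts(docx).forEach(System.out::println);
        System.out.println(readMainDocumentXml(docx));
    }
}
```

What comes out is WordprocessingML, not HTML, so a conversion step is still needed before the content can be parsed as HTML.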

PeteMz
  • So I would have to convert the docx to xml? I looked for examples and could not find much. – yams Oct 28 '12 at 16:39
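Since examples were hard to find, here is a rough sketch of what a do-it-yourself translation could look like with only the JDK: it parses `word/document.xml` with the built-in DOM parser and emits one `<p>` per WordprocessingML paragraph (`w:p`), concatenating its text runs (`w:t`). The class name `DocxToNaiveHtml` is hypothetical. The sketch assumes the transitional OOXML namespace, and it deliberately ignores styles, tables, images, lists, and hyperlinks; paragraphs nested in text boxes may appear twice. It is not a substitute for a real converter.

```java
import java.io.InputStream;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical class, for illustration only: text-only docx to HTML paragraphs.
final class DocxToNaiveHtml {

    // Namespace of WordprocessingML elements in transitional OOXML files (assumed).
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns a bare-bones HTML page with one <p> per w:p element. */
    static String toHtml(Path docx) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // needed for getElementsByTagNameNS

        Document xml;
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IllegalArgumentException("No word/document.xml in " + docx);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                xml = factory.newDocumentBuilder().parse(in);
            }
        }

        StringBuilder html = new StringBuilder("<html><body>\n");
        NodeList paragraphs = xml.getElementsByTagNameNS(W_NS, "p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            Element paragraph = (Element) paragraphs.item(i);
            // Concatenate the text of every w:t descendant of this paragraph.
            NodeList texts = paragraph.getElementsByTagNameNS(W_NS, "t");
            StringBuilder text = new StringBuilder();
            for (int j = 0; j < texts.getLength(); j++) {
                text.append(texts.item(j).getTextContent());
            }
            html.append("<p>").append(escape(text.toString())).append("</p>\n");
        }
        return html.append("</body></html>\n").toString();
    }

    /** Escapes the characters that are significant in HTML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java DocxToNaiveHtml <file.docx>");
            return;
        }
        System.out.print(toHtml(Paths.get(args[0])));
    }
}
```

For anything beyond plain paragraph text, a dedicated library is the more realistic route.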
1

Docx4j is a Java library that lets you open, read, and manipulate docx files. I've used it successfully for that in the past.

It also has the ability to export the contents of a file to HTML. You can read more here: http://www.docx4java.org/svn/docx4j/trunk/docx4j/docs/Docx4j_GettingStarted.html (Section docx to (X)HTML is about halfway down the page)

jcern
0

If you want to convert a .docx file to .html, you can't read the file directly, because it is a binary file. You can use JODConverter for this. I haven't used it personally, but this question is a near duplicate of this question.

kaysush
  • That looks like it does do some html conversion with limitations, I will check that out. – yams Oct 28 '12 at 16:41