I have been tasked with finding a way to convert a large number of .docx files to DocBook 5. Currently, we open each file in OpenOffice and save it as DocBook. This is a time-consuming task, and I am confident there is a better way. The files will then be processed further into our custom RELAX NG schema, so the conversion does not need to be flawless. I have looked around, and will continue to investigate some leads, but have not found anything useful.

The answers to "Convert doc/docx to semantic HTML" suggest upCast, but that does not seem appropriate for my needs.

I am looking for something freely available that I can use from the command line. Ultimately, I would like to batch process our files. I have included the linux, python, and java tags because these are the environments I am most comfortable with, but I would be willing to bend for the right solution. I am trying to do some research before I go out and reinvent the wheel.

matchew
  • Have you considered looking at the OpenOffice API to script the open + save-as? – Thorbjørn Ravn Andersen Jun 13 '11 at 15:27
  • I have edited your question and removed quite a bit from it. You have been here for a while, but please take a look at the [FAQ]: a signature should not be added, and your PS was subjective and almost a different question. Please review my edit and see if your question is still complete. – Trufa Jun 13 '11 at 15:28
  • It is, Trufa. Thanks for the edit. I suppose I am more familiar with email exchanges than I am with Stack Overflow. @Thorbjørn Ravn Andersen, I have not; this may be a viable solution. – matchew Jun 13 '11 at 15:31

3 Answers

At the risk of earning an archeologist's badge from SX, the answers should include a reference to Pandoc, which does not rely on OpenOffice.

pandoc -f docx -t docbook -o newdocbook.dbk --standalone original.docx
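
Since the question asks for batch processing, the command above can be wrapped in a short script. The following is only a minimal sketch: the function names `build_pandoc_command` and `convert_directory` and the `.dbk` output naming are made up for illustration, and it assumes `pandoc` is on the PATH. It uses pandoc's `docbook5` writer because the question targets DocBook 5; pass `writer="docbook"` to get exactly the command shown above.

```python
import subprocess
from pathlib import Path


def build_pandoc_command(src, out_dir, writer="docbook5"):
    """Return the pandoc argument list for one .docx file (hypothetical helper)."""
    target = Path(out_dir) / (Path(src).stem + ".dbk")
    return ["pandoc", "-f", "docx", "-t", writer, "--standalone",
            "-o", str(target), str(src)]


def convert_directory(in_dir, out_dir, writer="docbook5"):
    """Convert every .docx directly inside in_dir; return the files that failed."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for src in sorted(Path(in_dir).glob("*.docx")):
        # Assumes pandoc is installed; a non-zero exit code marks a failure.
        result = subprocess.run(build_pandoc_command(src, out_dir, writer))
        if result.returncode != 0:
            failed.append(src)
    return failed
```

Calling `convert_directory("incoming", "docbook")` (both directory names are placeholders) converts the files one at a time and returns the ones pandoc could not handle, so a single bad document does not stop the batch.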

intotecho

There are several ways to script this, both using external scripts and scripts within OpenOffice. See the following links for some examples:

Some of the above links aren't using Java or Python, but the principles still apply and the scripts are typically short enough that they can be ported (the first example is in Ruby, but it's my personal favorite due to the simplicity).
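
As an illustration of the external-script approach, here is a rough Python sketch. It is not taken from any of the linked examples. It assumes a LibreOffice-style `soffice` binary that accepts `--headless`, `--convert-to`, and `--outdir`, and it assumes the DocBook export filter is installed under the name "DocBook File"; older OpenOffice builds may not support this command line at all. The function names are hypothetical.

```python
import subprocess
from pathlib import Path

# Assumption: the DocBook export filter is registered under this name.
DOCBOOK_FILTER = "xml:DocBook File"


def build_soffice_command(sources, out_dir, binary="soffice"):
    """Return the argument list for one headless conversion run (hypothetical helper)."""
    return [binary, "--headless", "--convert-to", DOCBOOK_FILTER,
            "--outdir", str(out_dir)] + [str(s) for s in sources]


def convert_with_office(in_dir, out_dir, binary="soffice"):
    """Convert all .docx files directly inside in_dir in a single office run."""
    sources = sorted(Path(in_dir).glob("*.docx"))
    if not sources:
        return []
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if the office process exits non-zero.
    subprocess.run(build_soffice_command(sources, out_dir, binary), check=True)
    return sources
```

Passing all files to one process avoids starting the office suite once per document, which is usually the slow part.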

bta
  • Thank you. For one reason or another I settled on the Python solution: http://mail.python.org/pipermail/python-announce-list/2006-May/004951.html – matchew Jun 13 '11 at 19:05

You can run OpenOffice in server mode and feed the documents to it without having to manually open each one.

One way: http://code.google.com/p/bungeni-editor/wiki/RunningTheJODConverterServer

jlargent
  • Thanks for the quick reply. I spent some time on this earlier this morning, but after getting everything configured properly it was having trouble supporting docx and/or xml. – matchew Jun 13 '11 at 19:06