0

I'm working on a Java program that converts .doc and .docx files to PDF. I've tested several approaches, including a number of open source Java libraries, but sadly these libraries would often mess up the layout of the documents.

I've stumbled upon a JScript script (JavaScript run by Windows Script Host) that uses the underlying Microsoft Word instance to open the file and save it as a PDF (found at: https://superuser.com/questions/17612/batch-convert-word-documents-to-pdfs-free/28303#28303):

// Usage: cscript <this script> <input .doc/.docx> <output .pdf>
var fso = new ActiveXObject("Scripting.FileSystemObject");
var docPath = WScript.Arguments(0);
var pdfPath = WScript.Arguments(1);
// Word needs an absolute path to open the document.
docPath = fso.GetAbsolutePathName(docPath);
var objWord = null;
try {
    WScript.Echo("Saving '" + docPath + "' as '" + pdfPath + "'...");
    objWord = new ActiveXObject("Word.Application");
    objWord.Visible = false;
    var objDoc = objWord.Documents.Open(docPath);
    var wdFormatPdf = 17; // WdSaveFormat value for PDF
    objDoc.SaveAs(pdfPath, wdFormatPdf);
    objDoc.Close();
    WScript.Echo("The CV was successfully converted.");
} catch (err) {
    WScript.Echo("An error occurred: " + err.message);
} finally {
    // Always shut down the Word instance this script started.
    if (objWord != null) {
        objWord.Quit();
    }
}

My Java program calls this script synchronously, once for each document.

On a small scale this seems to work great, but when dealing with several thousand documents I encountered a couple of problems:

  • Sometimes a Word process would hang at the 'Save as' prompt. The process would then block until a user intervened.
  • Sometimes a Word process would hang at a 'Bookmark' prompt. Here too, the process stays blocked until a user dismisses the prompt.

I'm looking for the best/cleanest way to control these Word processes, for example by giving them a deadline: each process gets 5 seconds to open the Word document and save it as a PDF, and if it is still active after 5 seconds it is killed.

I've dealt with something similar in the past, and the solution then included a 'kill Word processes' batch script that killed any WORD processes still stuck after the program ended. Not very clean, but it did its job.

Any experiences or ideas would be appreciated!

Yannick De Turck
  • That is JavaScript or worse, not Java. – Adder Jan 07 '13 at 16:29
  • Unless you're trying to learn the technology, just install a PDF printer and "print" the documents to PDF. I used the (non-free) one available with Adobe Acrobat, but there seem to be many free utilities available that do the same thing. – Gus Jan 07 '13 at 16:36
  • Does http://stackoverflow.com/questions/607669/how-do-i-convert-word-files-to-pdf-programmatically suffer the same problem? (C# alike) – Michael Lloyd Lee mlk Jan 07 '13 at 16:50
  • http://support.microsoft.com/kb/257757/en-us - Microsoft's notes on automating Office (they don't recommend it). – Michael Lloyd Lee mlk Jan 07 '13 at 17:11
  • @mlk, the warning only applies if the automation is done server-side, which is not the case here (it's not mentioned in the question). – RealHowTo Jan 07 '13 at 21:22
  • @mlk Thanks for linking that thread. I've been doing a couple of tests with the program mentioned by Eric Ness, and it offers more ways to test a Word file than the JavaScript scriptlet does. Another 'prompt issue' I was facing occurred when the Word file was password-protected. In the C# program I am able to supply a default value to be used when prompted for a password. I can then catch an Exception, skip the document and continue with the other docs instead of being stuck at the password prompt. I will perform some more tests to see if I can get through the other blocking prompts as well. – Yannick De Turck Jan 07 '13 at 21:52

3 Answers

2

You can use https://www.npmjs.com/package/@nativedocuments/docx-wasm in a serverless environment (e.g. AWS Lambda) to perform your conversions in parallel. Lambda takes care of the concurrency. docx-wasm is self-contained (i.e. there is no need to run Microsoft Word). It uses a freemium model.

Edit April 2019

https://github.com/NativeDocuments/docx-to-pdf-on-AWS-Lambda is a sample project for using it on Lambda.

JasonPlutext
  • docx-wasm is no longer available. Their site has been taken down and they are no longer issuing licences. – lukejkw Nov 20 '20 at 13:14
1

I managed to get around the issue of the process getting stuck at a Microsoft Word prompt. In my final solution I changed my Java code to start the JavaScript script in a separate thread. The main thread then sleeps for a few seconds before checking on the other thread.

The other thread keeps a reference to the Process instance it uses to run the script. The main thread checks the exitValue of that process; if the script is stuck at a Microsoft Word prompt, the process has not terminated yet and an IllegalThreadStateException is thrown. I handle that exception by killing the process and cleaning up any temporary files left behind by Microsoft Word.
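
For readers who want something concrete, here is a minimal sketch of the same deadline idea. It is not my exact code: instead of sleeping and probing exitValue, it uses Process.waitFor(long, TimeUnit) (available since Java 8), which returns false when the process is still running after the timeout. The class name DeadlineRunner, the script name convert.js and the 5-second deadline are assumptions chosen for illustration.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

public class DeadlineRunner {

    /**
     * Starts the command and waits at most the given timeout for it to exit.
     * Returns true only if the process exited by itself with exit code 0.
     * If the deadline passes, the process is killed and false is returned.
     */
    public static boolean runWithDeadline(List<String> command, long timeout, TimeUnit unit)
            throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .inheritIO() // forward the script's output to this JVM's console
                .start();

        if (process.waitFor(timeout, unit)) {
            // The process finished before the deadline.
            return process.exitValue() == 0;
        }

        // Still running after the deadline (e.g. blocked on a prompt): kill it.
        process.destroyForcibly();
        process.waitFor(); // wait until the kill has actually taken effect
        return false;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("Usage: DeadlineRunner <input.docx> <output.pdf>");
            return;
        }
        // Assumption: the conversion script is saved as convert.js in the working directory.
        List<String> command = Arrays.asList("cscript", "//NoLogo", "convert.js", args[0], args[1]);
        boolean converted = runWithDeadline(command, 5, TimeUnit.SECONDS);
        System.out.println(converted ? "Converted " + args[0] : "Skipped " + args[0]);
    }
}
```

Note that this only kills the script host process. Word is started through ActiveX as a separate WINWORD.EXE process, so a stuck Word instance may survive and still need to be cleaned up separately.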

Yannick De Turck
-2

Microsoft's support guidance says not to use Office unattended or server-side.

If you need a simple conversion, LibreOffice has a command-line option, --convert-to.
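
As a rough sketch of how that could be called from Java, the snippet below builds the LibreOffice command line for a headless PDF conversion. The class name LibreOfficeCommand, the binary name soffice (which must be on the PATH) and the file names are assumptions for illustration; --headless, --convert-to and --outdir are LibreOffice command-line options.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class LibreOfficeCommand {

    /** Builds the argument list for a headless conversion of one document to PDF. */
    public static List<String> build(String sofficeBinary, Path input, Path outputDir) {
        return Arrays.asList(
                sofficeBinary,
                "--headless",              // run without a user interface
                "--convert-to", "pdf",     // target format
                "--outdir", outputDir.toString(),
                input.toString());
    }

    public static void main(String[] args) {
        // Hypothetical file names; replace them with real paths.
        List<String> command = build("soffice", Paths.get("input.docx"), Paths.get("out"));
        System.out.println(String.join(" ", command));
        // To run the conversion: new ProcessBuilder(command).inheritIO().start().waitFor();
    }
}
```

The resulting list can be passed to ProcessBuilder, ideally with a timeout on the wait so that a stuck conversion does not block the whole batch.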

Kenster