An efficient way to convert document to pdf format

Question

I have been trying to find an efficient way to convert documents (e.g. doc, docx, ppt, pptx) to PDF. So far I have tried docsplit and oowriter, but both took more than 10 seconds to convert a 1.7 MB pptx file. Can anyone suggest a better tool, or a way to improve my approach?

What I have tried:

from subprocess import Popen, PIPE
import time


def convert(src, dst):
    """Run each converter on `src` and print how long it took."""
    d = {'src': src, 'dst': dst}
    commands = [
        '/usr/bin/docsplit pdf --output %(dst)s %(src)s' % d,
        'oowriter --headless -convert-to pdf:writer_pdf_Export %(dst)s %(src)s' % d,
    ]

    for i, command in enumerate(commands, 1):
        st = time.time()
        # I am aware of the consequences of using `shell=True`.
        process = Popen(command, stdout=PIPE, stderr=PIPE, shell=True)
        out, err = process.communicate()
        if process.returncode != 0:
            raise Exception(err)
        elapsed = time.time() - st
        print('Command %d: Completed in %.2f seconds' % (i, elapsed))


if __name__ == '__main__':
    src = '/path/to/source/file/'
    dst = '/path/to/destination/folder/'
    convert(src, dst)

Output:

Command 1: Completed in 11.91 seconds
Command 2: Completed in 11.55 seconds

Environment:

Linux - Ubuntu 12.04
Python 2.7.3

Results from other tools:

jodconverter took 11.32 seconds
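
As a variation on the approach above, here is a minimal sketch that passes an argument list instead of a shell string, so `shell=True` is not needed. It assumes a LibreOffice build whose launcher is on the PATH as `soffice` (on some systems it may be `libreoffice`) and which accepts `--convert-to` and `--outdir`. The function names are hypothetical, and nothing here is expected to make the conversion itself faster.

```python
import subprocess


def build_soffice_command(src, out_dir):
    """Return an argument list for a headless LibreOffice PDF conversion."""
    return [
        'soffice',              # assumed launcher name, adjust for your system
        '--headless',           # run without a GUI
        '--convert-to', 'pdf',  # target format
        '--outdir', out_dir,    # directory that receives the PDF
        src,
    ]


def convert_with_soffice(src, out_dir):
    """Convert `src` to PDF; raises CalledProcessError on a non-zero exit."""
    subprocess.check_call(build_soffice_command(src, out_dir))
```

Keeping the command construction in its own function makes it easy to log or inspect the exact invocation before running it.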

Note that this is not a real benchmark. A single result doesn't mean much. Results should be calculated as an average over many trials, and at least the standard deviation should be reported as well. – BartoszKP Jan 02 '14 at 22:12
@BartoszKP Thanks for the clarification. I chose the wrong word. – Aamir Rind Jan 02 '14 at 22:15
Well, since you're interested in efficiency, "benchmark" is the right word to use, because that's the tool to measure efficiency. So your code is wrong, not your words :) – BartoszKP Jan 02 '14 at 22:19
Yes, you are correct :P but I was just trying to give a simple scenario to show my problem. – Aamir Rind Jan 02 '14 at 22:22
I understand :) But you can never be sure that nothing "strange" happened during your single run: you received an e-mail, the OS decided to swap some memory pages to disk, the GC started its work; there are many possibilities :) – BartoszKP Jan 02 '14 at 22:25
The Microsoft and PDF formats are both very complex. 11 seconds might not be out of line. – Mark Ransom Jan 02 '14 at 22:52
Does it make a difference if you run those commands in the shell instead of in Python? That is, if you run `/usr/bin/docsplit pdf --output dst src` without Python. – janos Jan 06 '14 at 06:26
IMHO you should try running the code several times (e.g. 20), or run it on more similar files, and take an average. You might benefit from OS caching (i.e. `docsplit` and `oowriter` might remain in memory between runs). – Laur Ivan Jan 06 '14 at 14:06
Actually, my aim is to run these commands from Python in a Django application. Whenever a user uploads a document that is not a PDF, I have to convert it to PDF first, so processing starts as soon as the user uploads a file. – Aamir Rind Jan 06 '14 at 18:34
Also, when a user uploads a file, a scheduled Celery task is created to convert that file to PDF. So it is the time of a single run that needs to be improved here. – Aamir Rind Jan 06 '14 at 18:44
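
The averaging suggested in the comments above could be sketched roughly as follows. This is an illustration in Python 3, not the asker's code: the `benchmark` helper is a hypothetical name, and the placeholder command (which just starts and exits a Python interpreter) stands in for the real converter invocation.

```python
import statistics
import subprocess
import sys
import time


def benchmark(command, trials=5):
    """Run `command` (an argument list) `trials` times.

    Returns (mean, standard deviation) of the wall-clock times in seconds.
    `trials` must be at least 2, because a sample standard deviation
    needs two data points.
    """
    durations = []
    for _ in range(trials):
        start = time.perf_counter()
        # check_call raises CalledProcessError on a non-zero exit status.
        subprocess.check_call(
            command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), statistics.stdev(durations)


if __name__ == '__main__':
    # Placeholder command: replace it with the converter you want to time.
    mean, stdev = benchmark([sys.executable, '-c', 'pass'], trials=5)
    print('mean %.3f s, stdev %.3f s' % (mean, stdev))
```

Since the first run may include cold-cache effects, it can also be worth looking at the individual timings rather than only the summary.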

score 18 · Accepted Answer · answered Jan 06 '14 at 12:42

Try calling unoconv from your Python code. It took 8 seconds on my local machine; I don't know if that's fast enough for you:

time unoconv 15.\ Text-Files.pptx
real    0m8.604s
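
A minimal sketch of calling unoconv from Python, assuming `unoconv` is installed and on the PATH. The function names are hypothetical; `-f` (output format) and `-o` (output file) are unoconv options.

```python
import subprocess


def build_unoconv_command(src, dst):
    """Return the argument list for converting `src` to the PDF file `dst`."""
    # -f selects the output format, -o the output file name.
    return ['unoconv', '-f', 'pdf', '-o', dst, src]


def convert_with_unoconv(src, dst):
    """Convert `src` to PDF; raises CalledProcessError on a non-zero exit."""
    # An argument list avoids shell=True, so spaces in file names
    # (as in "15. Text-Files.pptx") need no escaping.
    subprocess.check_call(build_unoconv_command(src, dst))
```

unoconv can also be started once as a listener (`unoconv --listener`) so that later conversions connect to an already running office instance. Whether that reduces the per-file time noticeably would need to be measured.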

answered Jan 06 '14 at 12:42

avenet

Python Uno is the most reliable way to get decent PDF output from various MS Office document types. It uses the (Star|Libre|Open)office backend to convert documents. In principle you can do more than just convert documents. You can incorporate **basic** routines as well. I would still use Uno very carefully. Office suites are known to be memory hogs. Do look through https://wiki.openoffice.org/wiki/PyUNO_bridge – Supreet Sethi Jan 06 '14 at 12:49
Thanks for your answer, I'll try it and let you know :) – Aamir Rind Jan 09 '14 at 16:33
I'd still like it to be faster :P but I think that is the best time so far. Thanks – Aamir Rind Jan 13 '14 at 06:00

score 3 · Answer 2 · answered Jan 09 '14 at 16:26

Pandoc is a wonderful tool capable of doing what you'd like quickly. Since you're using Popen to effectively shell out the command for the tool, it doesn't matter what language the tool is written in (Pandoc is written in Haskell).
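
A rough sketch of shelling out to Pandoc with `Popen`, assuming `pandoc` is on the PATH together with a PDF engine (by default it relies on a LaTeX installation). The function names are hypothetical. As far as I know, Pandoc reads docx but not doc, ppt or pptx, so this would only cover part of the formats in the question.

```python
from subprocess import Popen, PIPE


def build_pandoc_command(src, dst):
    """Return the argument list for converting `src` to `dst` with pandoc."""
    # pandoc infers PDF output from the .pdf extension of `dst`.
    return ['pandoc', src, '-o', dst]


def convert_with_pandoc(src, dst):
    """Run pandoc and raise RuntimeError with its stderr on failure."""
    process = Popen(build_pandoc_command(src, dst), stdout=PIPE, stderr=PIPE)
    out, err = process.communicate()
    if process.returncode != 0:
        raise RuntimeError(err.decode('utf-8', 'replace'))
```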

answered Jan 09 '14 at 16:26

jeffknupp

Thanks for your answer, I'll try it and let you know :) – Aamir Rind Jan 09 '14 at 16:34
Adding https://pypi.org/project/pypandoc/ for people still looking to do this. It removes the need to use Popen to shell out the command. – Thereissoupinmyfly Jul 12 '18 at 10:48

score 2 · Answer 3 · answered Jan 07 '14 at 18:01

Unfortunately I don't have the time to do a full benchmark, but you may want to check out xtopdf, my Python toolkit for PDF creation. It doesn't do the full range of conversions you want, and some of the conversions have limitations, but it may be of use. xtopdf links:

Online presentation about xtopdf - a good summary of what it is, what it does, platforms, features, users, uses etc.: http://slid.es/vasudevram/xtopdf

xtopdf on Bitbucket: https://bitbucket.org/vasudevram/xtopdf

Many blog posts showing how to use xtopdf for various purposes, including many that show how to use it to convert different input formats to PDF: http://jugad2.blogspot.com/search/label/xtopdf

HTH, Vasudev Ram

answered Jan 07 '14 at 18:01

Vasudev Ram

The DOCX conversion on xtopdf appears to extract the text only and strips formatting. Not amazingly useful. – fatuhoku May 25 '16 at 11:39
@fatuhoku: Yes, it does just that. And that is what "some of the conversions have limitations," implies - as should be somewhat obvious if you had read my comment. I rely on libraries for most of the input format conversions, so if they have limitations, so does xtopdf in those cases. Straightforward. Also, not everything has to be "amazingly useful". Just "useful" is good enough for very many use cases - along with some tweaking with custom code or by hand, even. Happens all the time in real life. – Vasudev Ram May 25 '16 at 19:12
Hey @Vasudev didn't mean to put down your project. It's true that I didn't read your whole answer. Too late to edit my comment. With a name like `xtopdf`, saying that it "doesn't do the full range of conversions" is actually an understatement, which prompted my comment for posterity. – fatuhoku May 26 '16 at 14:27
No it isn't an understatement, because the x in the name stands for "solve for x" - which implies, like math equations involving x, that there may not be solutions for some values of x, or there may be, but they are not yet found - or not yet worked on :) Also, you admitted you didn't read my whole answer; and now you are changing the topic from one of those quoted phrases to another in midstream. – Vasudev Ram May 28 '16 at 17:03
Also, the two phrases you quoted (from my answer), occur in the SECOND sentence of my answer (not somewhere much later). So, not only did you not read my whole answer, you did not even read the second sentence before commenting on it. And I even said "it may be of use" - not "will be of use" or "amazingly useful". So you are being overly critical without doing your homework - which is common on the Internet. – Vasudev Ram May 28 '16 at 17:44

score -1 · Answer 4 · answered Feb 14 '15 at 20:59

For doc and docx (but not ppt/pptx), you could try our independent (but commercial) high-fidelity rendering engine online at OnlineDemo/docx_to_pdf

By "high fidelity", I mean it is designed from the ground up to produce the same line and paragraph breaks, tab stops, etc. as Microsoft Word.
