Converting docx to pdf with pure python (on linux, without libreoffice)

Question

I'm developing a web app, part of which converts uploaded docx files to PDF files (after some processing). With python-docx and other methods, most of the processing needs neither a Windows machine with Word installed nor LibreOffice on Linux (my web server is PythonAnywhere: Linux, but without LibreOffice and without sudo or apt install permissions). Converting to PDF, however, seems to require one of those. From exploring questions here and elsewhere, this is what I have so far:

import os
import subprocess

try:
    from comtypes import client
except ImportError:
    client = None

def doc2pdf(doc):
    """
    convert a doc/docx document to pdf format
    :param doc: path to document
    """
    doc = os.path.abspath(doc) # bugfix - searching files in windows/system32
    if client is None:
        return doc2pdf_linux(doc)
    name, ext = os.path.splitext(doc)
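    word = worddoc = None
    try:
        word = client.CreateObject('Word.Application')
        worddoc = word.Documents.Open(doc)
        worddoc.SaveAs(name + '.pdf', FileFormat=17)  # 17 = wdFormatPDF
    finally:
        # close only what was actually opened, so a failure in
        # CreateObject/Open does not raise NameError here
        if worddoc is not None:
            worddoc.Close()
        if word is not None:
            word.Quit()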


def doc2pdf_linux(doc):
    """
    convert a doc/docx document to pdf format (linux only, requires libreoffice)
    :param doc: path to document
    """
    cmd = 'libreoffice --convert-to pdf'.split() + [doc]
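    p = subprocess.Popen(cmd, stderr=subprocess.PIPE, stdout=subprocess.PIPE)
    try:
        # communicate() drains both pipes while waiting, so a chatty child
        # cannot fill a pipe buffer and stall before the timeout
        stdout, stderr = p.communicate(timeout=10)
    except subprocess.TimeoutExpired:
        p.kill()
        p.communicate()
        raise
    if stderr:
        raise subprocess.SubprocessError(stderr)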

As you can see, one method requires comtypes, another requires libreoffice as a subprocess. Other than switching to a more sophisticated hosting server, is there any solution?

Python-docx does not require Word (nor Windows) because it does practically all the work inside its source code. ("Practically all", barring a few external standard modules such as XML, ZIP stuff, and image handling.) Since Python is a Turing-complete language, you can do the same to create a PDF out of nothing, with no external software. Read [the official specifications](https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf) front-to-back and you'll know why it's far easier to use an external program. — Jongware, Jun 22 '18 at 13:42
@usr2564301 Of course it's easier, but that isn't an option for me without switching servers — Ofer Sadan, Jun 22 '18 at 22:01
Then find a pure Python implementation for creating PDFs (recommending one is against Stack Overflow guidelines, but surely you can use a search engine and find one suitable for your purposes and level of programming), or roll your own. But be warned, there are good reasons "everybody" is using external utilities – read the aforementioned specifications to understand why. — Jongware, Jun 22 '18 at 22:09
why not use an api that you trigger with python e.g. https://www.convertapi.com/docx-to-pdf ? Also check this question https://stackoverflow.com/questions/3815983/whats-the-best-program-api-for-converting-word-docs-to-pdf-that-does-not-requ — Rick, Jun 29 '18 at 13:57
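
To make Jongware's point above concrete, here is a minimal sketch of writing a PDF "out of nothing" with only the standard library. It is an illustration, not a docx converter: the function name `minimal_pdf` is made up for this example, it assumes plain ASCII text, a single A4 page, and the built-in Helvetica font, and it does no wrapping, pagination, or styling. Everything a real document needs beyond that (fonts, tables, images, right-to-left text) is where the effort described in the specification comes in.

```python
def minimal_pdf(lines):
    """Return the bytes of a one-page PDF showing `lines` in 12pt Helvetica.

    Hypothetical helper for illustration only: ASCII text, no line
    wrapping, no pagination.
    """
    def esc(text):
        # Backslash and parentheses must be escaped inside PDF string literals.
        return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")

    # Page content stream: begin text, pick font, set leading, move to the
    # top-left margin, then show one line at a time (T* moves to next line).
    ops = ["BT", "/F1 12 Tf", "14 TL", "72 770 Td"]
    for line in lines:
        ops.append(f"({esc(line)}) Tj T*")
    ops.append("ET")
    stream = "\n".join(ops).encode("ascii", "replace")

    # The five indirect objects of the file, numbered 1..5 in this order.
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 595 842] "
        b"/Resources << /Font << /F1 4 0 R >> >> /Contents 5 0 R >>",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
        b"<< /Length %d >>\nstream\n" % len(stream) + stream + b"\nendstream",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of this object, for the xref table
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    # Cross-reference table: one 20-byte entry per object, plus the free entry 0.
    xref_start = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset

    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_start
    return bytes(out)
```

Writing the result with `open("out.pdf", "wb").write(minimal_pdf(["Hello", "world"]))` should give a file that PDF viewers open. The input lines could come from python-docx (the `text` of each paragraph in `Document(path).paragraphs`), but all formatting would be lost, which is why the comments steer toward an existing library or an external converter.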

score 24 · Accepted Answer · answered Jun 30 '18 at 21:11

The PythonAnywhere help pages offer information on working with PDF files here: https://help.pythonanywhere.com/pages/PDF

Summary: PythonAnywhere has a number of Python packages for PDF manipulation installed, and one of them may do what you want. However, shelling out to abiword seems easiest to me. The shell command `abiword --to=pdf filetoconvert.docx` will convert the docx file to a PDF and produce a file named `filetoconvert.pdf` in the same directory as the docx. Note that this command will output an error message to the standard error stream complaining about `XDG_RUNTIME_DIR` (or at least it did for me), but it still works, and the error message can be ignored.
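
For anyone who wants to call this from Python rather than from a shell, a minimal sketch might look like the following. The function name `abiword_to_pdf` and its parameters are made up for this example. It assumes that `abiword` is on the `PATH` and that pointing `XDG_RUNTIME_DIR` at `/tmp` is acceptable on your host (a workaround reported in the comments on this answer). Because the exit status of abiword in the presence of that warning is not something I have verified, success is judged by whether the PDF file exists afterwards.

```python
import os
import shutil
import subprocess
from pathlib import Path


def abiword_to_pdf(docx_path, timeout=60, executable="abiword"):
    """Convert docx_path to a PDF in the same directory by shelling out to AbiWord.

    Hypothetical helper: returns the Path of the PDF, or raises if the
    converter is missing or no PDF was produced.
    """
    docx_path = Path(docx_path).resolve()
    if shutil.which(executable) is None:
        raise FileNotFoundError(f"{executable!r} was not found on PATH")

    # Assumption: setting XDG_RUNTIME_DIR silences the warning on stderr.
    env = dict(os.environ)
    env.setdefault("XDG_RUNTIME_DIR", "/tmp")

    # Same command as above: abiword --to=pdf filetoconvert.docx
    subprocess.run(
        [executable, "--to=pdf", str(docx_path)],
        env=env,
        timeout=timeout,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )

    # AbiWord writes the PDF next to the input, with the same base name.
    pdf_path = docx_path.with_suffix(".pdf")
    if not pdf_path.exists():
        raise RuntimeError(f"conversion did not produce {pdf_path}")
    return pdf_path
```

Calling `abiword_to_pdf("filetoconvert.docx")` would then return the path to `filetoconvert.pdf`. A PDF left over from an earlier run would also pass the existence check, so delete stale output first if that matters.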

answered Jun 30 '18 at 21:11 by jcgoble3

I'll have to do some tests to see if it works without messing up the files, but this is exactly the kind of answer I wanted to hear :) will report back results – Ofer Sadan Jul 01 '18 at 12:00
This works for me too. It does create a PDF file (with the same filename), but I received the `XDG_RUNTIME_DIR` error as well. To get rid of it, I ran `export XDG_RUNTIME_DIR=/tmp/` in the bash console, and the error disappeared on the second attempt. Finally, to check that the conversion was successful, I downloaded the PDF from PythonAnywhere to my computer and opened it to see the contents. All content displayed correctly. – amanb Jul 02 '18 at 11:41
Reporting back: This works reasonably well (some problems with right-to-left languages), and it is by far the best solution for me for now (I'll probably migrate to Google Cloud eventually). Thank you! – Ofer Sadan Jul 03 '18 at 08:01
From Abiword's website: "Please note Windows users: Due to lack of Windows developers on the project, there is no longer a version available for download." – Thom Ives Oct 06 '19 at 16:45
@ThomIves While that may be true, this is about Linux usage via PythonAnywhere, so Windows versions are not relevant here. – jcgoble3 Oct 06 '19 at 16:47
@jcgoble3 Agreed, and, while I'd prefer to do everything in Linux, I sometimes have to work in Windows, so I thought I'd let others know who are looking for general solutions. – Thom Ives Oct 06 '19 at 18:11
Thank you so much! That tip with abiword is marvellous! – Pedroski Apr 03 '20 at 23:02
Is abiword free and does it need to be installed separately? can someone help me with a working snippet please? – dgor Dec 08 '21 at 14:35

score 2 · Answer 2 · answered May 09 '19 at 20:47

Another option is LibreOffice, although the output quality may not be as good as converting through Word itself via comtypes.

Once LibreOffice is installed, this code does the conversion:

from subprocess import Popen
LIBRE_OFFICE = r"C:\Program Files\LibreOffice\program\soffice.exe"

def convert_to_pdf(input_docx, out_folder):
    cmd = [LIBRE_OFFICE, '--headless', '--convert-to', 'pdf', '--outdir',
           out_folder, input_docx]
    print(cmd)  # log the exact command being run
    p = Popen(cmd)
    p.communicate()  # wait for the conversion to finish


sample_doc = 'file.docx'
out_folder = 'some_folder'
convert_to_pdf(sample_doc, out_folder)

answered May 09 '19 at 20:47 by dfresh22

This does not seem to work well in parallel. I created 10 Popen instances to convert 10 docx files, but only got 5 PDFs, without any error output. – Z fp Mar 20 '21 at 07:06
interesting, I did this a while ago, but perhaps post your code? – dfresh22 Mar 20 '21 at 07:13
I posted a question with my codes: https://stackoverflow.com/questions/66719566/libreoffice-convert-docx-to-pdf-in-parallel-not-working-well @dfresh22 – Z fp Mar 20 '21 at 08:27
The title reads: "Converting docx to pdf with pure python (on linux, without LibreOffice)" No LibreOffice. – victorkolis Nov 15 '22 at 16:06
does this preserve all formatting, tables, images, etc.? – mike01010 Jun 29 '23 at 19:03
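
On the parallel-conversion problem raised in the comments above: one commonly suggested explanation (an assumption here, not something confirmed in this thread) is that concurrent `soffice` processes share a single user profile and get in each other's way. A possible workaround is to give each process its own throwaway profile through LibreOffice's `-env:UserInstallation=` bootstrap parameter. The sketch below uses made-up helper names (`soffice_command`, `convert_one`, `convert_many`) and assumes `soffice` is on the `PATH`.

```python
import subprocess
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def soffice_command(docx, out_dir, profile_dir, soffice="soffice"):
    """Build a headless conversion command that uses its own profile directory."""
    profile_uri = Path(profile_dir).resolve().as_uri()  # file:///... URI
    return [
        soffice,
        f"-env:UserInstallation={profile_uri}",
        "--headless", "--convert-to", "pdf",
        "--outdir", str(out_dir), str(docx),
    ]


def convert_one(docx, out_dir, soffice="soffice", timeout=120):
    """Convert a single file, using a temporary profile that is removed afterwards."""
    with tempfile.TemporaryDirectory() as profile_dir:
        cmd = soffice_command(docx, out_dir, profile_dir, soffice)
        subprocess.run(cmd, check=True, timeout=timeout)
    # LibreOffice names the output after the input file, with a .pdf extension.
    return Path(out_dir) / (Path(docx).stem + ".pdf")


def convert_many(docx_files, out_dir, workers=4):
    """Run several conversions at once; each worker thread waits on one soffice process."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda f: convert_one(f, out_dir), docx_files))
```

A simpler alternative is to pass all input files to a single `soffice --convert-to pdf` invocation and let LibreOffice process them one after another.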

score 2 · Answer 3 · answered Mar 17 '22 at 08:22

Here is docx-to-PDF code for Linux (on Windows, install LibreOffice and use the full path to soffice in place of 'soffice'):

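import os
import subprocess

def generate_pdf(doc_path, path):
    # convert doc_path with LibreOffice and write the PDF into `path`
    subprocess.call(['soffice',
                     # '--headless',  # optional: --convert-to already implies headless mode
                     '--convert-to',
                     'pdf',
                     '--outdir',
                     path,
                     doc_path])
    # LibreOffice names the output after the input file, with a .pdf extension
    pdf_name = os.path.splitext(os.path.basename(doc_path))[0] + '.pdf'
    return os.path.join(path, pdf_name)

generate_pdf("docx_path.docx", "output_path")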

answered Mar 17 '22 at 08:22 by nabeel tahir

It works great on Ubuntu (20.04 LTS) with LibreOffice installed. – SimoX Sep 26 '22 at 14:23

Alexey Noskov · Answer 4 · 2023-07-08T13:58:39.207

You can use Aspose.Words for Python to convert DOCX and other document formats to PDF. Code is simple - load a document and save it as PDF:

import aspose.words as aw

doc = aw.Document("in.docx")
doc.save("out.pdf")

Additional conversion options can be specified using PdfSaveOptions, for example PDF compliance: https://docs.aspose.com/words/python-net/convert-a-document-to-pdf/ Note that Aspose.Words for Python has additional requirements under Linux: https://docs.aspose.com/words/python-net/system-requirements/#system-requirements-for-target-linux-platform

Note: Aspose.Words is a commercial product and has two main limitations in evaluation mode:

It adds an evaluation watermark into the document
It limits the maximum size of the document to several hundreds of paragraphs.

If you would like to test Aspose.Words without the evaluation-version limitations, you can request a free 30-day temporary license.

The license should be applied through the code:

lic = aw.License()
lic.set_license("C:\\Temp\\Aspose.Word.Python.NET.lic")

More information here: https://docs.aspose.com/words/python-net/licensing/

Yes, you are right. At the moment Aspose.Words does not support MacOS. But there are plans to support MacOS. — Alexey Noskov, Jul 08 '23 at 13:59

score -1 · Answer 5 · answered Jun 07 '23 at 09:52

I tried Alexey's answer: Aspose.Words handles the conversion well, but the generated PDF has a watermark and also contains statements in red (for evaluation purposes).

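import os

import aspose.words as aw
from django.conf import settings
from django.http import FileResponse
from django.shortcuts import get_object_or_404
from docxtpl import DocxTemplate

from .models import Project  # assumed location of the Project model


def download_approval(request, project_id):
    project = get_object_or_404(Project, pk=project_id)
    doc = DocxTemplate('letter.docx')
    context = {
        'ref_num': project.ref_num,
        'author_name': project.author.get_full_name(),  # call the method
        'approval_date': project.approved_date.date(),
        'project_title': project.title_en
    }
    doc.render(context)
    # build a filesystem path under MEDIA_ROOT (MEDIA_URL is a URL prefix, not a directory)
    base = os.path.join(settings.MEDIA_ROOT, 'approval', project.ref_num + '_approval_letter')
    doc.save(base + '.docx')
    # Document.save() writes the PDF to disk; it does not return the file's bytes
    aw.Document(base + '.docx').save(base + '.pdf')
    response = FileResponse(open(base + '.pdf', 'rb'), content_type='application/pdf')
    response['Content-Disposition'] = 'inline; filename=' + os.path.basename(base + '.pdf')
    return response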

Yes, Aspose.Words is a commercial product and has two main limitations in evaluation mode: Aspose.Words adds an evaluation watermark into the document and limits the maximum size of the document to several hundreds of paragraphs. If you would like to test Aspose.Words without evaluation version limitations, you can request a free 30-days temporary license: https://purchase.aspose.com/temporary-license/. I have updated the answer. — Alexey Noskov, Jul 08 '23 at 13:54

score -6 · Answer 6 · answered Mar 09 '22 at 05:19

I found the simplest way to do that in a Linux environment:

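import subprocess

# file_path: path to the .docx file to convert (defined by the caller).
# Passing a list avoids the shell, so paths with spaces work.
subprocess.run(["lowriter", "--convert-to", "pdf", str(file_path)], check=True)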

answered Mar 09 '22 at 05:19 by apar mishra

Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Mar 09 '22 at 05:56
Very easy indeed, but this question was specifically about not using libreoffice, and it's my understanding that `lowriter` is part of libreoffice – Ofer Sadan Mar 09 '22 at 09:51
