Python: Convert PDF to DOC

Question

How can I convert a PDF file to DOCX? Is there a way of doing this using Python?

I've seen some pages that let a user upload a PDF and get back a DOC file, like PdfToWord.

Thanks in advance

If the pdf data is tabular, you can use tabula library to process your data and output in doc. — Stuti Verma, Feb 12 '19 at 10:16

score 20 · Accepted Answer · 2014-10-14T10:37:04.163

If you have LibreOffice installed:

lowriter --invisible --convert-to doc '/your/file.pdf'

If you want to use Python for this:

import os
import subprocess

for top, dirs, files in os.walk('/my/pdf/folder'):
    for filename in files:
        if filename.endswith('.pdf'):
            abspath = os.path.join(top, filename)
            # Pass the arguments as a list so that paths containing spaces
            # or quotes are not misread by a shell.
            subprocess.call(['lowriter', '--invisible', '--convert-to', 'doc',
                             abspath])

edited Oct 14 '14 at 10:37

answered Oct 14 '14 at 10:30

When I execute this command in my terminal, it just opens a new _empty_ LibreOffice window. I'm running `lowriter --invisible --convert-to doc 'mypdf.pdf'`. But this seems to be what I'm looking for! Thanks! – AlvaroAV Oct 14 '14 at 10:40
@Liarez You can specify the output folder in the arguments. By default the converted file may be present in the ~/Home Directory. Check the help options (`lowriter --help`). Sorry, I couldn't test it now. – Oct 14 '14 at 10:50
Solved! I finally got the command to work! It works as expected, thank you very much!! – AlvaroAV Oct 14 '14 at 11:11
@alvaroav Please share the command you used to convert PDF to Word with LibreOffice. I have the latest version of LibreOffice. – Steeve Sep 17 '16 at 13:01
For the macOS users: `/Applications/LibreOffice.app/Contents/MacOS/soffice --invisible --infilter="writer_pdf_import" --convert-to docx:"MS Word 2007 XML" file-to-convert.pdf` – Leland Apr 07 '21 at 20:01
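
The comments above mention two adjustments: choosing the output folder and naming the PDF import filter explicitly. A minimal sketch that combines both is below. `build_convert_command` and `convert` are hypothetical names, and the sketch assumes a `soffice` binary on the PATH that accepts `--headless`, `--infilter`, `--convert-to` and `--outdir` (on macOS, pass the full path from the comment above as `soffice`).

```python
import subprocess


def build_convert_command(pdf_path, out_dir, fmt="docx", soffice="soffice"):
    """Return the soffice argument list for one PDF (hypothetical helper)."""
    return [
        soffice,
        "--headless",                    # assumption: run without opening a window
        "--infilter=writer_pdf_import",  # open the PDF with Writer's import filter
        "--convert-to", fmt,
        "--outdir", out_dir,             # folder that receives the converted file
        pdf_path,
    ]


def convert(pdf_path, out_dir, fmt="docx"):
    """Run the conversion; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_convert_command(pdf_path, out_dir, fmt), check=True)
```

Passing a list instead of a formatted string means file names with spaces or quotes need no shell escaping.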

score 9 · Answer 2 · answered Oct 14 '14 at 10:30

This is difficult because PDFs are presentation-oriented and Word documents are content-oriented. I have tested both and can recommend the following projects.

However, you are most definitely going to lose presentational aspects in the conversion.

score 7 · Answer 3 · answered Apr 06 '20 at 19:06

If you want to convert a PDF to an MS Word file type such as docx, I came across this.

Ahsin Shabbir wrote:

import glob
import win32com.client
import os

word = win32com.client.Dispatch("Word.Application")
word.visible = 0

pdfs_path = ""  # folder where the .pdf files are stored (with a trailing backslash)
reqs_path = ""  # folder for the .docx output (with a trailing backslash); not defined in the original snippet
for doc in glob.iglob(pdfs_path + "*.pdf"):
    print(doc)
    filename = os.path.basename(doc)
    in_file = os.path.abspath(doc)
    print(in_file)
    wb = word.Documents.Open(in_file)
    # Drop the ".pdf" suffix and append ".docx"
    out_file = os.path.abspath(reqs_path + filename[0:-4] + ".docx")
    print("outfile\n", out_file)
    wb.SaveAs2(out_file, FileFormat=16)  # file format for docx
    print("success...")
    wb.Close()

word.Quit()

This worked like a charm for me; it converted a 500-page PDF with formatting and images.
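
One thing the script above does not handle is a failure partway through: if a file fails to open, Word keeps running in the background. A rough sketch that closes Word in a `finally` block and derives the output name with `pathlib` instead of slicing follows. `docx_path_for` and `convert_with_word` are hypothetical names, `FileFormat=16` is taken from the script above, and Windows with Word and pywin32 installed is assumed.

```python
from pathlib import PureWindowsPath


def docx_path_for(pdf_path, out_dir):
    """Return the .docx path inside out_dir that matches pdf_path's base name."""
    return str(PureWindowsPath(out_dir) / (PureWindowsPath(pdf_path).stem + ".docx"))


def convert_with_word(pdf_paths, out_dir):
    """Convert each PDF with Word over COM (absolute input paths assumed)."""
    import win32com.client  # imported here so docx_path_for works without pywin32

    word = win32com.client.Dispatch("Word.Application")
    try:
        for pdf_path in pdf_paths:
            doc = word.Documents.Open(pdf_path)
            try:
                doc.SaveAs2(docx_path_for(pdf_path, out_dir), FileFormat=16)
            finally:
                doc.Close()
    finally:
        word.Quit()  # always release the Word process, even after an error
```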

I think setting `word.visible = 1` is a much better option. This lets the user see all the messages or warnings shown by Word. If we set `word.visible = 0`, Word cannot show any errors or warnings, which complicates debugging. — raman, May 19 '20 at 08:15

score 2 · Answer 4 · answered Nov 07 '19 at 15:29

You can use the GroupDocs.Conversion Cloud SDK for Python without installing any third-party tool or software.

Sample Python code:

# Import module
import groupdocs_conversion_cloud

# Get your app_sid and app_key at https://dashboard.groupdocs.cloud (free registration is required).
app_sid = "xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx"
app_key = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# Create instance of the API
convert_api = groupdocs_conversion_cloud.ConvertApi.from_keys(app_sid, app_key)
file_api = groupdocs_conversion_cloud.FileApi.from_keys(app_sid, app_key)

try:

        # Upload source file to storage
        filename = 'Sample.pdf'
        remote_name = 'Sample.pdf'
        output_name= 'sample.docx'
        strformat='docx'

        request_upload = groupdocs_conversion_cloud.UploadFileRequest(remote_name,filename)
        response_upload = file_api.upload_file(request_upload)
        #Convert PDF to Word document
        settings = groupdocs_conversion_cloud.ConvertSettings()
        settings.file_path =remote_name
        settings.format = strformat
        settings.output_path = output_name

        loadOptions = groupdocs_conversion_cloud.PdfLoadOptions()
        loadOptions.hide_pdf_annotations = True
        loadOptions.remove_embedded_files = False
        loadOptions.flatten_all_fields = True

        settings.load_options = loadOptions

        convertOptions = groupdocs_conversion_cloud.DocxConvertOptions()
        convertOptions.from_page = 1
        convertOptions.pages_count = 1

        settings.convert_options = convertOptions
        request = groupdocs_conversion_cloud.ConvertDocumentRequest(settings)
        response = convert_api.convert_document(request)

        print("Document converted successfully: " + str(response))
except groupdocs_conversion_cloud.ApiException as e:
        print("Exception when calling convert_document: {0}".format(e.message))

I'm a developer evangelist at Aspose.

Good idea let's send potentially confidential documents to a third party /s — Paradoxis, Feb 13 '20 at 11:59
Customer has complete control of his cloud storage and can use [any cloud storage](https://docs.groupdocs.cloud/total/configure-3rd-party/) like Amazon S3/Google Drive/ Azure storage/ Dropbox/ FTP Storage etc. of his choice — Tilal Ahmad, Sep 30 '20 at 06:09

score 1 · Answer 5 · answered Aug 04 '21 at 12:33

Based on previous answers, this was the solution that worked best for me, using Python 3.7.1:

import win32com.client
import os

# INPUT/OUTPUT PATH
pdf_path = r"""C:\path2pdf.pdf"""
output_path = r"""C:\output_folder"""

word = win32com.client.Dispatch("Word.Application")
word.visible = 0  # CHANGE TO 1 IF YOU WANT TO SEE WORD APPLICATION RUNNING AND ALL MESSAGES OR WARNINGS SHOWN BY WORD

# GET FILE NAME AND NORMALIZED PATH
filename = pdf_path.split('\\')[-1]
in_file = os.path.abspath(pdf_path)

# CONVERT PDF TO DOCX AND SAVE IT ON THE OUTPUT PATH WITH THE SAME INPUT FILE NAME
wb = word.Documents.Open(in_file)
out_file = os.path.abspath(output_path + '\\' + filename[0:-4] + ".docx")
wb.SaveAs2(out_file, FileFormat=16)
wb.Close()
word.Quit()

rsc05 · Answer 6 · 2022-09-17T14:37:02.113

With Adobe on your machine

If you have Adobe Acrobat on your machine, you can use the following function to save the PDF file as a docx file:

# Open PDF file, use Acrobat Exchange to save file as .docx file.

import win32com.client, win32com.client.makepy, os, winerror, errno, re
from win32com.client.dynamic import ERRORS_BAD_CONTEXT

def PDF_to_Word(input_file, output_file):
    
    ERRORS_BAD_CONTEXT.append(winerror.E_NOTIMPL)
    src = os.path.abspath(input_file)
    
    # Launch Adobe
    win32com.client.makepy.GenerateFromTypeLibSpec('Acrobat')
    adobe = win32com.client.DispatchEx('AcroExch.App')
    avDoc = win32com.client.DispatchEx('AcroExch.AVDoc')
    # Open file
    avDoc.Open(src, src)
    pdDoc = avDoc.GetPDDoc()
    jObject = pdDoc.GetJSObject()
    # Save as word document
    jObject.SaveAs(output_file, "com.adobe.acrobat.docx")
    avDoc.Close(-1)

Be mindful that input_file and output_file need to be full paths, as follows:

D:\OneDrive...\file.pdf
D:\OneDrive...\dafad.docx

does this preserve all formatting and would it work on linux? — mike01010, Jun 29 '23 at 18:57

score 0 · Answer 7 · answered Nov 17 '22 at 11:28

For Linux users with LibreOffice installed, try:

soffice --invisible --convert-to doc file_name.pdf

If you get an error like `Error: no export filter found, aborting`, try this:

soffice --infilter="writer_pdf_import" --convert-to doc file_name.pdf
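
The two commands above can be chained from Python so the second runs only when the first fails. This is a sketch under assumptions: `convert_with_fallback` is a hypothetical name, and failure is detected either by a non-zero exit code or by the text "no export filter" appearing in the output, which is based only on the error message quoted above.

```python
import subprocess


def convert_with_fallback(pdf_path, fmt="doc", run=subprocess.run):
    """Try the plain command, then retry with the PDF import filter."""
    plain = ["soffice", "--invisible", "--convert-to", fmt, pdf_path]
    filtered = ["soffice", "--infilter=writer_pdf_import",
                "--convert-to", fmt, pdf_path]
    for cmd in (plain, filtered):
        result = run(cmd, capture_output=True, text=True)
        output = (result.stdout or "") + (result.stderr or "")
        # Assumption: this substring marks the export-filter failure.
        if result.returncode == 0 and "no export filter" not in output:
            return cmd  # the command that succeeded
    raise RuntimeError("LibreOffice could not convert " + pdf_path)
```

The `run` parameter exists only so the function can be exercised without LibreOffice installed; in normal use it can be left at its default.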

el2e10
