24

How can I convert a PDF file to docx? Is there a way of doing this using Python?

I've seen some pages that let a user upload a PDF and return a DOC file, like PdfToWord.

Thanks in advance

rsc05
AlvaroAV

7 Answers

20

If you have LibreOffice installed:

lowriter --invisible --convert-to doc '/your/file.pdf'

If you want to use Python for this:

import os
import subprocess

for top, dirs, files in os.walk('/my/pdf/folder'):
    for filename in files:
        if filename.endswith('.pdf'):
            abspath = os.path.join(top, filename)
            # Pass the arguments as a list so file names with quotes or spaces are safe
            subprocess.call(['lowriter', '--invisible', '--convert-to', 'doc', abspath])
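
When no output folder is given, LibreOffice writes the converted file to the current working directory, which can make it look as if nothing happened. Here is a minimal sketch of the same loop that passes `--outdir` so each result lands next to its source PDF. The helper names (`find_pdfs`, `build_command`, `convert_folder`) are my own and not part of LibreOffice; `lowriter` is assumed to be on the PATH.

```python
import os
import subprocess


def find_pdfs(root):
    """Yield the path of every .pdf file below root (case-insensitive match)."""
    for top, _dirs, files in os.walk(root):
        for filename in sorted(files):
            if filename.lower().endswith(".pdf"):
                yield os.path.join(top, filename)


def build_command(pdf_path, target="doc"):
    """Build the lowriter argument list, writing the result next to the PDF."""
    out_dir = os.path.dirname(os.path.abspath(pdf_path))
    return ["lowriter", "--invisible", "--convert-to", target,
            "--outdir", out_dir, pdf_path]


def convert_folder(root, target="doc", run=subprocess.call):
    """Run one conversion per PDF and return whatever `run` returns for each."""
    return [run(build_command(path, target)) for path in find_pdfs(root)]
```

Calling `convert_folder('/my/pdf/folder')` would then launch one conversion per file. The `run` parameter is only there so the command can be inspected or swapped out without starting LibreOffice.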
  • When I execute this command in my terminal, it just opens a new _empty_ LibreOffice window. I'm running `lowriter --invisible --convert-to doc 'mypdf.pdf'`. But this seems to be what I'm looking for! Thanks! – AlvaroAV Oct 14 '14 at 10:40
  • @Liarez You can specify the output folder in the arguments. By default the converted file may be present in the ~/Home Directory. Check the help options (`lowriter --help`). Sorry, I couldn't test it now. –  Oct 14 '14 at 10:50
  • Solved! I finally got the command to work. It works as expected, thank you very much! – AlvaroAV Oct 14 '14 at 11:11
  • 6
    @alvaroav Please share the command you used to convert PDF to Word with LibreOffice. I have the latest version of LibreOffice. – Steeve Sep 17 '16 at 13:01
  • 2
    For the macOS users: `/Applications/LibreOffice.app/Contents/MacOS/soffice --invisible --infilter="writer_pdf_import" --convert-to docx:"MS Word 2007 XML" file-to-convert.pdf` – Leland Apr 07 '21 at 20:01
9

This is difficult because PDFs are presentation-oriented and Word documents are content-oriented. I have tested both of the following projects and can recommend them.

  1. PyPDF2
  2. PDFMiner

However, you are most definitely going to lose presentational aspects in the conversion.
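
To make the trade-off concrete, here is a rough sketch of a text-only conversion: extract the text with PDFMiner and write it out as plain paragraphs. It assumes the `pdfminer.six` package (for `extract_text`) and, for writing the file, the `python-docx` package, which is my own addition and not mentioned above. Fonts, images, tables and layout are all dropped, so treat it as an illustration rather than a finished converter.

```python
import re


def split_paragraphs(text):
    """Split extracted text on blank lines and re-join hard-wrapped lines."""
    text = text.replace("\f", "\n\n")  # treat page breaks (form feeds) as paragraph breaks
    blocks = re.split(r"\n\s*\n", text)
    paragraphs = [" ".join(block.split()) for block in blocks]
    return [p for p in paragraphs if p]


def pdf_text_to_docx(pdf_path, docx_path):
    """Write the text of pdf_path into a new .docx, one paragraph per text block."""
    # Imported here so the helper above works without the third-party packages.
    from pdfminer.high_level import extract_text  # assumption: pdfminer.six is installed
    from docx import Document  # assumption: python-docx is installed

    document = Document()
    for paragraph in split_paragraphs(extract_text(pdf_path)):
        document.add_paragraph(paragraph)
    document.save(docx_path)
```

Usage would be `pdf_text_to_docx("input.pdf", "output.docx")`, where both file names are placeholders.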

ham-sandwich
7

If you want to convert a PDF to an MS Word format such as docx, I came across this.

Ahsin Shabbir wrote:

import glob
import win32com.client
import os

word = win32com.client.Dispatch("Word.Application")
word.visible = 0

pdfs_path = "" # folder where the .pdf files are stored
reqs_path = "" # folder where the .docx files will be saved
for doc in glob.iglob(os.path.join(pdfs_path, "*.pdf")):
    print(doc)
    filename = os.path.basename(doc)
    in_file = os.path.abspath(doc)
    print(in_file)
    wb = word.Documents.Open(in_file)
    # Strip the ".pdf" extension and save under the same name as .docx
    out_file = os.path.abspath(os.path.join(reqs_path, filename[0:-4] + ".docx"))
    print("outfile\n",out_file)
    wb.SaveAs2(out_file, FileFormat=16) # file format for docx
    print("success...")
    wb.Close()

word.Quit()

This worked like a charm for me; it converted a 500-page PDF with formatting and images.

eleks007
  • 2
    I think setting `word.visible = 1` is a much better option. This will allow the user to see all the messages or warnings shown by Word. If we set `word.visible = 0`, Word cannot show any errors or warnings, which complicates debugging. – raman May 19 '20 at 08:15
  • 2
    @eleks007-san, reqs_path is undefined – Thuấn Đào Minh Feb 02 '21 at 03:41
2

You can use the GroupDocs.Conversion Cloud SDK for Python without installing any third-party tool or software.

Sample Python code:

# Import module
import groupdocs_conversion_cloud

# Get your app_sid and app_key at https://dashboard.groupdocs.cloud (free registration is required).
app_sid = "xxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx"
app_key = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

# Create instance of the API
convert_api = groupdocs_conversion_cloud.ConvertApi.from_keys(app_sid, app_key)
file_api = groupdocs_conversion_cloud.FileApi.from_keys(app_sid, app_key)

try:

        # Upload source file to storage
        filename = 'Sample.pdf'
        remote_name = 'Sample.pdf'
        output_name= 'sample.docx'
        strformat='docx'

        request_upload = groupdocs_conversion_cloud.UploadFileRequest(remote_name,filename)
        response_upload = file_api.upload_file(request_upload)
        #Convert PDF to Word document
        settings = groupdocs_conversion_cloud.ConvertSettings()
        settings.file_path =remote_name
        settings.format = strformat
        settings.output_path = output_name

        loadOptions = groupdocs_conversion_cloud.PdfLoadOptions()
        loadOptions.hide_pdf_annotations = True
        loadOptions.remove_embedded_files = False
        loadOptions.flatten_all_fields = True

        settings.load_options = loadOptions

        convertOptions = groupdocs_conversion_cloud.DocxConvertOptions()
        convertOptions.from_page = 1
        convertOptions.pages_count = 1

        settings.convert_options = convertOptions
        request = groupdocs_conversion_cloud.ConvertDocumentRequest(settings)
        response = convert_api.convert_document(request)

        print("Document converted successfully: " + str(response))
except groupdocs_conversion_cloud.ApiException as e:
        print("Exception when calling convert_document: {0}".format(e.message))

I'm a developer evangelist at Aspose.

Tilal Ahmad
  • 7
    Good idea let's send potentially confidential documents to a third party /s – Paradoxis Feb 13 '20 at 11:59
  • The customer has complete control over their cloud storage and can use [any cloud storage](https://docs.groupdocs.cloud/total/configure-3rd-party/) of their choice, such as Amazon S3, Google Drive, Azure Storage, Dropbox or FTP storage. – Tilal Ahmad Sep 30 '20 at 06:09
1

Based on previous answers, this was the solution that worked best for me, using Python 3.7.1:

import win32com.client
import os

# INPUT/OUTPUT PATH
pdf_path = r"""C:\path2pdf.pdf"""
output_path = r"""C:\output_folder"""

word = win32com.client.Dispatch("Word.Application")
word.visible = 0  # CHANGE TO 1 IF YOU WANT TO SEE WORD APPLICATION RUNNING AND ALL MESSAGES OR WARNINGS SHOWN BY WORD

# GET FILE NAME AND NORMALIZED PATH
filename = pdf_path.split('\\')[-1]
in_file = os.path.abspath(pdf_path)

# CONVERT PDF TO DOCX AND SAVE IT ON THE OUTPUT PATH WITH THE SAME INPUT FILE NAME
wb = word.Documents.Open(in_file)
out_file = os.path.abspath(output_path + '\\' + filename[0:-4] + ".docx")
wb.SaveAs2(out_file, FileFormat=16)
wb.Close()
word.Quit()
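
One thing the Word-based scripts above do not handle is a failure in the middle of a conversion, which can leave the document open in a hidden Word instance. Below is a small sketch that wraps the same `Documents.Open` / `SaveAs2` / `Close` calls in `try`/`finally`. The function names are my own; `word` is assumed to be the object returned by `win32com.client.Dispatch("Word.Application")`, and `ntpath` is used so the Windows-style paths are handled the same way on any platform.

```python
import ntpath

WD_FORMAT_DOCX = 16  # the same FileFormat value the scripts above pass to SaveAs2


def docx_path_for(pdf_path, output_dir):
    """Return output_dir\\<pdf name without extension>.docx (Windows-style path)."""
    stem = ntpath.splitext(ntpath.basename(pdf_path))[0]
    return ntpath.join(output_dir, stem + ".docx")


def convert_with_word(word, pdf_path, output_dir):
    """Open pdf_path in Word, save it as .docx, and always close the document."""
    out_file = docx_path_for(pdf_path, output_dir)
    doc = word.Documents.Open(pdf_path)
    try:
        doc.SaveAs2(out_file, FileFormat=WD_FORMAT_DOCX)
    finally:
        doc.Close()  # runs even if SaveAs2 raises
    return out_file
```

The caller still creates the Word application and calls `word.Quit()` when done, ideally in its own `try`/`finally`.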
Jonny_P
0

With Adobe Acrobat on your machine

If you have Adobe Acrobat installed, you can use the following function to save a PDF file as a docx file:

# Open PDF file, use Acrobat Exchange to save file as .docx file.

import win32com.client, win32com.client.makepy, os, winerror
from win32com.client.dynamic import ERRORS_BAD_CONTEXT

def PDF_to_Word(input_file, output_file):
    
    ERRORS_BAD_CONTEXT.append(winerror.E_NOTIMPL)
    src = os.path.abspath(input_file)
    
    # Launch Adobe Acrobat
    win32com.client.makepy.GenerateFromTypeLibSpec('Acrobat')
    adobe = win32com.client.DispatchEx('AcroExch.App')
    avDoc = win32com.client.DispatchEx('AcroExch.AVDoc')
    # Open file
    avDoc.Open(src, src)
    pdDoc = avDoc.GetPDDoc()
    jObject = pdDoc.GetJSObject()
    # Save as word document
    jObject.SaveAs(output_file, "com.adobe.acrobat.docx")
    avDoc.Close(-1)

Be mindful that input_file and output_file need to be given in the following form:

  1. D:\OneDrive...\file.pdf
  2. D:\OneDrive...\dafad.docx
rsc05
0

For Linux users with LibreOffice installed, try:

soffice --invisible --convert-to doc file_name.pdf

If you get an error like "Error: no export filter found, aborting", try this:

soffice --infilter="writer_pdf_import" --convert-to doc file_name.pdf
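
The two commands above can be chained from Python: try the plain one first and fall back to the `writer_pdf_import` filter if no output file appears. This is only a sketch under a few assumptions: `soffice` is on the PATH, success is judged purely by whether the expected output file exists, and `--outdir` is added so the script knows where to look. The function names are hypothetical.

```python
import os
import subprocess


def soffice_commands(pdf_path, out_dir, target="doc"):
    """Return the plain command, then the one with the PDF import filter."""
    base = ["soffice", "--invisible"]
    tail = ["--convert-to", target, "--outdir", out_dir, pdf_path]
    return [base + tail, base + ["--infilter=writer_pdf_import"] + tail]


def convert_with_fallback(pdf_path, out_dir, target="doc", run=subprocess.run):
    """Try each command in turn until the converted file shows up in out_dir."""
    stem = os.path.splitext(os.path.basename(pdf_path))[0]
    extension = target.split(":")[0]  # "docx:Some Filter" -> "docx"
    expected = os.path.join(out_dir, stem + "." + extension)
    for command in soffice_commands(pdf_path, out_dir, target):
        run(command)
        if os.path.exists(expected):
            return expected
    raise RuntimeError("LibreOffice did not produce " + expected)
```

For example, `convert_with_fallback("file_name.pdf", ".")` would return the path of the converted file or raise if both attempts fail.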
el2e10