I have around 1.2 million .doc files (each around 50 KB) that need to be converted to .docx. So far I have tried using Word via the win32com interface for Python, but it is really slow (1-2 files per second). Is there a faster way to accomplish this?

Edit: Code I'm using so far:

import glob
import os

import win32com.client


def convert_doc_to_docx():
    src_dir = "sampledir"  # renamed from `dir`, which shadows a builtin
    word = win32com.client.Dispatch("Word.Application")
    word.Visible = False
    # Join with os.path.join: `dir + "*.doc"` would search for "sampledir*.doc".
    # Build a list up front because the loop deletes files from this folder.
    doc_files = glob.glob(os.path.join(src_dir, "*.doc"))
    total_files = len(doc_files)
    try:
        for i, doc in enumerate(doc_files, start=1):
            in_file = os.path.abspath(doc)
            out_file = in_file + "x"
            wb = word.Documents.Open(in_file)
            wb.SaveAs2(out_file, FileFormat=16)  # 16 = wdFormatDocumentDefault (.docx)
            wb.Close()
            os.remove(in_file)
            print(f"{i} of {total_files} files processed!")
    finally:
        word.Quit()  # close Word even if a conversion fails

Zergoholic
  • That script sounds interesting. Is it available somewhere? It's hard to guess without viewing the code, but this seems like a good candidate for multiprocessing. Also, if you are using traditional spinning hard drives (and have 2 of them), writing to a different disk than the one you read from can make a difference for large payloads like this. I'm not surprised at the numbers if you are doing these serially. – tdelaney Nov 21 '22 at 16:16
  • Yeah, I thought about multiprocessing, but I only have 4 cores here, and even 4 times the speed would still be kind of slow, so I haven't tried it yet :D – Zergoholic Nov 21 '22 at 16:24
  • Multiprocessing won't help because the real work is performed by Word, not Python. You can start 4 instances of Word instead but, as win32com doesn't have async operations, it's tricky. Perhaps a better option would be to use the `wordconv` utility [as shown in this answer](https://stackoverflow.com/questions/10996949/word-file-converter) to convert the files from a shell. They'll still be in Compatibility mode though. You can use PowerShell's `ForEach-Object -Parallel` to run multiple instances of `wordconv` and process all files. You may be able to use more than 4, as part of the operation is IO-bound – Panagiotis Kanavos Nov 21 '22 at 16:32
  • Yeah, I don't know whether using Word for the conversion also makes it significantly slower, even if you ignore the missing multithreading. – Zergoholic Nov 21 '22 at 16:41
  • @PanagiotisKanavos - How so? Office generally uses a single threaded apartment model. The only way to get parallel conversion is multiple processes. I don't see how "work is performed by word, not python" means that you can only do the work serially. – tdelaney Nov 21 '22 at 21:00

1 Answer

As other commenters have suggested, wordconv seems to be a good solution and is much faster than using win32com. For ~1700 files the conversion took ~370 seconds, or about 0.21 seconds per file. This time can depend heavily on your hardware, since the job involves a lot of read and write operations as well as some processing power for the conversion. I basically maxed out 16 GB of RAM and an old 6th-gen i7. Using an HDD will probably slow it down a lot. Even at 0.21 seconds per file, 1.2 million files will take about 70 hours (if your machine is similar in speed to mine). But it's still a clear improvement over the 1-2 files per second (0.5-1 second per file) reported in the question, roughly 2 to 5 times as fast.
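The total-time estimate can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it takes the elapsed time and file count from the benchmark output in this answer, and it assumes the per-file rate stays constant across all 1.2 million files, which may not hold on different hardware.

```python
# Figures taken from the benchmark output in this answer.
elapsed_seconds = 369.7448267
files_converted = 1728
total_files = 1_200_000  # "around 1.2 million" from the question

# Assumption: the per-file rate scales linearly to the full job.
seconds_per_file = elapsed_seconds / files_converted
estimated_hours = total_files * seconds_per_file / 3600

print(f"{seconds_per_file:.3f} s per file, about {estimated_hours:.0f} hours in total")
```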

I use subprocess.Popen() to run the command `"C:\Program Files\Microsoft Office\root\Office16\Wordconv.exe" -oice -nme srcfile dstfile` inside the for loop.

Although subprocess.run() is the recommended way to invoke a subprocess, I used subprocess.Popen() because it does not wait for the process to finish before continuing, so the conversions run concurrently. subprocess.run() blocks until the command exits, so on its own it would convert the files one at a time. There might be a way to get the same overlap with subprocess.run(), but I'm not familiar enough with it to say (maybe someone can provide feedback on that).

import os
import subprocess
from timeit import default_timer as timer


def convert_doc_to_docx():
    src_dir = r"c:\Users\myuser\test"
    out_dir = r"c:\Users\myuser\test\dst"
    os.makedirs(out_dir, exist_ok=True)  # make sure the output folder exists

    # only pick up .doc files so nothing else in the folder is passed to Wordconv
    all_files = [name for name in os.listdir(src_dir)
                 if name.lower().endswith(".doc") and os.path.isfile(os.path.join(src_dir, name))]
    file_count = len(all_files)

    # change according to where "Wordconv.exe" is located on your system
    path_to_wordconv = r"C:\Program Files\Microsoft Office\root\Office16\Wordconv.exe"

    print(f"Source dir file count: {file_count}")
    start = timer()
    processes = []
    for file in all_files:
        in_file_path = os.path.join(src_dir, file)
        out_file_path = os.path.join(out_dir, file + "x")

        # Popen returns immediately, so all conversions are started at once;
        # this will get process intensive
        processes.append(subprocess.Popen([path_to_wordconv, "-oice", "-nme", in_file_path, out_file_path]))

    # wait for every conversion to finish so the timing covers the actual work
    for process in processes:
        process.wait()
    end = timer()

    count_output_dir = len([name for name in os.listdir(out_dir) if os.path.isfile(os.path.join(out_dir, name))])
    elapsed_time = end - start
    time_object = elapsed_time / count_output_dir

    print(f"Elapsed time: {elapsed_time} second")
    print(f"Time per object: {time_object} second")


convert_doc_to_docx()

Output

Source dir file count: 1728
Elapsed time: 369.7448267 second
Time per object: 0.21397270063657406 second
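
On the open question about subprocess.run(): one possible approach is to call it from a small thread pool, so that only a fixed number of Wordconv processes run at a time instead of one per file all at once. What follows is a minimal sketch, not a tested recipe for 1.2 million files. The names `convert_one`, `convert_all`, `converter` and `max_workers` are made up for illustration, the Wordconv path and the `-oice -nme` flags are simply copied from above, and the worker count of 4 is an arbitrary starting point.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Assumed install location (copied from the answer); adjust for your system.
WORDCONV = r"C:\Program Files\Microsoft Office\root\Office16\Wordconv.exe"


def convert_one(src, dst, converter):
    """Run the converter on one file and return (source path, exit code)."""
    # subprocess.run() blocks until the command exits; the thread pool below
    # is what lets several of these calls overlap.
    result = subprocess.run([*converter, str(src), str(dst)], capture_output=True)
    return src, result.returncode


def convert_all(src_dir, out_dir, converter=(WORDCONV, "-oice", "-nme"), max_workers=4):
    """Convert every .doc file in src_dir, running at most max_workers at once."""
    src_dir, out_dir = Path(src_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    sources = sorted(p for p in src_dir.iterdir()
                     if p.is_file() and p.suffix.lower() == ".doc")

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(convert_one, src, out_dir / (src.name + "x"), converter)
                   for src in sources]
        results = [future.result() for future in futures]

    # Collect the files whose converter process reported a non-zero exit code.
    failed = [src for src, code in results if code != 0]
    return len(results) - len(failed), failed
```

Calling `convert_all(r"c:\Users\myuser\test", r"c:\Users\myuser\test\dst")` would return the number of files whose process exited with code 0 plus a list of the ones that did not. Threads are enough here because the heavy lifting happens in the external process, and capping the worker count should keep RAM usage from maxing out the way starting every process at once did.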
kconsiglio