As other commenters have suggested, wordconv seems to be a good solution and is much faster than using win32com. For ~1700 files the conversion took ~370 seconds, or about 0.21 seconds per object. This time depends heavily on your hardware, since the job involves a lot of read and write operations as well as some processing power for the conversion. I basically maxed out 16GB of RAM and an old 6th gen i7, and using an HDD will probably slow it down a lot. Even at 0.21 seconds per object it's still going to take something like 70 hours (if your machine is similar in speed to mine). But that's a vast improvement over the 1-2 seconds per object of the win32com approach, which is roughly 5-10x as long.
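To make that arithmetic easy to redo for your own numbers, here is a minimal sketch. The elapsed time and file count are the figures from my run; the 1,200,000 file count is purely a made-up example that happens to land near 70 hours, and estimate_hours is just a helper name I picked for illustration.

```python
def estimate_hours(file_count, seconds_per_file):
    """Projected wall-clock time in hours for a batch conversion."""
    return file_count * seconds_per_file / 3600


# Seconds per file measured on my machine (elapsed time / files converted).
measured = 369.7448267 / 1728

# Hypothetical batch size; replace with your own file count.
example_file_count = 1_200_000

print(f"{measured:.3f} s per file")
print(f"{estimate_hours(example_file_count, measured):.1f} hours for {example_file_count} files")
```

Plug in your own file count and, ideally, a per-file time measured on a small sample on your own hardware.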
I use subprocess.Popen() to run the command "C:\Program Files\Microsoft Office\root\Office16\Wordconv.exe" -oice -nme srcfile dstfile in the for loop. Although the recommended way to invoke a subprocess is subprocess.run(), I used subprocess.Popen() because it doesn't wait for the process to finish before continuing. There might be a way to do this with subprocess.run() as well, but I'm not familiar enough with it to say (maybe someone can provide feedback on that).
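On the subprocess.run() question, one possible approach is to call it from a small thread pool: each call still blocks, but only its own worker thread, so several conversions run at once while the total is capped. This is only a sketch, not something I have benchmarked against Wordconv. The helper names run_all and build_wordconv_commands and the max_workers value are my own assumptions; the -oice and -nme flags are the same ones used in the command above.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_all(commands, max_workers=4):
    """Run each command with subprocess.run(), at most max_workers at a time.

    Returns the exit codes in the same order as the commands.
    """
    def run_one(cmd):
        # subprocess.run() blocks, but only the worker thread that called it.
        return subprocess.run(cmd).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, commands))


def build_wordconv_commands(wordconv, pairs):
    """Build one command line per (source, destination) pair."""
    return [[wordconv, "-oice", "-nme", src, dst] for src, dst in pairs]


if __name__ == "__main__":
    # Stand-in commands so the sketch runs anywhere; swap in
    # build_wordconv_commands(path_to_wordconv, pairs) for real conversions.
    demo = [[sys.executable, "-c", "print('converted')"]] * 3
    print(run_all(demo, max_workers=2))
```

Capping the number of workers should also keep the machine from being saturated the way launching every process at once does, and the returned exit codes give you a way to spot failed conversions.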
import os
import subprocess
from timeit import default_timer as timer


def convert_doc_to_docx():
    src_dir = r"c:\Users\myuser\test"
    out_dir = r"c:\Users\myuser\test\dst"
    # change according to where "Wordconv.exe" is located on your system
    path_to_wordconv = r"C:\Program Files\Microsoft Office\root\Office16\Wordconv.exe"

    # make sure the output directory exists before anything is written to it
    os.makedirs(out_dir, exist_ok=True)

    # only pick up .doc files (skips sub-directories and other file types)
    all_files = [
        name for name in os.listdir(src_dir)
        if name.lower().endswith(".doc") and os.path.isfile(os.path.join(src_dir, name))
    ]
    file_count = len(all_files)
    print(f"Source dir file count: {file_count}")

    start = timer()
    processes = []
    for file in all_files:
        in_file_path = os.path.join(src_dir, file)
        out_file_path = os.path.join(out_dir, file + "x")  # foo.doc -> foo.docx
        # this will get process intensive: every conversion is started at once
        processes.append(
            subprocess.Popen([path_to_wordconv, "-oice", "-nme", in_file_path, out_file_path])
        )
    # Popen returns immediately, so wait for every conversion before stopping the timer
    for process in processes:
        process.wait()
    end = timer()

    count_output_dir = len([name for name in os.listdir(out_dir) if os.path.isfile(os.path.join(out_dir, name))])
    elapsed_time = end - start
    # avoid dividing by zero if nothing was converted
    time_object = elapsed_time / max(count_output_dir, 1)
    print(f"Elapsed time: {elapsed_time} second")
    print(f"Time per object: {time_object} second")


convert_doc_to_docx()
Output
Source dir file count: 1728
Elapsed time: 369.7448267 second
Time per object: 0.21397270063657406 second
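Since subprocess.Popen() does not raise an error when a conversion exits with a failure, it is probably worth checking afterwards that every source file actually produced an output. A minimal sketch, assuming the same naming scheme as above (foo.doc becomes foo.docx); missing_outputs is a hypothetical helper name of mine:

```python
import os


def missing_outputs(src_dir, out_dir):
    """Return the .doc files in src_dir that have no matching .docx in out_dir."""
    converted = {name.lower() for name in os.listdir(out_dir)}
    return sorted(
        name for name in os.listdir(src_dir)
        if name.lower().endswith(".doc")
        and os.path.isfile(os.path.join(src_dir, name))
        and (name + "x").lower() not in converted
    )
```

Calling missing_outputs(src_dir, out_dir) after the run gives you a list of files to retry; an empty list means every .doc file has a counterpart in the output directory.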