I am converting docx files to pdf. Currently I convert the docx to a txt file and then write the txt file to a pdf.
But I would like to convert the docx directly to a pdf from the XML tree parsed with lxml (maintaining the structure/formatting of that tree).
Is there a streamlined way to do this?
Current conversion code (the docx to txt step):
import codecs
import os
import sys
import zipfile
from shutil import copyfile, rmtree

from lxml import etree

W_NS = {'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}

# Path to the input .docx file; it is copied to .zip and then extracted.
zip_dir = sys.argv[1]
zip_dir_zip_ext = os.path.splitext(zip_dir)[0] + '.zip'
copyfile(zip_dir, zip_dir_zip_ext)
zip_ref = zipfile.ZipFile(zip_dir_zip_ext, 'r')
zip_ref.extractall('./temp')
zip_ref.close()

# Collect the text of every <w:t> run; node.text is None for an empty run.
data = etree.parse('./temp/word/document.xml')
result = [(node.text or '').strip() for node in data.xpath('//w:t', namespaces=W_NS)]

# Write one run per line to a temporary .txt file.
with codecs.open(os.path.splitext(zip_dir)[0] + '_converted_temp.txt', 'w', 'UTF-8') as txt:
    joined_result = '\n'.join(result)
    txt.write(joined_result)

# Clean up the extracted files and the .zip copy.
rmtree('./temp')
os.remove(zip_dir_zip_ext)
Inspiration: How do I write a python script that can read doc/docx files and convert them to txt?