
I am converting docx files to pdf. Currently I convert the docx to a txt file, and then write the txt file to a pdf.

But I would like to convert the docx directly to a pdf from the XML parsed with lxml (preserving the document's structure and formatting).

Is there a streamlined way to do this?

Current docx to txt conversion (the first step of the docx to pdf conversion):

from shutil import copyfile, rmtree
import codecs
import sys
import os
import zipfile
from lxml import etree

# Copy the .docx to a .zip and unpack it into a temporary directory
zip_dir = sys.argv[1]
zip_dir_zip_ext = os.path.splitext(zip_dir)[0] + '.zip'
copyfile(zip_dir, zip_dir_zip_ext)
zip_ref = zipfile.ZipFile(zip_dir_zip_ext, 'r')
zip_ref.extractall('./temp')

# Collect the text of every non-empty <w:t> run in the main document part
data = etree.parse('./temp/word/document.xml')
result = [node.text.strip() for node in data.xpath("//w:t", namespaces={'w':'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}) if node.text]

# Write the extracted text, one run per line
with codecs.open(os.path.splitext(zip_dir)[0]+'_converted_temp.txt', 'w', 'UTF-8') as txt:
    joined_result = '\n'.join(result)
    txt.write(joined_result)

# Remove the temporary files
zip_ref.close()
rmtree('./temp')
os.remove(zip_dir_zip_ext)

Inspiration: How do I write a python script that can read doc/docx files and convert them to txt?
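
For reference, here is a minimal sketch of what keeping some of the structure could look like: it groups the `w:t` runs by their enclosing `w:p` paragraph instead of flattening every run onto its own line. It is only an illustration, not a full converter. The helper name `docx_paragraphs` is hypothetical, it uses the standard-library `xml.etree.ElementTree` rather than lxml so it runs without extra packages, and it ignores styles, tables, and nested paragraphs.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace in ElementTree's "{uri}tag" notation
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(docx_file):
    """Return the text of each <w:p> paragraph, in document order.

    docx_file may be a path or a file-like object; a .docx is itself a
    zip archive, so no copy or temporary directory is needed.
    """
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(W + "p"):
        # Join the runs of one paragraph without inserting line breaks
        text = "".join(node.text or "" for node in para.iter(W + "t"))
        paragraphs.append(text)
    return paragraphs
```

Each returned string corresponds to one paragraph, so whatever writes the pdf could at least keep paragraph breaks where the document has them.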

piskot
tim_xyz
  • Have a look at pandoc, a tool built specifically for this purpose. – Dima Chubarov Dec 08 '17 at 16:24
  • "I am converting docx files to pdf. Currently I'm converting the docx to txt file, and then writing the txt file to a pdf." How exactly do you do that? – mzjn Dec 09 '17 at 08:07
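
The pandoc suggestion in the first comment could be driven from Python roughly as follows. This is a sketch under assumptions: pandoc must be installed and on the PATH, its pdf output additionally needs a PDF engine (a LaTeX installation by default), and the helper names `pandoc_command` and `convert_with_pandoc` are hypothetical.

```python
import os
import subprocess


def pandoc_command(docx_path):
    """Build the pandoc invocation that writes a .pdf next to the .docx."""
    pdf_path = os.path.splitext(docx_path)[0] + ".pdf"
    return ["pandoc", docx_path, "-o", pdf_path]


def convert_with_pandoc(docx_path):
    """Run pandoc; raises CalledProcessError if the conversion fails."""
    subprocess.run(pandoc_command(docx_path), check=True)
```

Calling `convert_with_pandoc('report.docx')` would then attempt to produce `report.pdf` in the same directory.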
