I want to convert a .pdf file into a .docx file. I have tried a few approaches, and this one seems the best (correct me if I am wrong). I have seen this SO question, but its answer did not work for me. It is the same as this:

import PyPDF2

path = r"C:\Users\name\Desktop\test maker tester\Computer Science\414838-2020-specimen-paper-1.pdf"
text = ""
# Open the PDF in binary mode and collect the text of every page
# (PyPDF2 1.x API: PdfFileReader, numPages, getPage, extractText).
with open(path, 'rb') as pdf_file:
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    c = read_pdf.numPages
    for i in range(c):
        page = read_pdf.getPage(i)
        text += page.extractText()

It does not give me an error, but I can't find any Word document, and the PDF is still there...

Do you know how to fix this, or can suggest any other way to turn a .pdf into a .docx file?

SherylHohman
wondercoll
  • Now it doesn't show an error, but I can't find a Word document... and the PDF is still there, but thanks!! – wondercoll Jan 21 '20 at 18:57
  • try this thread: https://stackoverflow.com/questions/50982064/converting-docx-to-pdf-with-pure-python-on-linux-without-libreoffice – APhillips Jan 21 '20 at 20:26
  • Re, "...but I can't find any Word document." Why do you expect to find one? I don't know PyPDF2, but I don't see anything in your code snippet that says anything about Word documents or about creating any file. Your variable `text` is initialized with a string, and it sure looks like your loop extracts text strings from the PDF file and appends them to the `text` string. – Solomon Slow Jan 21 '20 at 20:43
  • wow i am blind! thanks a lot!! – wondercoll Jan 22 '20 at 16:38

1 Answer

There is no direct way or package in Python that converts PDF to DOCX seamlessly. The method you tried only extracts the text of the PDF; even if you wrote that text into a .docx file, all the formatting of the document would be lost and you would get only plain text without the styles.
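To illustrate the plain-text route, here is a minimal sketch that writes an already extracted string into a bare-bones .docx file using only the Python standard library. The function name `write_plain_docx` and the output file name are made up for illustration. It produces unstyled paragraphs only, so it shows the limitation described above rather than working around it. A library such as python-docx would be the more usual choice; the standard library is used here only to keep the sketch free of dependencies.

```python
import zipfile
from xml.sax.saxutils import escape

# Namespace of the main WordprocessingML document part.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Declares the content type of each part in the package.
CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" '
    'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" '
    'ContentType="application/vnd.openxmlformats-officedocument.'
    'wordprocessingml.document.main+xml"/>'
    '</Types>'
)

# Points the package at word/document.xml as its main document.
RELS = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Relationships '
    'xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" '
    'Type="http://schemas.openxmlformats.org/officeDocument/2006/'
    'relationships/officeDocument" Target="word/document.xml"/>'
    '</Relationships>'
)


def _clean(line: str) -> str:
    """Drop control characters (other than tab), which XML 1.0 does not allow."""
    return "".join(ch for ch in line if ch == "\t" or ord(ch) >= 0x20)


def write_plain_docx(text: str, docx_path: str) -> None:
    """Write `text` to a minimal .docx, one unstyled paragraph per line."""
    paragraphs = "".join(
        '<w:p><w:r><w:t xml:space="preserve">'
        + escape(_clean(line))
        + '</w:t></w:r></w:p>'
        for line in text.splitlines()
    )
    document_xml = (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        f'<w:document xmlns:w="{W_NS}"><w:body>{paragraphs}</w:body></w:document>'
    )
    # A .docx file is a ZIP archive containing these three parts at minimum.
    with zipfile.ZipFile(docx_path, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("[Content_Types].xml", CONTENT_TYPES)
        archive.writestr("_rels/.rels", RELS)
        archive.writestr("word/document.xml", document_xml)
```

With the question's code, calling `write_plain_docx(text, "output.docx")` after the loop would save the extracted text, but fonts, images, tables, and layout from the PDF would not carry over.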

I have personally tried Adobe's Document Cloud SDK from Python; it converts PDF to DOCX while preserving the original formatting of the PDF document. Conversion takes about 15 seconds per document. The links below explain how to get started:

https://github.com/adobe/dc-view-sdk-samples

https://www.adobe.io/apis/documentcloud/dcsdk/docs.html

As for using this service from Python, you have to invoke its command-line commands with subprocess or os.system.
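As a rough sketch of the subprocess approach, the helper below runs an external command and raises an error if it fails. The helper name `run_converter` and the timeout value are illustrative assumptions, and the actual command line of the Adobe tooling is not shown here because it is not given in this answer; the caller supplies it as a list of arguments. The example uses the Python interpreter itself as a stand-in command.

```python
import subprocess
import sys
from typing import Sequence


def run_converter(command: Sequence[str], timeout_seconds: float = 120.0) -> str:
    """Run an external command and return what it printed to standard output.

    Raises subprocess.CalledProcessError if the command exits with a
    non-zero status, and subprocess.TimeoutExpired if it runs too long.
    """
    completed = subprocess.run(
        list(command),
        capture_output=True,      # collect stdout and stderr
        text=True,                # decode the output to str
        check=True,               # raise on a non-zero exit status
        timeout=timeout_seconds,
    )
    return completed.stdout


if __name__ == "__main__":
    # Stand-in command for illustration only; replace it with the real
    # conversion command documented by the service you are using.
    stand_in = [sys.executable, "-c", "print('conversion finished')"]
    print(run_converter(stand_in))
```

Passing the command as a list instead of a single shell string avoids quoting problems with paths that contain spaces, such as the one in the question.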

Update:

A detailed explanation of how to implement this method is available here: Link. Although it covers OCR conversion, the exact same process works for converting a PDF to DOCX.

Karthick Mohanraj