PDF to Word Doc in Python

Question

I've read through the other Stack Overflow questions about this, but they don't answer my issue, so downvote away. I'm using Python 2.7.

All I want to do is use Python to convert a PDF to a Word doc. At minimum, I'd like to convert it to text so I can copy and paste it into a Word doc.

This is the code I have so far. All it prints is the female gender symbol.

Is my code wrong? Am I approaching this wrong? Do some PDFs just not work with PDFMiner? Do you know of any other alternatives to accomplish my goal of converting a PDF to Word, besides using PyPDF2 or PDFMiner?

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')  # open the PDF that was passed in rather than a hard-coded file
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text

print convert_pdf_to_txt('Bottom Dec.pdf')

Do you have LibreOffice installed? If so, read this answer: http://stackoverflow.com/a/26358582/797495 – Pedro Lobito, Oct 22 '15 at 00:33
Alas, I do not. Just plain old MS Word, and an outdated one at that... 2003. It's my workplace, not me. I did see that one though. – staredecisis, Oct 22 '15 at 00:36
"Do some PDFs just not work with PDFMiner?" Yes. It is not *a fact* that 'one can *always* extract *all* text correctly from *every* PDF'. Please post a link to one of the PDFs you are having problems with, so we can determine whether the problem lies in your code, in PDFMiner, or in the PDF itself, which may not contain any extractable text at all. – Jongware, Oct 22 '15 at 12:59
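
A possible reading of the "female gender symbol", offered as an assumption rather than something confirmed in this thread: pdfminer's TextConverter appears to write a form feed character after each page, and a Windows console using code page 437 draws that byte (0x0C) as ♀. If so, the output is a page break with no text in front of it, which fits the "no extractable text" case raised in the comment above (a scanned PDF, for example). The helper below is a hypothetical sketch, not part of pdfminer; `summarize_extraction` is a name made up for illustration.

```python
def summarize_extraction(text):
    # Assumption: pages in the extracted text are separated by form feeds
    # (0x0C), which is how pdfminer's TextConverter output appears to work.
    pages = text.split('\x0c')
    # A trailing form feed leaves one empty chunk at the end; drop it.
    if pages and pages[-1] == '':
        pages.pop()
    # A page counts as having text if anything other than whitespace remains.
    pages_with_text = [page for page in pages if page.strip()]
    return {'pages': len(pages), 'pages_with_text': len(pages_with_text)}
```

Passing it the return value of convert_pdf_to_txt would show how many pages came back and how many of them hold any text. If pages_with_text is 0, the PDF most likely needs OCR rather than a different extraction library.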

score 2 · Answer 1 · answered Mar 30 '22 at 10:18

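from pdf2docx import Converter

# Raw strings keep the backslashes in Windows paths from being read as escape sequences.
pdf_file = r'E:\Muhammad UMER LAR.pdf'
doc_file = r'E:\Lari.docx'

c = Converter(pdf_file)  # open the source PDF
c.convert(doc_file)      # write the converted document to doc_file
c.close()                # release the PDF file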
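
For readers who want the snippet above as a reusable function, here is a minimal sketch. It assumes pdf2docx is installed (it requires Python 3) and that Converter.convert accepts zero-based start and end page arguments; the wrapper name pdf_to_docx and its parameters are hypothetical, chosen only for illustration.

```python
import os


def pdf_to_docx(pdf_path, docx_path, start=0, end=None):
    # Fail early with a clear error if the input file is missing.
    if not os.path.isfile(pdf_path):
        raise FileNotFoundError(pdf_path)

    # Imported here so the path check above runs even if pdf2docx is absent.
    from pdf2docx import Converter

    converter = Converter(pdf_path)
    try:
        # start and end are zero-based page indexes; end=None means
        # "convert through the last page".
        converter.convert(docx_path, start=start, end=end)
    finally:
        # Close the PDF even if the conversion raises.
        converter.close()
    return docx_path
```

The output is a .docx file, so Word 2003 may need Microsoft's Compatibility Pack to open it.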

Muhammad Umer Lari


Please don't post only code as answer, but also provide an explanation what your code does and how it solves the problem of the question. Answers with an explanation are usually more helpful and of better quality, and are more likely to attract upvotes. – Mark Rotteveel Mar 30 '22 at 16:09

score 0 · Answer 2 · answered Sep 27 '19 at 06:51

Another alternative is the Aspose.Words Cloud SDK for Python. You can install it with pip and use it for PDF to DOC conversion.

import asposewordscloud
import asposewordscloud.models.requests

api_client = asposewordscloud.ApiClient()
api_client.configuration.host = 'https://api.aspose.cloud'
# Get your AppKey and AppSID from https://dashboard.aspose.cloud/
api_client.configuration.api_key['api_key'] = 'xxxxxxxxxxxxxxxxxxxxx'  # put your AppKey here
api_client.configuration.api_key['app_sid'] = 'xxxxxxxxx-xxxx-xxxxx-xxxx-xxxxxxxxxx'  # put your AppSID here

words_api = asposewordscloud.WordsApi(api_client)
filename = '02_pages.pdf'
remote_name = 'TestPostDocumentSaveAs.pdf'
dest_name = 'TestPostDocumentSaveAs.doc'

# Upload the PDF file to cloud storage
request_storage = asposewordscloud.models.requests.UploadFileRequest(filename, remote_name)
response = words_api.upload_file(request_storage)

# Convert the PDF to DOC and save the result to storage
save_options = asposewordscloud.SaveOptionsData(save_format='doc', file_name=dest_name)
request = asposewordscloud.models.requests.SaveAsRequest(remote_name, save_options)
result = words_api.save_as(request)
print("Result {}".format(result))

I'm a developer evangelist at Aspose.
