  • Python 2.7 (r27:82525, Jul 4 2010, 09:01:59) [MSC v.1500 32 bit (Intel)] on win32
  • Windows XP SP3
  • Python 2.7 pywin32-218
  • Adobe Acrobat X 10.0.0

I want to use Python to automate Acrobat Pro to export a PDF to XML. I already tried it manually using the 'Save As' dialog box from the running program and now want to do it via a Python script. I have read many pages including parts of the Adobe SDK, SDK Forum, VB Forums and am having no luck.

I read Blish's problem here: "Not implemented" Exception when using pywin32 to control Adobe Acrobat

And this page: timgolden python/win32_how_do_i/generate-a-static-com-proxy.html

I am missing something. My code is:

import win32com.client
import win32com.client.makepy

win32com.client.makepy.GenerateFromTypeLibSpec('Acrobat')
adobe = win32com.client.DispatchEx('AcroExch.App')
avDoc = win32com.client.DispatchEx('AcroExch.AVDoc')
avDoc.Open('C:\Documents and Settings\PC\Desktop\a_PDF.pdf', 'C:\Documents and Settings\PC\Desktop')
pdDoc = avDoc.GetPDDoc()
jObject = pdDoc.GetJSObject()
jObject.SaveAs('C:\Documents and Settings\PC\Desktop\a_PDF.xml', "com.adobe.acrobat.xml-1-00")

The full error is:

Traceback (most recent call last):
  File "<pyshell#31>", line 1, in <module>
    jObject.SaveAs('C:\Documents and Settings\PC\Desktop\a_PDF.xml', "com.adobe.acrobat.xml-1-00")
  File "C:\Python27\lib\site-packages\win32com\client\dynamic.py", line 511, in __getattr__
    ret = self._oleobj_.Invoke(retEntry.dispid,0,invoke_type,1)
com_error: (-2147467263, 'Not implemented', None, None)

I'm guessing it has to do with makepy, but I don't understand how to use it in my code.

I pulled this line from my code and got the same error when I ran it:

win32com.client.makepy.GenerateFromTypeLibSpec('Acrobat')

I then changed these two lines from 'DispatchEx' to 'Dispatch' and got the same error:

adobe = win32com.client.Dispatch('AcroExch.App')
avDoc = win32com.client.Dispatch('AcroExch.AVDoc')

When I run the Dispatches by themselves and then call them back I get:

>>> adobe = win32com.client.DispatchEx('AcroExch.App')
>>> adobe
<win32com.gen_py.Adobe Acrobat 10.0 Type Library.CAcroApp instance at 0x18787784>
>>> avDoc = win32com.client.Dispatch('AcroExch.AVDoc')
>>> avDoc
<win32com.gen_py.Adobe Acrobat 10.0 Type Library.CAcroAVDoc instance at 0x20365224>

Does this mean I should make only one call to Dispatch? I pulled:

adobe = win32com.client.Dispatch('AcroExch.App')

and got the same error.

This Adobe site says:

AVDoc    
Product availability: Acrobat, Reader
Platform availability: Macintosh, Windows, UNIX
Syntax
typedef struct _t_AVDoc* AVDoc;

A view of a PDF document in a window. There is one AVDoc per displayed document. Unlike a PDDoc, an AVDoc has a window associated with it.
  • acrobat_sdk/9.1/Acrobat9_1_HTMLHelp/API_References/Acrobat_API_Reference/AV_Layer/AVDoc.html#AVDocSaveParams

The PDDoc page says:

A PDDoc object represents a PDF document. There is a correspondence between a PDDoc and an ASFile. Also, every AVDoc has an associated PDDoc, although a PDDoc may not be associated with an AVDoc.
  • /9.1/Acrobat9_1_HTMLHelp/API_References/Acrobat_API_Reference/PD_Layer/PDDoc.html

I tried the following code and also got the same error:

import win32com.client
import win32com.client.makepy

pdDoc = win32com.client.Dispatch('AcroExch.PDDoc')
pdDoc.Open('C:\Documents and Settings\PC\Desktop\a_PDF.pdf')
jObject = pdDoc.GetJSObject()
jObject.SaveAs('C:\Documents and Settings\PC\Desktop\a_PDF.xml', "com.adobe.acrobat.xml-1-00")

Same error if I change:

pdDoc = win32com.client.Dispatch('AcroExch.PDDoc')

to

pdDoc = win32com.client.gencache.EnsureDispatch('AcroExch.PDDoc')

like here: win32com.client.Dispatch works but not win32com.client.gencache.EnsureDispatch

user2993272

1 Answer


user2993272, you were almost there: just one more line and the code you have should have worked flawlessly.

I'm going to attempt to answer in the same spirit as your question and give you as much detail as I can.

This thread holds the key to the solution you are looking for: https://mail.python.org/pipermail/python-win32/2002-March/000260.html

I admit that the post is not the easiest to find (perhaps Google scores it low based on the age of the content?).

Specifically, applying this piece of advice will get things running for you: https://mail.python.org/pipermail/python-win32/2002-March/000265.html
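
To make the advice less of a black box: as far as I can tell from your traceback and from that post, win32com's dynamic dispatch first tries an unknown attribute as a property read, and only falls back to treating it as a method when the resulting HRESULT is listed in `ERRORS_BAD_CONTEXT`. Acrobat's JSObject appears to answer that property read with `E_NOTIMPL`, which is not in the list by default, so the error escapes. The toy model below imitates that logic in plain Python. `FakeComError` and `FakeJSObject` are stand-ins I made up for illustration, not pywin32 classes, and the real `dynamic.py` does considerably more.

```python
E_NOTIMPL = -2147467263              # HRESULT shown in the question's traceback
DISP_E_MEMBERNOTFOUND = -2147352573  # a code that is treated as "bad context" by default

# Stand-in for win32com.client.dynamic.ERRORS_BAD_CONTEXT
ERRORS_BAD_CONTEXT = [DISP_E_MEMBERNOTFOUND]


class FakeComError(Exception):
    """Hypothetical stand-in for pythoncom.com_error; carries only an hresult."""

    def __init__(self, hresult):
        Exception.__init__(self, hresult)
        self.hresult = hresult


class FakeJSObject(object):
    """Toy dispatch object: every unknown attribute is first tried as a property."""

    def _property_get(self, name):
        # Assumption: the property-style read of SaveAs fails with E_NOTIMPL,
        # which is what the traceback in the question suggests.
        raise FakeComError(E_NOTIMPL)

    def __getattr__(self, name):
        try:
            return self._property_get(name)
        except FakeComError as err:
            if err.hresult in ERRORS_BAD_CONTEXT:
                # "Expected" failure: assume the name is a method instead.
                return lambda *args: "called %s%r" % (name, args)
            raise


js = FakeJSObject()

try:
    js.SaveAs
    raised_before_patch = False
except FakeComError:
    raised_before_patch = True

ERRORS_BAD_CONTEXT.append(E_NOTIMPL)  # the one-line patch
result_after_patch = js.SaveAs("out.xml", "com.adobe.acrobat.xml-1-00")
```

Before the patch the attribute lookup raises; after it, the same lookup returns a callable, which is the behaviour the real patch is meant to unlock.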

For completeness, this piece of code should get the job done and not require you to manually patch dynamic.py (snippet should run pretty much out of the box):

# Finds all files under ROOT_INPUT_PATH that match INPUT_FILE_EXTENSION and extracts their text into ROOT_OUTPUT_PATH, keeping each input file's base name and using OUTPUT_FILE_EXTENSION as the new extension
from win32com.client import Dispatch
from win32com.client.dynamic import ERRORS_BAD_CONTEXT

import winerror

# Use scandir's walk when it is installed, since it is considerably faster than the stock os.walk
try:
    from scandir import walk
except ImportError:
    from os import walk

import fnmatch

import sys
import os

ROOT_INPUT_PATH = None
ROOT_OUTPUT_PATH = None
INPUT_FILE_EXTENSION = "*.pdf"
OUTPUT_FILE_EXTENSION = ".txt"

def acrobat_extract_text(f_path, f_path_out, f_basename, f_ext):
    avDoc = Dispatch("AcroExch.AVDoc") # Connect to Adobe Acrobat

    # Open the input file (as a pdf)
    ret = avDoc.Open(f_path, f_path)
    assert ret  # FIXME: Documentation says "-1 if the file was opened successfully, 0 otherwise", but this is a bool in practice?

    pdDoc = avDoc.GetPDDoc()

    dst = os.path.join(f_path_out, ''.join((f_basename, f_ext)))

    # Adobe documentation says "For that reason, you must rely on the documentation to know what functionality is available through the JSObject interface. For details, see the JavaScript for Acrobat API Reference"
    jsObject = pdDoc.GetJSObject()

    # Here you can save as many other types by using, for instance: "com.adobe.acrobat.xml"
    jsObject.SaveAs(dst, "com.adobe.acrobat.accesstext")

    pdDoc.Close()
    avDoc.Close(True) # We want this to close Acrobat, as otherwise Acrobat refuses to process any further files once a certain number of open files is reached (for example 50 PDFs)
    del pdDoc

if __name__ == "__main__":
    assert(5 == len(sys.argv)), sys.argv # <script name>, <script_file_input_path>, <script_file_input_extension>, <script_file_output_path>, <script_file_output_extension>

    #$ python get.txt.from.multiple.pdf.py 'C:\input' '*.pdf' 'C:\output' '.txt'

    ROOT_INPUT_PATH = sys.argv[1]
    INPUT_FILE_EXTENSION = sys.argv[2]
    ROOT_OUTPUT_PATH = sys.argv[3]
    OUTPUT_FILE_EXTENSION = sys.argv[4]

    # tuples are of schema (path_to_file, filename)
    matching_files = ((os.path.join(_root, filename), os.path.splitext(filename)[0]) for _root, _dirs, _files in walk(ROOT_INPUT_PATH) for filename in fnmatch.filter(_files, INPUT_FILE_EXTENSION))

    # The key step that should get everything working for you:
    # patch ERRORS_BAD_CONTEXT as per https://mail.python.org/pipermail/python-win32/2002-March/000265.html
    # (no `global` statement is needed at module level; the list is mutated in place)
    ERRORS_BAD_CONTEXT.append(winerror.E_NOTIMPL)

    for filename_with_path, filename_without_extension in matching_files:
        print("Processing '{}'".format(filename_without_extension))
        acrobat_extract_text(filename_with_path, ROOT_OUTPUT_PATH, filename_without_extension, OUTPUT_FILE_EXTENSION)

I have tested this on WinPython x64 2.7.6.3 with Acrobat X Pro.
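
Applied to your original goal of exporting a single PDF to XML, a minimal sketch could look like the following. Treat it as an untested variant of the script above rather than a verified recipe: `xml_path_for` and `export_pdf_to_xml` are names I made up for illustration, and the conversion ID is simply the one from your question. Note that the paths should be written as raw strings (`r'...'`), because in an ordinary string literal the `\a` in `\a_PDF.pdf` is read as the BEL control character.

```python
import ntpath

# Same value as winerror.E_NOTIMPL; it is the HRESULT in the question's traceback.
E_NOTIMPL = -2147467263


def xml_path_for(pdf_path):
    """Return the Windows path pdf_path with its extension replaced by .xml."""
    root, _ext = ntpath.splitext(pdf_path)
    return root + ".xml"


def export_pdf_to_xml(pdf_path):
    """Open pdf_path in Acrobat and save it as XML next to the original.

    Needs pywin32 and a running-capable Acrobat installation on Windows.
    """
    # Imported here so xml_path_for stays usable on machines without pywin32.
    from win32com.client import Dispatch, dynamic

    # The patch from the mailing-list post, applied only once.
    if E_NOTIMPL not in dynamic.ERRORS_BAD_CONTEXT:
        dynamic.ERRORS_BAD_CONTEXT.append(E_NOTIMPL)

    av_doc = Dispatch("AcroExch.AVDoc")
    if not av_doc.Open(pdf_path, pdf_path):
        raise IOError("Acrobat could not open %r" % pdf_path)
    try:
        js_object = av_doc.GetPDDoc().GetJSObject()
        js_object.SaveAs(xml_path_for(pdf_path), "com.adobe.acrobat.xml-1-00")
    finally:
        av_doc.Close(True)  # same call as in the batch script above


# Example call (Windows only):
# export_pdf_to_xml(r"C:\Documents and Settings\PC\Desktop\a_PDF.pdf")
```

The only real differences from your code are the patched `ERRORS_BAD_CONTEXT` list and the raw-string path.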

Subhobroto
  • Hi, I'm using python and acrobat reader pro for the same function, and currently this code gives me the following error: "NotAllowedError: Security settings prevent access to this property or method". Do you know what is causing it? Thank you – dasen Nov 13 '14 at 15:25
  • @dasen: Apologies for the delay. If you still need help, create a new question or contact me directly. – Subhobroto Jan 04 '16 at 04:04