I'm trying to convert a PDF to DOCX using Adobe's REST API: https://documentcloud.adobe.com/document-services/index.html#post-exportPDF

The problem is that I can't save the result locally without corrupting it. I found another thread with a similar goal, but the solution there isn't working for me. Here are the relevant parts of my script:


    import json
    import requests

    url = "https://cpf-ue1.adobe.io/ops/:create?respondWith=%7B%22reltype%22%3A%20%22http%3A%2F%2Fns.adobe.com%2Frel%2Fprimary%22%7D"

    payload = {}

    payload['contentAnalyzerRequests'] = json.dumps(
        {
            "cpf:engine": {
                "repo:assetId": "urn:aaid:cpf:Service-26c7fda2890b44ad9a82714682e35888"
            },
            "cpf:inputs": {
                "params": {
                    "cpf:inline": {
                        "targetFormat": "docx"
                    }
                },
                "documentIn": {
                    "dc:format": "application/pdf",
                    "cpf:location": "InputFile"
                }
            },
            "cpf:outputs": {
                "documentOut": {
                    "dc:format": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
                    "cpf:location": docx_filename,
                }
            }
        }
    )

    # Open the PDF in a context manager so the handle is closed after the upload.
    with open(filename, 'rb') as pdf_file:
        response = requests.post(url, headers=headers, data=payload,
                                 files={'InputFile': pdf_file})
    location = response.headers['location']
    ...
    # polling here to make sure export is complete
    ...

    if response.status_code == 200:
        print('Export complete, saving file locally.')
        write_to_file(docx_filename, response)



    def write_to_file(filename, response):
        # Stream the response body to disk in 1 MiB chunks.
        with open(filename, 'wb') as f:
            for chunk in response.iter_content(1024 * 1024):
                f.write(chunk)

What I think is the issue (or at least a clue towards a solution) is the following text at the beginning of response.content:

--Boundary_357737_1222103332_1635257304781
Content-Type: application/json
Content-Disposition: form-data; name="contentAnalyzerResponse"

{"cpf:inputs":{"params":{"cpf:inline":{"targetFormat":"docx"}},"documentIn":{"dc:format":"application/pdf","cpf:location":"InputFile"}},"cpf:engine":{"repo:assetId":"urn:aaid:cpf:Service-26c7fda2890b44ad9a82714682e35888"},"cpf:status":{"completed":true,"type":"","status":200},"cpf:outputs":{"documentOut":{"cpf:location":"output/pdf_test.docx","dc:format":"application/vnd.openxmlformats-officedocument.wordprocessingml.document"}}}
--Boundary_357737_1222103332_1635257304781
Content-Type: application/octet-stream
Content-Disposition: form-data; name="output/pdf_test.docx"
... actual byte content starts here...

Why is this being sent? Am I writing the content to the file incorrectly? (I've tried f.write(response.content) as well, with the same result.) Should I be sending a different request to Adobe?

  • That Boundary junk is actually ok -- but is it getting into your final downloaded `.docx` file on disk? –  Oct 26 '21 at 14:22
  • @user17242583 Yes, it's being written to the file along with the actual bytes. If I manually strip that entire part, the file is saved correctly and Word will open it. If I just write all the contents to the file (along with the Boundary stuff), Word will not open the docx and says it's corrupted. I'm curious to know what that part is, why it is in the Adobe response, and whether there is a better approach than stripping that entire part in order to create a valid docx file. – Marjan Stojanov Oct 27 '21 at 15:47

1 Answer

That extra text is there so that the server can send multiple files in a single response (a multipart body); see https://stackoverflow.com/a/20321259. The response you're getting contains two parts: a JSON part named contentAnalyzerResponse, and the Word doc, named output/pdf_test.docx.

You can probably parse the files using parse_form_data from werkzeug.formparser, as demonstrated here, which I've done successfully before, but I'm not sure how to get it working with multiple files.

About your question regarding stripping the content: in light of what I said above, yes, it's perfectly fine to strip it like you're doing.

Note: I'd recommend opening the file in a text editor and checking at the very end of the file to make sure that there isn't any additional --Boundary... stuff that you'll also want to strip out.
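
If you'd rather not strip the text by hand, the splitting can be sketched with the standard library alone. This is only a rough sketch, not Adobe's recommended approach: split_multipart and _header_param are names I made up, and it assumes the response's Content-Type header carries the boundary parameter and that the body uses CRLF line endings, as multipart bodies normally do.

```python
from email.message import Message


def _header_param(header_name, header_value, param):
    """Return one parameter (e.g. boundary, name) from a MIME-style header value."""
    msg = Message()
    msg[header_name] = header_value
    return msg.get_param(param, header=header_name)


def split_multipart(body, content_type):
    """Map each part's form-data name to its raw payload bytes.

    Hypothetical helper: assumes CRLF line endings and a boundary
    parameter in the Content-Type header value.
    """
    boundary = _header_param("Content-Type", content_type, "boundary")
    if boundary is None:
        raise ValueError("no boundary parameter in Content-Type header")
    delimiter = b"--" + boundary.encode("ascii")
    parts = {}
    # Element 0 is whatever precedes the first delimiter; skip it.
    for chunk in body.split(delimiter)[1:]:
        if chunk.startswith(b"--"):
            break  # closing delimiter, i.e. "--boundary--"
        # The part headers end at the first blank line; the rest is payload.
        raw_headers, _, payload = chunk.partition(b"\r\n\r\n")
        # Drop the CRLF that separates the payload from the next delimiter.
        if payload.endswith(b"\r\n"):
            payload = payload[:-2]
        for line in raw_headers.decode("latin-1").splitlines():
            key, _, value = line.partition(":")
            if key.strip().lower() == "content-disposition":
                name = _header_param("Content-Disposition", value.strip(), "name")
                parts[name] = payload
    return parts
```

With the response from the question, usage would be something like parts = split_multipart(response.content, response.headers['Content-Type']), followed by writing parts['output/pdf_test.docx'] to disk in 'wb' mode. The part name to look up is the cpf:location value reported in the JSON part. Because the closing delimiter is dropped as well, this also covers the trailing --Boundary... text mentioned in the note above. If adding a dependency is acceptable, the requests-toolbelt package ships a MultipartDecoder that should do the same job.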