Convert a PDF to DOCX using Adobe PDF Services via REST API (with Python)

Question

I am trying to query the Adobe PDF Services API to export DOCX files from PDF documents.

I wrote Python code to generate a Bearer token so that I can authenticate with Adobe PDF Services (see the question here: https://stackoverflow.com/questions/68351955/tunning-a-post-request-to-reach-adobe-pdf-services-using-python-and-a-rest-api). Then I wrote the following code, trying to follow the instructions on this page for the EXPORT operation of Adobe PDF Services (here: https://documentcloud.adobe.com/document-services/index.html#post-exportPDF).

Here is the code:

import requests
import json
from requests.structures import CaseInsensitiveDict

# N.B.: I have left out the part of the code that generates the token and authenticates with the server

# This part is a POST request to upload my PDF file via form parameters

URL = "https://cpf-ue1.adobe.io/ops/:create?respondWith=%257B%2522reltype%2522%253A%2520%2522http%253A%252F%252Fns.adobe.com%252Frel%252Fprimary%2522%257D"

headers = CaseInsensitiveDict()
headers["x-api-key"] = "client_id"
headers["Authorization"] = "Bearer MYREALLYLONGTOKENIGOT"
headers["Content-Type"] = "application/json"

myfile = {"file":open("absolute_path_to_the_pdf_file/input.pdf", "rb")}

j="""
{
  "cpf:engine": {
    "repo:assetId": "urn:aaid:cpf:Service-26c7fda2890b44ad9a82714682e35888"
  },
  "cpf:inputs": {
    "params": {
      "cpf:inline": {
        "targetFormat": "docx"
      }
    },
    "documentIn": {
      "dc:format": "application/pdf",
      "cpf:location": "C:/Users/a-bensghir/Downloads/P_D_F/trs_pdf_file_copy.pdf"
    }
  },
  "cpf:outputs": {
    "documentOut": {
      "dc:format": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
      "cpf:location": "C:/Users/a-bensghir/Downloads/P_D_F/output.docx"
    }
  }
}"""

resp = requests.post(url=URL, headers=headers, json=json.dumps(j), files=myfile)
   

print(resp.text)
print(resp.status_code)

The status code is 400. I am correctly authenticated by the server, but print(resp.text) gives the following:

{"requestId":"the_request_id","type":"Bad Request","title":"Not a multipart request. Aborting.","status":400,"report":"{\"error_code\":\"INVALID_MULTIPART_REQUEST\"}"}

I think I am misunderstanding the "form parameters" described in the Adobe guide for the POST method of the EXPORT job (https://documentcloud.adobe.com/document-services/index.html).

Do you have any ideas for fixing this? Thank you!

Two form parameters should be declared as per the Adobe guide. One is contentAnalyzerRequests, the JSON called "j" in my code, which I didn't include in the request because I don't know how. Could you please help? Thanks! — Abdel, Jul 13 '21 at 15:45
So I am _very_ new to Python, but are you sure you are creating a multipart request correctly? The error seems to imply you are not. — Raymond Camden, Jul 13 '21 at 19:43
For example, maybe this: https://stackoverflow.com/a/15785071/52160 — Raymond Camden, Jul 13 '21 at 19:44

PGHE · Accepted Answer · 2021-07-13T22:58:45.857

Make your variable j a Python dict first, then create a JSON string from it. What's also not very clear from Adobe's documentation is that the value of documentIn.cpf:location needs to be the same as the key used for your file. I've corrected this to InputFile0 in your script. I'm also guessing you want to save your file, so I've added that too.

import requests
import json
import time

URL = "https://cpf-ue1.adobe.io/ops/:create?respondWith=%257B%2522reltype%2522%253A%2520%2522http%253A%252F%252Fns.adobe.com%252Frel%252Fprimary%2522%257D"

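# token and client_id are assumed to be defined earlier
# (the bearer token and API key from the authentication step)
headers = {
    'Authorization': f'Bearer {token}',
    'Accept': 'application/json, text/plain, */*',
    'x-api-key': client_id,
    'Prefer': "respond-async,wait=0",
}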

myfile = {"InputFile0":open("absolute_path_to_the_pdf_file/input.pdf", "rb")}

j={
  "cpf:engine": {
    "repo:assetId": "urn:aaid:cpf:Service-26c7fda2890b44ad9a82714682e35888"
  },
  "cpf:inputs": {
    "params": {
      "cpf:inline": {
        "targetFormat": "docx"
      }
    },
    "documentIn": {
      "dc:format": "application/pdf",
      "cpf:location": "InputFile0"
    }
  },
  "cpf:outputs": {
    "documentOut": {
      "dc:format": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
      "cpf:location": "C:/Users/a-bensghir/Downloads/P_D_F/output.docx"
    }
  }
}

body = {"contentAnalyzerRequests": json.dumps(j)}

resp = requests.post(url=URL, headers=headers, data=body, files=myfile)
   

print(resp.text)
print(resp.status_code)

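# Poll the status URL until the converted document is ready
poll = True
while poll:
    new_request = requests.get(resp.headers['location'], headers=headers)
    if new_request.status_code == 200:
        # Use a context manager so the file is flushed and closed
        with open('test.docx', 'wb') as f:
            f.write(new_request.content)
        poll = False
    else:
        time.sleep(5)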
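
The polling loop above keeps retrying for as long as the status is not 200, so a failed job would make it spin forever. A minimal sketch of a bounded variant follows. The names poll_until_ready and fetch, and the interval and timeout values, are hypothetical and chosen only for illustration. Treating any status of 400 or above as a failure, and anything else that is not 200 as "not ready yet", is an assumption rather than documented Adobe behaviour.

```python
import time


def poll_until_ready(fetch, interval=5.0, timeout=120.0, sleep=time.sleep):
    """Call fetch() until it reports HTTP 200, then return the body.

    fetch is a hypothetical callable returning (status_code, content).
    Raises RuntimeError on an error status (assumed: >= 400) and
    TimeoutError if no 200 arrives within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        status, content = fetch()
        if status == 200:
            return content
        if status >= 400:
            # Assumption: an error status means the job will not recover
            raise RuntimeError(f"polling failed with HTTP {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no result after {timeout} seconds")
        sleep(interval)
```

With requests, fetch could be a small function that calls requests.get(resp.headers['location'], headers=headers) and returns (r.status_code, r.content). The returned bytes can then be written to the output file once.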

Thank you! I took these suggestions into account. Then I got the following message from print(resp.text): {"requestId":"aSeriesOfLetters","type":"Bad Request","title":"Not a multipart request. Aborting.","status":400,"report":"{\"error_code\":\"INVALID_MULTIPART_REQUEST\"}"} — Abdel, Jul 13 '21 at 21:38
Make sure you're using the `data` attribute in the `requests.post` and not `json`. — PGHE, Jul 13 '21 at 21:45
Sorry bout that, you're right there's a second request to get the doc. — PGHE, Jul 13 '21 at 22:01
Now it's working well, thanks! I still have a problem: I don't know why the docx file (it is created, by the way) doesn't open; a popup says the content is not readable. Maybe it's due to the `'wb'` parsing method. — Abdel, Jul 13 '21 at 22:15
Try with a polling mechanism. You might be getting caught out with requesting the doc before it's ready. See edit. — PGHE, Jul 13 '21 at 22:59

score 1 · Answer 2 · answered Dec 05 '21 at 23:55

I don't know why the docx file (it is created, by the way) doesn't open; a popup says the content is not readable. Maybe it's due to the 'wb' parsing method.

I had the same issue. Casting the response content to bytes solved it.

poll = True
while poll:
    new_request = requests.get(resp.headers['location'], headers=headers)
    if new_request.status_code == 200:
        with open('test.docx', 'wb') as f:
            f.write(bytes(new_request.content))
        poll = False
    else:
        time.sleep(5)
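
Both answers write the response bytes straight to disk, so an unreadable file can be hard to diagnose. Because a DOCX file is a ZIP archive, one way to check the bytes before saving them is sketched below. The helper name looks_like_docx is hypothetical, and the test for a word/document.xml entry relies on the usual layout of Word documents rather than on anything the Adobe API guarantees.

```python
import io
import zipfile


def looks_like_docx(data: bytes) -> bool:
    """Return True if `data` appears to be a DOCX (ZIP) package."""
    # ZIP archives start with the "PK" signature; JSON or multipart
    # wrappers around the file would not.
    if not data.startswith(b"PK"):
        return False
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            # Word documents typically store the main body here
            return "word/document.xml" in archive.namelist()
    except zipfile.BadZipFile:
        return False
```

If the check fails, printing the response's Content-Type header and the first few bytes of the body may show whether the server returned something other than the raw document.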
