
My use case is the following:
Once a day I upload 1000 single-page PDFs to Azure Storage and process them with Form Recognizer via the latest azure-form-recognizer Python client.

So far I’m using the async version of the client and I send the 1000 coroutines concurrently.

tasks = {asyncio.create_task(analyze_async(doc)): doc for doc in documents}
pending = set(tasks)

# Handle retry
while pending:
    # backoff in case of 429
    await asyncio.sleep(1)

    # wait until all concurrent calls have completed
    finished, pending = await asyncio.wait(
        pending, return_when=asyncio.ALL_COMPLETED
    )

    # check whether a task raised and register it for a new run
    for task in finished:
        doc = tasks[task]

        if task.exception():
            new_task = asyncio.create_task(analyze_async(doc))
            tasks[new_task] = doc
            pending.add(new_task)

Now I’m not really comfortable with this setup, mainly because of the unpredictable successive states of the service within the same iteration: it can be up, then throw 429, then be up again, which is not deterministic enough for me. I was wondering if another approach is possible. Do you think I should instead increase the number of transactions progressively, starting with 15 (the default TPS), then 50, then 100, until the queue is empty? Or is there another option? Thanks.

orville
  • Looks a bit like you should use a semaphore so that there is always a manageable number active until complete, like [this](https://stackoverflow.com/a/48486557/6242321) – jwal Nov 26 '22 at 04:22
  • Yes indeed, good point. That being said, do you think I can configure the semaphore to restrict the number of transactions per second authorized (15 per second) instead of the number of concurrent coroutines? So basically 15 concurrent requests sent per second? This setup would also be great for my use case, since I have to handle retry management for each bucket of 15 transactions. – orville Nov 26 '22 at 19:34
  • A semaphore is good for limiting to 15 currently active at a given time: as one completes another will start, so there will always be 15. 15 per second is a different thing. You need to be really clear on how the limit is actually applied. – jwal Nov 26 '22 at 19:40
  • Yes, the limit is 15 transactions per second (TPS) – orville Nov 26 '22 at 19:50
  • You can create a dispatch coroutine that pushes 15 tasks into an `asyncio.Queue` every second and a worker processing the queue, either with or without additional limiting by a semaphore. You can set the queue size to 15 so that if the worker is slower, the dispatcher cannot add to the queue. If the queue empties faster, this throttles at 15 per second; if slower, it throttles at 15 active at one time. – jwal Nov 26 '22 at 21:35
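A minimal sketch of that dispatcher/worker idea, assuming a hypothetical `analyze_async` coroutine in place of the real Form Recognizer call (the retry here is deliberately naive: no backoff, no retry cap):

```python
import asyncio

TPS = 15  # Form Recognizer default transactions-per-second limit

async def analyze_async(doc):
    """Placeholder; replace with the real Form Recognizer call."""
    await asyncio.sleep(0)
    return doc

async def dispatch(queue: asyncio.Queue, documents: list) -> None:
    """Push at most TPS documents into the queue per second."""
    for i in range(0, len(documents), TPS):
        for doc in documents[i:i + TPS]:
            await queue.put(doc)  # blocks while the queue is full
        await asyncio.sleep(1)    # next batch no sooner than one second later

async def worker(queue: asyncio.Queue, results: list) -> None:
    """Drain the queue; naively re-enqueue a document when its call fails."""
    while True:
        doc = await queue.get()
        try:
            results.append(await analyze_async(doc))
        except Exception:
            await queue.put(doc)  # naive retry: no backoff, no retry cap
        finally:
            queue.task_done()

async def process(documents: list) -> list:
    queue: asyncio.Queue = asyncio.Queue(maxsize=TPS)
    results: list = []
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(TPS)]
    await dispatch(queue, documents)
    await queue.join()  # every put has been matched by a task_done
    for w in workers:
        w.cancel()
    return results
```

The bounded queue gives you both throttles at once, as described above: dispatch is capped at 15 puts per second, and the `maxsize=TPS` queue prevents it from running ahead of slow workers.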

1 Answer


We need to enable CORS and configure it so that the service can access this heavy workload.

Follow this procedure to implement the heavy workload in Form Recognizer.

Choose page blobs here for higher performance.

Redundancy is also required; choose ZRS for a better implementation.

Create a storage account to upload the files.

Go to CORS and add the required URL: set the Allowed origins to https://formrecognizer.appliedai.azure.com

Go to the containers and upload the documents. Use the container and blob information as the input for the recognizer. If you work from Form Recognizer Studio, the total size of the documents is considered and there is also a limit on the number of characters, so it is suggested to use Python code with the created container as the input folder.
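As a sketch of that last step, this assumes the `azure-ai-formrecognizer` v3 client and its `DocumentAnalysisClient.begin_analyze_document_from_url` method; the account, container, endpoint, and key values are placeholders you must supply:

```python
def blob_url(account: str, container: str, blob_name: str) -> str:
    """URL of a blob in the container (assumes public or SAS-appended access)."""
    return f"https://{account}.blob.core.windows.net/{container}/{blob_name}"

def analyze_blob(endpoint: str, key: str, document_url: str):
    """Run the prebuilt-document model on one blob URL and return the result."""
    # azure-ai-formrecognizer >= 3.2; imported here so the URL helper
    # above stays usable without the SDK installed
    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential

    client = DocumentAnalysisClient(endpoint, AzureKeyCredential(key))
    poller = client.begin_analyze_document_from_url("prebuilt-document", document_url)
    return poller.result()
```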

Sairam Tadepalli