
I've been trying to get Document AI batch submission working and having some difficulty. I have single file submission working using RawDocument and suppose I could just iterate over my data set (27k images) but chose batch since it seems like the more appropriate technique.

When I run my code I see the error "Failed to process all documents" (gRPC status code 3, i.e. INVALID_ARGUMENT). The first few lines of the serialized debug output are:

    O:17:"Google\Rpc\Status":5:{s:7:"*code";i:3;s:10:"*message";s:32:"Failed to process all documents.";s:26:"Google\Rpc\Statusdetails";O:38:"Google\Protobuf\Internal\RepeatedField":4:{s:49:"Google\Protobuf\Internal\RepeatedFieldcontainer";a:0:{}s:44:"Google\Protobuf\Internal\RepeatedFieldtype";i:11;s:45:"Google\Protobuf\Internal\RepeatedFieldklass";s:19:"Google\Protobuf\Any";s:52:"Google\Protobuf\Internal\RepeatedFieldlegacy_klass";s:19:"Google\Protobuf\Any";}s:38:"Google\Protobuf\Internal\Messagedesc";O:35:"Google\Protobuf\Internal\Descriptor":13:{s:46:"Google\Protobuf\Internal\Descriptorfull_name";s:17:"google.rpc.Status";s:42:"Google\Protobuf\Internal\Descriptorfield";a:3:{i:1;O:40:"Google\Protobuf\Internal\FieldDescriptor":14:{s:46:"Google\Protobuf\Internal\FieldDescriptorname";s:4:"code";

The support page for this error states that the reason is:

The gcsUriPrefix and gcsOutputConfig.gcsUri parameters need to begin with gs:// and end with a trailing backslash character (/). Check the configuration for the Bucket URIs.

I am not using gcsUriPrefix (should I? My bucket contains more files than the maximum batch limit), but my gcsOutputConfig.gcsUri follows those rules. The file list I've provided gives file names (pointing at the right bucket), so those should not end with a trailing slash.

Advice welcome.

    function filesFromBucket( $directoryPrefix ) {
        // NOT recursive, does not search the structure
        $gcsDocumentList = [];
    
        // see https://cloud.google.com/storage/docs/samples/storage-list-files-with-prefix
        $bucketName = 'my-input-bucket';
        $storage = new StorageClient();
        $bucket = $storage->bucket($bucketName);
        $options = ['prefix' => $directoryPrefix];
        foreach ($bucket->objects($options) as $object) {
            $doc = new GcsDocument();
            $doc->setGcsUri('gs://'.$object->name());
            $doc->setMimeType($object->info()['contentType']);
            array_push( $gcsDocumentList, $doc );
        }
    
        $gcsDocuments = new GcsDocuments();
        $gcsDocuments->setDocuments($gcsDocumentList);
        return $gcsDocuments;
    }
    
    function batchJob ( ) {
        $inputConfig = new BatchDocumentsInputConfig( ['gcs_documents'=>filesFromBucket('the-bucket-path/')] );
    
        // see https://cloud.google.com/php/docs/reference/cloud-document-ai/latest/V1.DocumentOutputConfig
        // nb: all uri paths must end with / or an error will be generated.
        $outputConfig = new DocumentOutputConfig( 
            [ 'gcs_output_config' =>
                   new GcsOutputConfig( ['gcs_uri'=>'gs://my-output-bucket/'] ) ]
        );
     
        // see https://cloud.google.com/php/docs/reference/cloud-document-ai/latest/V1.DocumentProcessorServiceClient
        $documentProcessorServiceClient = new DocumentProcessorServiceClient();
        try {
            // derived from the prediction endpoint
            $name = 'projects/######/locations/us/processors/#######';
            $operationResponse = $documentProcessorServiceClient->batchProcessDocuments($name, ['inputDocuments'=>$inputConfig, 'documentOutputConfig'=>$outputConfig]);
            $operationResponse->pollUntilComplete();
            if ($operationResponse->operationSucceeded()) {
                $result = $operationResponse->getResult();
                printf('<br>result: %s<br>',serialize($result));
            // doSomethingWith($result)
            } else {
                $error = $operationResponse->getError();
                printf('<br>error: %s<br>', serialize($error));
                // handleError($error)
            }
        } finally {
            $documentProcessorServiceClient->close();
        }    
    }
Stephen

2 Answers


This turns out to be an ID-10-T error, with definite PEBKAC overtones.

$object->name() does not return the bucket name as part of the path.

Changing $doc->setGcsUri('gs://'.$object->name()); to $doc->setGcsUri('gs://'.$bucketName.'/'.$object->name()); resolves the issue.
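For anyone hitting the same thing: the object name returned by the Storage client is relative to the bucket (e.g. "dir1/dir2/0001.jpg"), so the full URI has to be assembled from the bucket name and the object name. A minimal sketch of the corrected loop (bucket name and prefix are placeholders):

```php
use Google\Cloud\Storage\StorageClient;
use Google\Cloud\DocumentAI\V1\GcsDocument;

$bucketName = 'my-input-bucket'; // placeholder
$storage = new StorageClient();
$bucket = $storage->bucket($bucketName);

$gcsDocumentList = [];
foreach ($bucket->objects(['prefix' => 'the-bucket-path/']) as $object) {
    $doc = new GcsDocument();
    // $object->name() does not include the bucket, so prepend it:
    // "gs://" + bucket + "/" + object name => gs://my-input-bucket/dir1/0001.jpg
    $doc->setGcsUri(sprintf('gs://%s/%s', $bucketName, $object->name()));
    $doc->setMimeType($object->info()['contentType']);
    $gcsDocumentList[] = $doc;
}
```

This runs only with valid Cloud Storage credentials; it is the same loop as in the question with the URI construction fixed.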

Stephen

Usually, the reason for the error "Failed to process all documents" is incorrect syntax for the input files or the output bucket, since an incorrectly formatted path can still be a "valid" Cloud Storage path, just not one pointing at the files you're expecting. (Thank you for checking the error messages page first!)

You don't have to use gcsUriPrefix if you're providing a list of specific documents to process. However, since your code adds every file under a GCS directory to the BatchDocumentsInputConfig.gcs_documents field anyway, it would make sense to send the prefix in the BatchDocumentsInputConfig.gcs_prefix field instead of a list of individual files.
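For reference, a prefix-based input config would look roughly like this (class names are from the Google\Cloud\DocumentAI\V1 namespace; the bucket and path are placeholders):

```php
use Google\Cloud\DocumentAI\V1\BatchDocumentsInputConfig;
use Google\Cloud\DocumentAI\V1\GcsPrefix;

// Process everything under gs://my-input-bucket/the-bucket-path/
// instead of enumerating individual files.
$gcsPrefix = new GcsPrefix();
$gcsPrefix->setGcsUriPrefix('gs://my-input-bucket/the-bucket-path/');

$inputConfig = new BatchDocumentsInputConfig([
    'gcs_prefix' => $gcsPrefix,
]);
```

Note that gcs_prefix and gcs_documents are alternatives: the input config takes one or the other, not both.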

Note: There is a maximum number of files (1000) that can be sent in an individual batch processing request, and specific processors have their own limits for pages.

https://cloud.google.com/document-ai/quotas#content_limits

You can try separating out the files into multiple batch requests to avoid hitting this limit. The Document AI Toolbox Python SDK has built-in functions for this, but you can try re-implementing this in PHP for your own use case. https://github.com/googleapis/python-documentai-toolbox/blob/ba354d8af85cbea0ad0cd2501e041f21e9e5d765/google/cloud/documentai_toolbox/utilities/gcs_utilities.py#L213
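In PHP, the splitting itself is simple with array_chunk; a sketch (the 1,000-file figure is the per-request limit from the quotas page, and the integer array stands in for the real GcsDocument objects):

```php
// Split a document list into batches that respect the
// 1,000-file-per-request limit before submitting each one.
$batchSize = 1000;

// Stand-in for the ~27k GcsDocument objects; any array chunks the same way.
$documents = range(1, 2500);

$batches = array_chunk($documents, $batchSize);
// 2500 documents -> 3 batches of 1000, 1000, and 500
foreach ($batches as $batch) {
    // Wrap each $batch in a GcsDocuments message and call
    // batchProcessDocuments() once per batch (omitted here).
}
```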

Holt Skinner
  • My intention is to limit the job size appropriately so that I don't have to pre-process my source files, but that is not in the current code; ergo, I'm not using gcs_uri_prefix. The output path is above, and the input paths look like "gs://mybucket/dir1/dir2/dir3/0001.jpg", so I'm not certain what could be wrong. The assumption I'm going on now is that using array_push to add documents to the array is not generating an index that the setDocuments() method is expecting, but I end up going down rabbit holes about "repeated fields", etc. When I run this with a single entry it works and I see the outputs. – Stephen Jun 01 '23 at 15:35