My aim is to reduce the size of the output JSON file, which by default includes the base64-encoded image sections of the documents.

I am using the Document AI Contract Processor in the US region with the Node.js SDK.

It is my understanding that setting the fieldMask attribute in a batchProcessDocuments request filters which properties appear in the resulting JSON. I want to keep only the entities property.

Here are my call parameters:

const documentai = require('@google-cloud/documentai').v1;
const client = new documentai.DocumentProcessorServiceClient(options);
let params = {
    "name": "projects/XXX/locations/us/processors/3e85a4841d13ce5",
    "region": "us",
    "inputDocuments": {
        "gcsDocuments": {
            "documents": [{
                "mimeType": "application/pdf",
                "gcsUri": "gs://bubble-bucket-XXX/files/CymbalContract.pdf"
            }]
        }
    },
    "documentOutputConfig": {
        "gcsOutputConfig": {
            "gcsUri": "gs://bubble-bucket-XXXX/ocr/"
        },
        "fieldMask": {
            "paths": [
                "entities"
            ]
        }
    }
};
client.batchProcessDocuments(params, function(error, operation) {
    if (error) {
        return reject(error);
    }
    return resolve({
        "operationName": operation.name
    });

});

However, the resulting JSON still contains the full set of data.

Am I missing something here?

redvivi

2 Answers

0

The auto-generated documentation for the Node.js client library is a little hard to follow, but it looks like the fieldMask should be a member of the gcsOutputConfig instead of the documentOutputConfig. (I'm surprised the API didn't throw an error.)

https://cloud.google.com/nodejs/docs/reference/documentai/latest/documentai/protos.google.cloud.documentai.v1.documentoutputconfig.gcsoutputconfig

The REST docs are a little clearer:

https://cloud.google.com/document-ai/docs/reference/rest/v1/DocumentOutputConfig#gcsoutputconfig


Note: For a REST API call and for other client libraries, the fieldMask is structured as a comma-separated string (e.g. text,entities,pages.pageNumber).

I haven't tried the string form with the Node.js client library, but I'd recommend trying it as well if moving the parameter doesn't work on its own.

https://cloud.google.com/document-ai/docs/send-request#async-processor
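To make the suggested change concrete, here is a minimal sketch of the request from the question with fieldMask moved inside gcsOutputConfig. The helper names buildBatchRequest and submitBatch are hypothetical, the resource names are the placeholders from the question, and the "region" key is omitted on the assumption that it is not part of the request message. This has not been verified against a live processor.

```javascript
// Sketch: build a batchProcessDocuments request where fieldMask is nested
// inside gcsOutputConfig rather than sitting directly under documentOutputConfig.
// buildBatchRequest is a hypothetical helper, not part of the SDK.
function buildBatchRequest({ processorName, inputUri, outputUri, paths }) {
  return {
    name: processorName,
    inputDocuments: {
      gcsDocuments: {
        documents: [{ mimeType: 'application/pdf', gcsUri: inputUri }],
      },
    },
    documentOutputConfig: {
      gcsOutputConfig: {
        gcsUri: outputUri,
        // Object form of a protobuf FieldMask, as used in the question.
        fieldMask: { paths },
      },
    },
  };
}

// Hypothetical wrapper: `client` is assumed to be a
// DocumentProcessorServiceClient from @google-cloud/documentai (v1).
async function submitBatch(client, request) {
  const [operation] = await client.batchProcessDocuments(request);
  return { operationName: operation.name };
}

// Placeholder values copied from the question.
const request = buildBatchRequest({
  processorName: 'projects/XXX/locations/us/processors/3e85a4841d13ce5',
  inputUri: 'gs://bubble-bucket-XXX/files/CymbalContract.pdf',
  outputUri: 'gs://bubble-bucket-XXXX/ocr/',
  paths: ['entities'],
});

console.log(JSON.stringify(request.documentOutputConfig, null, 2));
```

Calling submitBatch(client, request) with a real client would start the long-running operation; whether the output JSON then contains only entities still needs to be checked in the output bucket.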

Holt Skinner
  • Thanks @HoltSkinner. I'm not sure I'm reading the documentation correctly, but isn't `fieldMask` an object with the key `paths` whose value is an array of strings? That seems to contradict the documentation you quoted above. https://cloud.google.com/nodejs/docs/reference/iap/latest/iap/protos.google.protobuf.ifieldmask – redvivi Jan 18 '23 at 22:51
  • That documentation you linked is for Identity-Aware Proxy, and it may use a different format for Field Masks. – Holt Skinner Jan 20 '23 at 19:35
-1

You can set a field mask like this: fieldMask: { paths: ["pages.paragraphs", "text"] }
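The thread mentions two shapes for the mask: a comma-separated string (for REST) and an object with a paths array (as above). As a small illustration of how the two relate, the sketch below converts between them. Both function names are hypothetical helpers, not SDK functions.

```javascript
// Sketch: convert between the comma-separated string form of a field mask
// and the { paths: [...] } object form. Both helpers are hypothetical.
function maskStringToPaths(mask) {
  return mask
    .split(',')
    .map((path) => path.trim())
    .filter((path) => path.length > 0); // drop empty segments
}

function pathsToMaskString(paths) {
  return paths.join(',');
}

// String form from the REST note, turned into the object form.
const fieldMask = { paths: maskStringToPaths('text,entities,pages.pageNumber') };

// Object form from this answer, turned back into the string form.
const maskString = pathsToMaskString(['pages.paragraphs', 'text']);

console.log(fieldMask, maskString);
```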

Jin Ming