
I have successfully sent a PDF file stored in GCS to the Document AI v1beta2 API. In the v1beta3 API, however, the file approach is no longer supported; it requires me to send the content in the JSON body. Here is the documentation I am following: https://cloud.google.com/document-ai/docs/form-parser#v1beta3

Some questions:

  1. What, if anything, do I have to do to the PDF content returned from a GET request? The PDF content appears to be a base64 string, which is what the API requires.

  2. Looking at the API request, do you see anything incorrect?

REQUEST INFORMATION
ID: N/A
Method: POST
URL/Path: https://us-documentai.googleapis.com/v1beta3/projects/38072577434/locations/us/processors/cd8a06d0cd3cb045:process
Headers: Content-Type: application/json, Accept: application/json
Authorization: :censored:6:c2dc31949c: :censored:179:27504afa53:
Params: N/A

Data:
{"document":{"mimeType":"application/pdf","content":["%PDF-1.4\n1 0 obj\n<<\n [... raw, unencoded PDF data omitted ...]
Here is the error I am receiving:
{
  "error": {
    "code": 400,
    "message": "Invalid JSON payload received. Unknown name \"content\" at 'document': Proto field is not repeating, cannot start list.",
    "status": "INVALID_ARGUMENT",
    "details": [
      {
        "@type": "type.googleapis.com/google.rpc.BadRequest",
        "fieldViolations": [
          {
            "field": "document",
            "description": "Invalid JSON payload received. Unknown name \"content\" at 'document': Proto field is not repeating, cannot start list."
          }
        ]
      }
    ]
  }
}

2021-01-05: Adding code to show how the encoding is performed:

//
// Function to call each URL in an array of URLs
//
const requestAsync = function (url) {
  return z.request(url).then((response) => response.content);
};
//
// Create the array of URLs to call
//
var urlArr = [];
const urls = {
  url: 'https://storage.googleapis.com/cloud-samples-data/documentai/loan_form.pdf',
  method: 'GET',
  headers: {
    'Accept': 'application/pdf',
    'raw': true
  }
};
urlArr.push(urls);
//
// Call the function for each item in urlArr
//
return Promise.all(urlArr.map(requestAsync))
  .then(function (values) {
    //
    // Convert the file data to a Buffer and base64 encode it.
    //
    var fileContent = Buffer.from(values[0]).toString('base64');

    const options = {
      url: 'https://us-documentai.googleapis.com/v1beta3/projects/38072577434/locations/us/processors/cd8a06d0cd3cb045:process',
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'Accept': 'application/json',
        'Authorization': `Bearer ${bundle.authData.access_token}`
      },
      body: {
        document: {
          mimeType: 'application/pdf',
          content: fileContent
        }
      }
    };
    return z.request(options)
      .then((response) => {
        response.throwForStatus();
        const result = response.json;
        // Get all of the document text as one big string
        const {text} = result;
        // Extract shards from the text field
        const getText = textAnchor => {
          // First shard in document doesn't have startIndex property
          const startIndex = textAnchor.textSegments[0].startIndex || 0;
          const endIndex = textAnchor.textSegments[0].endIndex;
          return text.substring(startIndex, endIndex);
        };
        /*
        // Process the output
        const [page1] = result.pages;
        const {formFields} = page1;
        var fieldList = "";
        for (const field of formFields) {
          var fieldName = getText(field.fieldName.textAnchor);
          var fieldValue = getText(field.fieldValue.textAnchor);
          fieldName = fieldName.replace(/\n/g,'');
          fieldValue = fieldValue.replace(/\n/g,'');
          fieldList += `"${fieldName}": "${fieldValue}"`;
          z.console.log(`\t(${fieldName}, ${fieldValue})`);
        }
        */
        //z.console.log(fieldList)
        return {getText};
      });
  });
David

2 Answers


It looks like the "content" in your request is not base64-encoded. On Linux, you can use the base64 command:

base64 your_pdf_to_use.pdf > base64_of_your_pdf.txt

Or you can use any base64 converter. I found an online PDF to base64 converter, and it works for me as well.

When checking the base64 output, it should not contain any recognizable text or words. I tried using the sample file from the Document AI quickstart. Here is a snippet of the base64 output:

JVBERi0xLjUKJb/3ov4KMiAwIG9iago8PCAvTGluZWFyaXplZCAxIC9MIDI5MDUxIC9IIFsgNzky IDEzNCBdIC9PIDYgL0UgMjg3NzYgL04gMSAvVCAyODc3NSA+PgplbmRvYmoKICAgICAgICAgICAg ICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAg ICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAKMyAwIG9iago8PCAv VHlwZSAvWFJlZiAvTGVuZ3RoIDcwIC9GaWx0ZXIgL0ZsYXRlRGVjb2RlIC9EZWNvZGVQYXJtcyA8 PCAvQ29sdW1ucyA0IC9QcmVkaWN0b3IgMTIgPj4gL1cgWyAxIDIgMSBdIC9JbmRleCBbIDIgMzAg XSAvSW5mbyAxNyAwIFIgL1Jvb3QgNCAwIFIgL1NpemUgMzIgL1ByZXYgMjg3NzYgICAgICAgICAg ICAgICAgIC9JRCBbPGFiYjQ5MjJhYTY5N2NmZDJiODVjYjY5YjNhZGI4MDZmPjxhYmI0OTIyYWE2 OTdjZmQyYjg1Y2I2OWIzYWRiODA2Zj5dID4+CnN0cmVhbQp4nGNiZOBnYGJgOAkkmPiABKMRiNsG YjEACcHDQELhCEhWBkiICYIkpgEJ9ocgliGQEAErrmBgYpwqAdLLwEgxAQD5KwddCmVuZHN0cmVh bQplbmRvYmoKICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAg ICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAg ICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAg....
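
One way to sanity-check a "content" value before sending it is to decode it and look at the first bytes, since every PDF file starts with `%PDF-`. The helper below is only a minimal sketch: the name `looksLikeBase64Pdf` is made up for illustration and is not part of any Document AI library.

```javascript
// Hypothetical helper: checks whether a value is base64 text that decodes to a PDF.
function looksLikeBase64Pdf(content) {
  if (typeof content !== 'string') return false;
  // Command-line base64 tools may wrap their output, so drop whitespace first.
  const compact = content.replace(/\s+/g, '');
  // Only the base64 alphabet, with optional padding at the end, is allowed.
  if (compact.length === 0 || !/^[A-Za-z0-9+/]+={0,2}$/.test(compact)) return false;
  // Every PDF file starts with the ASCII bytes "%PDF-".
  const header = Buffer.from(compact, 'base64').slice(0, 5).toString('latin1');
  return header === '%PDF-';
}

console.log(looksLikeBase64Pdf('JVBERi0xLjUK'));      // true: base64 of "%PDF-1.5\n"
console.log(looksLikeBase64Pdf('%PDF-1.4\n1 0 obj')); // false: raw PDF text, not base64
```

A raw PDF body fails the check because characters such as `%` and spaces are outside the base64 alphabet.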

A snippet of my request.json:

{
  "document": {
    "mimeType": "application/pdf",
    "content": "JVBERi0xLjUKJb/3ov4KMiAwIG9iago8PCAvTGluZWFyaXplZCAxIC9MIDI5MDUxIC9IIFsgNzky
IDEzNCBdIC9PIDYgL0UgMjg3NzYgL04gMSAvVCAyODc3NSA+PgplbmRvYmoKICAgICAgICAgICAg
ICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAg
ICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAKMyAwIG9iago8PCAv
VHlwZSAvWFJlZiAvTGVuZ3RoIDcwIC9GaWx0ZXIgL0ZsYXRlRGVjb2RlIC9EZWNvZGVQYXJtcyA8
PCAvQ29sdW1ucyA0IC9QcmVkaWN0b3IgMTIgPj4gL1cgWyAxIDIgMSBdIC9JbmRleCBbIDIgMzAg
XSAvSW5mbyAxNyAwIFIgL1Jvb3QgNCAwIFIgL1NpemUgMzIgL1ByZXYgMjg3NzYgICAgICAgICAg
ICAgICAgIC9JRCBbPGFiYjQ5MjJhYTY5N2NmZDJiODVjYjY5YjNhZGI4MDZmPjxhYmI0OTIyYWE2
OTdjZmQyYjg1Y2I2OWIzYWRiODA2Zj5dID4+CnN0cmVhbQp4nGNiZOBnYGJgOAkkmPiABKMRiNsG...
}
}
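
The same body can be built in Node.js. This is a rough sketch: `buildProcessRequest` is a hypothetical helper name, and it only uses the `document.mimeType` and `document.content` fields shown above. Note that `content` is one base64 string; the 400 error in the question ("Proto field is not repeating, cannot start list") is what the API returned when `content` was sent as a JSON array.

```javascript
// Hypothetical helper: wraps raw PDF bytes in the JSON body shape shown above.
function buildProcessRequest(pdfBytes) {
  if (!Buffer.isBuffer(pdfBytes)) {
    // A string here usually means the bytes were already decoded as text somewhere.
    throw new TypeError('pdfBytes must be a Buffer holding the raw file bytes');
  }
  return {
    document: {
      mimeType: 'application/pdf',
      // A single base64 string, not an array.
      content: pdfBytes.toString('base64'),
    },
  };
}

// Usage with a few stand-in bytes instead of a real file:
const body = buildProcessRequest(Buffer.from('%PDF-1.5\n', 'latin1'));
console.log(JSON.stringify(body));
```

Rejecting anything that is not a Buffer is a deliberate choice in this sketch: it surfaces an accidental text decode early instead of sending a corrupted file.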

Curl request:

curl -X POST -H "Authorization: Bearer "$(gcloud auth application-default print-access-token) -H "Content-Type: application/json; charset=utf-8" -d @request.json https://us-documentai.googleapis.com/v1beta3/projects/xxxxxxx/locations/us/processors/xxxxxx:process > result.json

Here is a snippet of the output when I used the file from the quickstart with the form parser endpoint:

[image: snippet of the form parser output]

EDIT: 20210106

I tried accessing the file using GET, and I got the base64 value cleanly using your current request in urls. However, I found an SO post about converting files to base64, which says to:

add encoding: null on request options so that you will surely receive a Buffer and not a String

Adding encoding: null worked for me as well. It is worth a shot.
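
Why the Buffer matters can be shown without any network call. A PDF contains bytes that are not valid UTF-8, and when a body is decoded to a string, each such byte is replaced by U+FFFD, which re-encodes as the three bytes EF BF BD. The sketch below uses six bytes from the start of the sample PDF (a `%`, four binary bytes, and a newline); the variable names are illustrative only.

```javascript
// Six bytes as they appear near the start of the sample PDF: "%", four binary bytes, "\n".
const original = Buffer.from([0x25, 0xbf, 0xf7, 0xa2, 0xfe, 0x0a]);

// Encoding the Buffer directly keeps every byte.
const good = original.toString('base64');

// Decoding to a UTF-8 string first (what happens when a binary body is treated
// as text) replaces each invalid byte with U+FFFD before the base64 step.
const asText = original.toString('utf8');
const bad = Buffer.from(asText, 'utf8').toString('base64');

console.log(good); // Jb/3ov4K
console.log(bad);  // differs from `good`; decodes to runs of EF BF BD
console.log(Buffer.from(bad, 'base64').length > original.length); // true
```

A repeating four-character group such as `77+9` or `ve+/` in the base64 output is a typical sign that this has happened, since those groups are how consecutive EF BF BD sequences encode.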

Here is a snippet of my code for GET and encode to base64:

    const request_img = require('request');

    // Request options; encoding: null makes `request` return the body as a Buffer.
    const urls = {
      url: 'https://storage.googleapis.com/cloud-samples-data/documentai/loan_form.pdf',
      method: 'GET',
      encoding: null,
      headers: {
        'Accept': 'application/pdf',
        'raw': true
      }
    };
    var urlArr = [];
    urlArr.push(urls);

    request_img(urlArr[0], function (err, res, body) {
      if (err) throw err;
      // body is already a Buffer, so this encodes the raw bytes.
      var converted_to_base64 = Buffer.from(body).toString('base64');
      console.log(converted_to_base64);
    });

Here is the snippet of the output. I got the file encoded to base64:

JVBERi0xLjUKJb/3ov4KMiAwIG9iago8PCAvTGluZWFyaXplZCAxIC9MIDI5MDUxIC9IIFsgNzkyIDEzNCBdIC9PIDYgL0UgMjg3NzYgL04gMSAvVCAyODc3NSA+PgplbmRvYmoKICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAKMyAwIG9iago8PCAvVHlwZSAvWFJlZiAvTGVuZ3RoIDcwIC9GaWx0ZXIgL0ZsYXRlRGVjb2RlIC9EZWNvZGVQYXJtcyA8PCAvQ29sdW1ucyA0IC9QcmVkaWN0b3IgMTIgPj4gL1cgWyAxIDIgMSBdIC9JbmRleCBbIDIgMzAgXSAvSW5mbyAxNyAwIFIgL1Jvb3QgNCAwIFIgL1NpemUgMzIgL1ByZXYgMjg3NzY

By the way, my Node.js version is v10.14.2.
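
The `request` package is not the only way to get a Buffer. As a sketch, assuming only Node's built-in `https` module is available, the response chunks can be collected as Buffers and concatenated before encoding. `downloadAsBase64` and its injectable `get` parameter are hypothetical names, added here so the logic can be exercised without a network call.

```javascript
const https = require('https');

// Hypothetical helper: downloads a URL and passes its body, base64-encoded, to callback.
// The `get` parameter defaults to https.get and exists only to allow stubbing.
function downloadAsBase64(url, callback, get = https.get) {
  const req = get(url, (res) => {
    if (res.statusCode !== 200) {
      res.resume(); // discard the body so the socket is released
      callback(new Error('Unexpected status code: ' + res.statusCode));
      return;
    }
    const chunks = [];
    // No res.setEncoding() call, so each chunk arrives as a Buffer, not a string.
    res.on('data', (chunk) => chunks.push(chunk));
    res.on('end', () => callback(null, Buffer.concat(chunks).toString('base64')));
    res.on('error', callback);
  });
  // Guard so that a stub that returns nothing still works.
  if (req && typeof req.on === 'function') req.on('error', callback);
}
```

Calling `downloadAsBase64(urls.url, (err, b64) => console.log(err || b64))` with the sample URL from the snippet above should then print the same kind of base64 string.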

Ricco D
  • That is what I needed. Found out that I was not converting to the base64 string...Did this and I am getting the Document AI response correctly. – David Jan 04 '21 at 14:27
  • Well, I spoke too soon... When I take the sample PDF file from the Document AI quickstart, convert it to base64 using the pdf to base64 converter, and copy/paste the base64 content directly into the Body, the response returns as you have shown. However, when I retrieve the sample file using a GET and convert it using `var fileContent = new Buffer(values[0]).toString('base64');`, I continue to receive the same error message: `{ "error": { "code": 400, "message": "Unsupported input file format.", "status": "INVALID_ARGUMENT" } }` – David Jan 04 '21 at 15:11
  • @David Are you using NodeJS? If you are using NodeJS, the quick start has a sample implementation on how to encode your file to base64 in [NodeJS](https://cloud.google.com/document-ai/docs/form-parser#documentai_process_document-nodejs). This line in particular `const encodedImage = Buffer.from(imageFile).toString('base64');`. variable imageFile being `const filePath = '/path/to/local/pdf'; const fs = require('fs').promises;const imageFile = await fs.readFile(filePath);` This worked for me as well and I was able to print the base64 encoded value which is stored in `encodedImage`. – Ricco D Jan 05 '21 at 05:35
  • But if you are not using NodeJS, can you edit your original post to include your code snippet on how you are sending a request? – Ricco D Jan 05 '21 at 05:35
  • I have added the NodeJS code so you can see that I am encoding the PDF the same way as the example from the Document AI Quickstart. The flow of my code is: 1. Get the PDF document using a GET call 2. Convert the PDF from the GET to Base64 3. Use the encoded PDF to call the Document AI API 4. Parse the response into the Key Value Pairs – David Jan 05 '21 at 12:09
  • What I think is happening is that the encoding of a file is somehow different from the encoding of the GET response. I sent the encoded PDF to an endpoint where I could review the Base64 string and found it is different from the one converted using "pdf to base64 converter". When the PDF is returned from the GET, what is the encoding? My assumption is UTF-8. – David Jan 05 '21 at 12:14
  • Here is an example of the encoded PDF when sent to an endpoint where I can view the Body: `{ "document": { "mimeType": "application/pdf", "content": "JVBERi0xLjUKJe+/ve+/ve+/ve+/vQoyIDAgb2JqCjw8IC9MaW5lYXJpemVkIDEgL0w==" } }` – David Jan 05 '21 at 12:30
  • Here is the encoded PDF from "pdf to base64 converter": `JVBERi0xLjUKJb/3ov4KMiAwIG9iago8PCAvTGluZWFyaXplZCAxIC9MIDI5MDUxIC9IIFsgNzkyID` – David Jan 05 '21 at 12:31
  • As you can see the encoding is different...any ideas? – David Jan 05 '21 at 12:32
  • @David I edited my answer to address your question about encoding to base64. – Ricco D Jan 06 '21 at 06:31

The Document AI documentation has been updated to include base64 encoding in the Node.js samples:

https://cloud.google.com/document-ai/docs/process-documents-client-libraries#client-libraries-usage-nodejs

You can also check out this Codelab for the Form Parser using Node.js. Most of the actual processing request will be the same for every processor.

https://codelabs.developers.google.com/codelabs/docai-form-parser-node#7

Holt Skinner