5

I am training a model using Google's Document AI. The training fails with the following error (I have included only a part of the JSON file for simplicity but the error is identical for all documents in my dataset):

"trainingDatasetValidation": {
      "documentErrors": [
        {
          "code": 3,
          "message": "Invalid document.",
          "details": [
            {
              "@type": "type.googleapis.com/google.rpc.ErrorInfo",
              "reason": "INVALID_DOCUMENT",
              "domain": "documentai.googleapis.com",
              "metadata": {
                "num_fields": "0",
                "num_fields_needed": "1",
                "document": "5e88c5e4cc05ddb8.json",
                "annotation_name": "INCOME_ADJUSTMENTS",
                "field_name": "entities.text_anchor.text_segments"
              }
            }
          ]
        }

What I understand from this error is that the model expects the field INCOME_ADJUSTMENTS to appear (at least) once in the document but instead, it finds zero instances of it.

That would have been understandable except I have already defined the field INCOME_ADJUSTMENTS in my schema as "Optional Once", i.e., this field can appear either zero or one time.

[Image: list of schema options]

Am I missing something? Why does this error persist despite the fact that it is addressed in the schema?

p.s. I have also tried "Optional multiple" (and "Required once" and "Required multiple") and the error persists.

EDIT: As requested, here's what one of the JSON files looks like. Note that there is no PII here as the details (name, SSN, etc.) are synthetic data.

Aventinus
  • You're right, this doesn't make sense, since that field shouldn't need to be present in the documents with `OPTIONAL_ONCE` set. Would you be able to share a specific Document.json file that caused this issue? (With any PII redacted) – Holt Skinner Jan 13 '23 at 16:02
  • @HoltSkinner Thank you. I have edited the original post to include a link to one of the JSON files. – Aventinus Jan 13 '23 at 16:26
  • Thanks, I'm looking this over with members of the product team to investigate. – Holt Skinner Jan 13 '23 at 16:39
  • OK, a further question on this. It looks like the error occurs because the field `Entity.textAnchor.textSegments` isn't populated in the Document.json for the INCOME_ADJUSTMENTS field. I'm not sure why it isn't populated, since it is filled out for most other fields. What type of data should INCOME_ADJUSTMENTS be? Is it plain text, money, checkbox, etc.? And how did you create these Document.json files? Was it in the Workbench labeling tool, Human in the Loop, or something else? – Holt Skinner Jan 31 '23 at 16:54
  • @HoltSkinner Same issue as OP: documents uploaded to the Workbench and labelled using the online tool, mostly avoiding the text selection tool due to irregular document formatting. After seeing the error the first time, I changed some fields from "Required once" to "Optional once" and saved. The schema editor warns that the change will be applied to the existing documents, but it does not appear to do so. – Stephen Jan 31 '23 at 21:48
  • @HoltSkinner I have much the same issue. What bothers me most is that about 2 weeks ago I had 0 conflicts. Then I added about 50-80 images, and now I have 140 "invalid documents" conflicts, which is crazy to me. I really think GCP has pushed a bug with their new UI. – Andrei Tulbure Feb 01 '23 at 08:42
  • @AndreiTulbure My team and I are of the same mind. The error I posted in this thread was only the first one. After that, this behavior persisted in several "sanity check" experiments that we ran. In a nutshell, we would get errors that made no sense, even when we used data that previously worked (e.g., we would export a labeled dataset to another processor and the new processor would not train even though the first processor did). – Aventinus Feb 01 '23 at 09:12
  • @Aventinus Yes, I had the same experience. They also updated the UI last weekend, and I think they may not have properly tested some parts of it. Sometimes, when I try to correct labels (e.g., Doc AI got one wrong and I want to fix it manually), I open the web console in my browser and see this error: "m=b:1006 ERROR TypeError: Cannot read properties of null (reading 'getAttribute') at DO.attr". That is also weird to me, so I think they may have cut some corners on testing the web app. – Andrei Tulbure Feb 01 '23 at 09:32
  • @Aventinus Any updates on your side? Mine is still messed up. – Andrei Tulbure Feb 06 '23 at 11:45
  • @AndreiTulbure No updates, unfortunately. – Aventinus Feb 06 '23 at 16:28
  • I have submitted several support tickets. Since their UI update I have encountered 3 pretty serious bugs, and sadly I have had no serious response to any of them. – Andrei Tulbure Feb 06 '23 at 16:56
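
Following Holt Skinner's comment that `Entity.textAnchor.textSegments` is not populated for the failing label, one way to find the affected labels before training is to scan each exported Document.json for entities whose text anchor has no segments. This is only a minimal sketch: the function name is made up for illustration, the sample data is invented, and it assumes the file uses the camelCase JSON keys of the Document format (`entities`, `textAnchor`, `textSegments`, `properties`). Treat every hit as a candidate to inspect in the labeling UI, not as a confirmed error.

```python
from typing import Any, Iterator


def entities_missing_text_segments(
    document: dict[str, Any],
) -> Iterator[tuple[str, str]]:
    """Yield (entity type, JSON path) for each entity without text segments.

    Hypothetical helper; assumes camelCase Document JSON keys.
    """

    def walk(entities: Any, prefix: str) -> Iterator[tuple[str, str]]:
        for index, entity in enumerate(entities or []):
            path = f"{prefix}[{index}]"
            anchor = entity.get("textAnchor") or {}
            if not anchor.get("textSegments"):
                yield entity.get("type", "<unknown>"), path
            # Child labels are nested under "properties", so recurse into them.
            yield from walk(entity.get("properties"), f"{path}.properties")

    yield from walk(document.get("entities"), "entities")


# Invented sample data, shaped like a labeled Document.json.
sample = {
    "entities": [
        {
            "type": "NAME",
            "textAnchor": {"textSegments": [{"startIndex": "0", "endIndex": "4"}]},
        },
        {"type": "INCOME_ADJUSTMENTS", "textAnchor": {}},
    ]
}
print(list(entities_missing_text_segments(sample)))
# prints [('INCOME_ADJUSTMENTS', 'entities[1]')]
```

To check a real file, load it with `json.load` and pass the resulting dict to the function. Because the walk descends into `properties`, it also reports child labels, which is the case Stephen describes in the comments under the first answer.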

4 Answers

2

I had the same issue in the past and am seeing it again right now.

What worked for me was taking the document name from the error message and searching for the corresponding image in the Storage bucket that holds the dataset.

Then I opened the image and looked for it in my dataset of 1000+ images.

Then I deleted the bounding box for the label with the issue and relabeled it. This seemed to solve 90% of the issues I had.

It's a ton of manual work, and I wish Google had put more thought into the web app for Doc AI when they released it, because the ML part is great but the app is really lackluster.

I would also be very happy to hear of any other fixes.

EDIT: Another, quicker workaround I have found is deleting the latest revision of the labeled documents from the dataset in Cloud Storage: take the faulty document name from the operation JSON dump, search for it in documents/, and delete the latest revision.

This will probably mess up the labeling and make you lose work, but it's a quick fix to at least make some progress if you want.
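
With hundreds of failing documents, pulling the names out of the operation JSON dump by hand gets tedious. The sketch below collects the document name, label, and field from each error. It assumes the dump has the same shape as the excerpt in the question (`documentErrors` → `details` → `metadata`); the function name is made up for illustration, and the rest of the operation JSON is not shown in the question, so only the value of `trainingDatasetValidation` is handled here.

```python
from typing import Any


def failing_documents(validation: dict[str, Any]) -> list[dict[str, str]]:
    """Collect document, label, and field name from each validation error.

    Hypothetical helper; `validation` is the value of the
    "trainingDatasetValidation" key, shaped like the excerpt in the question.
    """
    rows = []
    for error in validation.get("documentErrors", []):
        for detail in error.get("details", []):
            metadata = detail.get("metadata", {})
            if "document" in metadata:
                rows.append(
                    {
                        "document": metadata["document"],
                        "annotation": metadata.get("annotation_name", ""),
                        "field": metadata.get("field_name", ""),
                    }
                )
    return rows


# The single error quoted in the question, reduced to the keys used above.
validation = {
    "documentErrors": [
        {
            "code": 3,
            "message": "Invalid document.",
            "details": [
                {
                    "reason": "INVALID_DOCUMENT",
                    "metadata": {
                        "document": "5e88c5e4cc05ddb8.json",
                        "annotation_name": "INCOME_ADJUSTMENTS",
                        "field_name": "entities.text_anchor.text_segments",
                    },
                }
            ],
        }
    ]
}
for row in failing_documents(validation):
    print(row["document"], row["annotation"], row["field"])
```

The resulting list can then be used to look up each file in the bucket instead of reading the error dump one entry at a time.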

  • A better solution than deleting everything, but in my case the label exists in a child element, so it's not clear which of the child elements is to blame. I'm wondering *when* the schema's requirements are applied and why the edit would affect those rules. – Stephen Feb 01 '23 at 03:13
  • Yeah, what bothers me most about this problem is that about 2 weeks ago I had 0 conflicts. Then I added about 50-80 images, and now I have 140 "invalid documents" conflicts, which is crazy to me. I really think GCP has pushed a bug with their new UI. – Andrei Tulbure Feb 01 '23 at 08:41
  • The product development team is working on a fix for this issue. The problem is that the Labeling UI allows creation of invalid document labels in certain edge cases. – Holt Skinner Feb 01 '23 at 17:11
  • @HoltSkinner I have checked my labels, and they have proper values. Still, I'm getting this error while training. – Nikhil Pareek Feb 02 '23 at 13:32
  • @HoltSkinner Any ETA for the fix? :)) – Andrei Tulbure Feb 02 '23 at 15:08
0

I had this problem with "internal error" when I had bounding boxes that intersected each other. Check your labels and remove any whose boxes cross each other. Unfortunately, the error gives no hint as to which document has the problem, so you might have to scroll through them all.

I also had some bounding boxes on empty fields. I do not know whether they contributed to the error, but I removed them along with the intersecting boxes.

After this, I could run the training process without errors.
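
Since the error does not name the document, the search for intersecting boxes can be partly automated over the exported Document.json files. This is only a rough sketch: the function names are made up for illustration, and it assumes each labeled entity stores its box under `pageAnchor.pageRefs[].boundingPoly.normalizedVertices` with camelCase keys. It compares axis-aligned bounding rectangles of top-level entities only, so treat the result as a list of candidates to inspect by hand.

```python
from itertools import combinations
from typing import Any

Box = tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def entity_boxes(document: dict[str, Any]) -> list[tuple[str, int, Box]]:
    """Return (entity type, page, box) for each top-level entity box."""
    boxes = []
    for entity in document.get("entities", []):
        page_refs = (entity.get("pageAnchor") or {}).get("pageRefs", [])
        for ref in page_refs:
            poly = ref.get("boundingPoly") or {}
            vertices = poly.get("normalizedVertices", [])
            if not vertices:
                continue
            # Zero-valued coordinates may be omitted in JSON, so default to 0.
            xs = [vertex.get("x", 0.0) for vertex in vertices]
            ys = [vertex.get("y", 0.0) for vertex in vertices]
            page = int(ref.get("page", 0))
            box = (min(xs), min(ys), max(xs), max(ys))
            boxes.append((entity.get("type", "<unknown>"), page, box))
    return boxes


def overlapping_pairs(document: dict[str, Any]) -> list[tuple[str, str]]:
    """Return pairs of entity types whose rectangles overlap on one page."""
    pairs = []
    for (type_a, page_a, a), (type_b, page_b, b) in combinations(
        entity_boxes(document), 2
    ):
        overlaps = a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]
        if page_a == page_b and overlaps:
            pairs.append((type_a, type_b))
    return pairs
```

Running `overlapping_pairs` on each file loaded with `json.load` narrows down which documents to open in the labeling tool. Boxes that merely touch at an edge are not reported, because the comparison uses strict inequalities.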

  • How did you find them? Didn't you encounter an internal error? – Andrei Tulbure Feb 14 '23 at 08:28
  • As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Feb 16 '23 at 00:38
0

This bug may have been fixed. I am now seeing an error message "Cannot create labels with empty values" on images that previously generated the error.

Stephen
-1

I had the same problem, so I deleted my whole dataset, then imported and re-labeled it again. After that, the training worked fine.

heryk2232
  • Re-importing and re-labeling from scratch cannot possibly be considered a fix. What if I have already manually labeled hundreds/thousands of documents? This is days of work. – Aventinus Jan 25 '23 at 11:56
  • Yes, in my case the dataset was just 20 documents, so I could do it, but with hundreds/thousands it's better to find another way. – heryk2232 Jan 26 '23 at 06:28
  • Your answer could be improved with additional supporting information. Please [edit] to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Jan 30 '23 at 10:58