
I am creating an indexer that takes a document, runs the KeyPhraseExtractionSkill on it, and writes the output back to the index.

For many documents, this works out of the box. But for records longer than 50,000 characters, it does not work. OK, no problem; this limit is clearly stated in the docs.

What the docs suggest is to use the Text Split skill. So I use the Text Split skill to split the original document into pages, then pass each page to the KeyPhraseExtractionSkill. The results then need to be merged back, since we end up with an array of arrays of strings. Unfortunately, it seems that the Merge skill does not accept an array of arrays, just an array.

https://i.stack.imgur.com/8UmYj.png <- Link to the skillset hierarchy.

This is the error reported by Azure:

Required skill input was not of the expected type 'StringCollection'. Name: 'itemsToInsert', Source: '/document/content/pages/*/keyPhrases'. Expression language parsing issues:
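To make the shape problem concrete, here is a rough Python sketch of the pipeline described above. It is only a toy model: `split_into_pages` is a naive fixed-width splitter, `fake_key_phrases` is a hypothetical stand-in for the KeyPhraseExtractionSkill, and `is_string_collection` is an assumed approximation of the type check behind the error. None of it is Azure's actual code.

```python
from typing import Any, List


def split_into_pages(text: str, maximum_page_length: int) -> List[str]:
    """Naive fixed-width splitter (not the real Text Split algorithm)."""
    return [
        text[i:i + maximum_page_length]
        for i in range(0, len(text), maximum_page_length)
    ]


def fake_key_phrases(page: str) -> List[str]:
    """Hypothetical extractor: returns the unique capitalised words of a page."""
    seen: List[str] = []
    for word in page.split():
        cleaned = word.strip(".,")
        if cleaned[:1].isupper() and cleaned not in seen:
            seen.append(cleaned)
    return seen


def is_string_collection(value: Any) -> bool:
    """Assumed approximation of a 'StringCollection' check: a flat list of str."""
    return isinstance(value, list) and all(isinstance(v, str) for v in value)


text = "Azure Search indexes documents. Cognitive Services enrich them."
pages = split_into_pages(text, maximum_page_length=32)

# One list of phrases per page, so the overall result is a list of lists.
key_phrases_per_page = [fake_key_phrases(page) for page in pages]

print(key_phrases_per_page)                        # [['Azure', 'Search'], ['Cognitive', 'Services']]
print(is_string_collection(key_phrases_per_page))  # False
```

Each page's result is a valid string collection on its own; it is only the per-page nesting that fails the check.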

What I want to achieve at the end of the day is to run the KeyPhraseExtractionSkill on text longer than 50,000 characters and add the result back to the index.

JSON for the skillset:

{
  "@odata.context": "https://-----------.search.windows.net/$metadata#skillsets/$entity",
  "@odata.etag": "\"0x8D957466A2C1E47\"",
  "name": "devalbertcollectionfilesskillset2",
  "description": null,
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
      "name": "SplitSkill",
      "description": null,
      "context": "/document/content",
      "defaultLanguageCode": "en",
      "textSplitMode": "pages",
      "maximumPageLength": 1000,
      "inputs": [
        {
          "name": "text",
          "source": "/document/content"
        }
      ],
      "outputs": [
        {
          "name": "textItems",
          "targetName": "pages"
        }
      ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.EntityRecognitionSkill",
      "name": "EntityRecognitionSkill",
      "description": null,
      "context": "/document/content/pages/*",
      "categories": [
        "person",
        "quantity",
        "organization",
        "url",
        "email",
        "location",
        "datetime"
      ],
      "defaultLanguageCode": "en",
      "minimumPrecision": null,
      "includeTypelessEntities": null,
      "inputs": [
        {
          "name": "text",
          "source": "/document/content/pages/*"
        }
      ],
      "outputs": [
        {
          "name": "persons",
          "targetName": "people"
        },
        {
          "name": "organizations",
          "targetName": "organizations"
        },
        {
          "name": "entities",
          "targetName": "entities"
        },
        {
          "name": "locations",
          "targetName": "locations"
        }
      ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.KeyPhraseExtractionSkill",
      "name": "KeyPhraseExtractionSkill",
      "description": null,
      "context": "/document/content/pages/*",
      "defaultLanguageCode": "en",
      "maxKeyPhraseCount": null,
      "modelVersion": null,
      "inputs": [
        {
          "name": "text",
          "source": "/document/content/pages/*"
        }
      ],
      "outputs": [
        {
          "name": "keyPhrases",
          "targetName": "keyPhrases"
        }
      ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.MergeSkill",
      "name": "Merge Skill - keyPhrases",
      "description": null,
      "context": "/document",
      "insertPreTag": " ",
      "insertPostTag": " ",
      "inputs": [
        {
          "name": "itemsToInsert",
          "source": "/document/content/pages/*/keyPhrases"
        }
      ],
      "outputs": [
        {
          "name": "mergedText",
          "targetName": "keyPhrases"
        }
      ]
    }
  ],
  "cognitiveServices": {
    "@odata.type": "#Microsoft.Azure.Search.CognitiveServicesByKey",
    "key": "------",
    "description": "/subscriptions/13abe1c6-d700-4f8f-916a-8d3bc17bb41e/resourceGroups/mde-dev-rg/providers/Microsoft.CognitiveServices/accounts/mde-dev-cognitive"
  },
  "knowledgeStore": null,
  "encryptionKey": null
}

Please let me know if there is anything else that I can add to improve the question. Thanks!


Albert Herd
  • May want to remove your cognitive service key ;) As for the solution, the straightforward approach would be to do two merges: one for the array of key phrases on each page, then another once every page has a single key phrase text (since they got merged) – arynaq Aug 04 '21 at 13:27
  • See https://stackoverflow.com/questions/61491809/azure-cognitive-search-text-translation-skill-50k-charactes-limitation. – Jennifer Marsman - MSFT Aug 04 '21 at 14:58
  • Hi @JenniferMarsman-MSFT, thanks for your comment. In fact, I started from that question and used it as reference. In my Skills (noted in JSON above), I did use it - I'm passing in the keyPhrases and expect it as merged into KeyPhrases. But the skill doesn't accept this, as it seems that it doesn't like an array of arrays (Required skill input was not of the expected type 'StringCollection') – Albert Herd Aug 04 '21 at 18:34

1 Answer

You don't have to merge the key phrase outputs to insert them into the index.

Assuming your index already has a field called mykeyphrases of type Collection(Edm.String), to populate it with the key phrase outputs, add this indexer output field mapping:

"outputFieldMappings": [
  ...

  {
    "sourceFieldName": "/document/content/pages/*/keyPhrases/*",
    "targetFieldName": "mykeyphrases"
  },

  ...
]

The /* at the end of sourceFieldName is important: it is what flattens the array of arrays of strings into a single array of strings. The same path also works as a skill input if you want to pass an array of strings to another skill for further enrichment.
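As a rough illustration of the flattening described above, here is a small Python toy model of wildcard path resolution. The `matches` and `resolve` helpers and the dictionary layout are hypothetical and only mimic the behaviour described in this answer; they are not how Azure evaluates enrichment paths internally.

```python
from typing import Any, Iterator, List


def matches(node: Any, segments: List[str]) -> Iterator[Any]:
    """Yield every node reached by the path; '*' fans out over list items."""
    if not segments:
        yield node
        return
    head, rest = segments[0], segments[1:]
    if head == "*":
        for item in node:
            yield from matches(item, rest)
    else:
        yield from matches(node[head], rest)


def resolve(root: dict, path: str) -> List[Any]:
    """Collect all matched nodes into one list (toy model, assumed semantics)."""
    return list(matches(root, path.strip("/").split("/")))


# Hypothetical enrichment tree: two pages, each with its own key phrases.
root = {"document": {"content": {"pages": [
    {"keyPhrases": ["Azure", "Search"]},
    {"keyPhrases": ["Cognitive", "Services"]},
]}}}

nested = resolve(root, "/document/content/pages/*/keyPhrases")
flat = resolve(root, "/document/content/pages/*/keyPhrases/*")

print(nested)  # [['Azure', 'Search'], ['Cognitive', 'Services']]
print(flat)    # ['Azure', 'Search', 'Cognitive', 'Services']
```

Without the trailing /*, each match is a whole per-page array, so the result is nested. With it, each match is a single string, so the result is a flat collection that fits a Collection(Edm.String) field.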

8163264128