
When importing a document, I get the error attached below.

I suspect the problem arose when the data provider (esMapping.js) was changed to use an integer sub-field to sort documents.

Is it possible to use some pattern to sort the documents so that this error does not occur again? Does anyone have an idea?

This question follows up on an earlier one: Enable ascending and descending sorting of numbers that are of the keyword type (Elasticsearch)

Error:

2022-05-18 11:33:32.5830 [ERROR] ESIndexerLogger Failed to commit bulk. Errors:
index returned 400 _index: adama_gen_ro_importdocument _type: _doc _id: 4c616067-4beb-4484-83cc-7eb9d36eb175 _version: 0 error: Type: mapper_parsing_exception Reason: "failed to parse field [number.sequenceNumber] of type [integer] in document with id '4c616067-4beb-4484-83cc-7eb9d36eb175'. Preview of field's value: 'BS-000011/2022'" CausedBy: "Type: number_format_exception Reason: "For input string: "BS-000011/2022"""

Mapping (sequenceNumber used for sorting):

"number": {
        "type": "keyword",
        "copy_to": [
            "_summary"
        ],
        "fields": {
            "sequenceNumber": {
                "type": "integer"
            }
        }
    }
DevinRa
Petar Pan

1 Answer

In the returned error message, the value being indexed into the number field is a string containing alphabetical characters, 'BS-000011/2022'. This is no problem for the number field itself, which has the keyword type. However, it is an issue for the sequenceNumber sub-field, which has the integer type: the value passed into number is also passed into the sequenceNumber sub-field, hence the error.
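For illustration, the failure can be reproduced with a single document (the index name is taken from the error log; this is a sketch, not the asker's actual import):

```
PUT adama_gen_ro_importdocument/_doc/1
{
  "number": "BS-000011/2022"
}
```

Elasticsearch rejects this with the same mapper_parsing_exception, because the copy of the value into number.sequenceNumber cannot be parsed as an integer.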

Unfortunately, the text analyzer used in the previous question won't help either, as sorting can't be performed on a text field. However, the tokenizer used by the custom analyzer document_number_analyzer can be repurposed into an ingest pipeline.

For context, here is the custom tokenizer provided by the author in the previous question:

"tokenizer": {
   "document_number_tokenizer": {
      "type": "pattern",
       "pattern": "-0*([1-9][0-9]*)\/",
       "group": 1
    }
}

If the custom analyzer is applied to the value above with the Elasticsearch _analyze API, like so (stack_index being a temporary index that defines the analyzer):

POST stack_index/_analyze
{
  "analyzer": "document_number_analyzer",
  "text": ["BS-000011/2022"]
}

The analyzer returns one token of 11, but tokens are for search analysis, not sorting.
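Since the tokenizer's pattern is an ordinary regular expression, the same extraction can be sketched outside Elasticsearch. Here is an illustration in Python (the function name is mine, not part of any API): the pattern skips the leading zeros after the dash and captures the significant digits before the slash.

```python
import re

# Same pattern as document_number_tokenizer: skip leading zeros after
# the dash, capture the significant digits before the slash.
PATTERN = re.compile(r"-0*([1-9][0-9]*)/")

def extract_sequence_number(value: str) -> int:
    """Return the numeric part of a document number such as 'BS-000011/2022'."""
    match = PATTERN.search(value)
    if match is None:
        raise ValueError(f"no sequence number found in {value!r}")
    return int(match.group(1))

print(extract_sequence_number("BS-000011/2022"))  # → 11
```

This confirms the single token of 11 mentioned above, but an ingest-time solution is still needed so the value lands in an integer field.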

An Elasticsearch ingest pipeline, using the grok processor, can be applied to the index to extract the desired number from the value and index it as an integer. The processor needs to be configured to expect the value's format, which would be similar to 'BS-000011/2022'. An example is provided below:

PUT _ingest/pipeline/numberSort
{
  "processors": [
    {
      "grok": {
        "field": "number",
        "patterns": ["%{WORD}%{ZEROS}%{SORTVALUES:sequenceNumber:int}%{SEPARATE}%{NUMBER}"],
        "pattern_definitions": {
          "SEPARATE":  "[/]",
          "ZEROS" : "[-0]*",
          "SORTVALUES":  "[1-9][0-9]*"
        }
      }
    }
  ]
}

Grok takes an input text value and extracts structured fields from it. The pattern where the sortable number will be extracted is the SORTVALUES pattern, %{SORTVALUES:sequenceNumber:int}. A new field, called sequenceNumber, will be created in the document. When 'BS-000011/2022' is indexed in the number field, 11 is indexed into the sequenceNumber field as an integer.
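The pipeline can be checked without indexing anything, using the _simulate API:

```
POST _ingest/pipeline/numberSort/_simulate
{
  "docs": [
    { "_source": { "number": "BS-000011/2022" } }
  ]
}
```

The simulated document in the response should contain "sequenceNumber": 11 alongside the original number field.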

You can then create an index template to apply the ingest pipeline. The sequenceNumber field will need to be explicitly added with the integer type. The ingest pipeline will populate it automatically as long as a value matching the input format above is indexed into the number field. The sequenceNumber field will then be available to sort on.
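A minimal sketch of such a template (the template name and index pattern are placeholders, and the mappings are abbreviated to the relevant fields):

```
PUT _index_template/importdocument_template
{
  "index_patterns": ["adama_gen_ro_importdocument*"],
  "template": {
    "settings": {
      "index.default_pipeline": "numberSort"
    },
    "mappings": {
      "properties": {
        "number": {
          "type": "keyword",
          "copy_to": ["_summary"]
        },
        "sequenceNumber": {
          "type": "integer"
        }
      }
    }
  }
}
```

Setting index.default_pipeline means the pipeline runs on every document indexed into matching indices, so no change to the import code is needed.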

griegite