3

Sometimes it's useful to allow numbers to be treated as keywords or strings when using a search index. For example, suppose I have transaction data something like this:

[
  { "amount": 715,  "description": "paypal payment" },
  { "amount": 7500, "description": "second visa payment" },
  { "amount": 7500, "description": "third visa payment" }
]

I might want to allow a search box entry such as "7500 second" to produce the last two rows, with the "second visa payment" row scoring highest.

How can I achieve this with Mongo DB Atlas, using its search index facility?

In Elastic Search, it's possible by adding a keyword field on the numeric field, as per this example:

INDEX=localhost:9200/test
curl -X DELETE "$INDEX?pretty"
curl -X PUT "$INDEX?pretty" -H 'Content-Type: application/json' -d'
{
  "mappings" : {
    "properties" : {
      "amount" : {
        "type" : "long",
        "fields" : {
          "keyword" : {
            "type" : "keyword",
            "ignore_above" : 256
          }
        }
      },
      "description" : {
        "type" : "text",
        "fields" : {
          "keyword" : {
            "type" : "keyword",
            "ignore_above" : 256
          }
        }
      }
    }
  }
}'
curl -X POST "$INDEX/_bulk?pretty" -H 'Content-Type: application/x-ndjson' -d '
{ "index": {"_id":"61d244595c590a67157d5f82"}}
{ "amount": 512,"description": "paypal payment" }
{ "index": {"_id":"61d244785c590a67157d62b3"}}
{ "amount": 7500, "description": "second visa payment" }
{ "index": {"_id":"61d244785c590a67157d62b4"}}
{ "amount": 7500, "description": "third visa payment" }
'
sleep 1

curl -s -X GET "$INDEX/_search?pretty" -H 'Content-Type: application/json' -d'
{
  "query": {
    "query_string": {
        "query": "75* second"

    }
  }
}
' # | jq '.hits.hits[] | {_source,_score}'

Here the search on "75* second" gives the desired result:

{
  "_source": {
    "amount": 7500,
    "description": "second visa payment"
  },
  "_score": 1.9331132
}
{
  "_source": {
    "amount": 7500,
    "description": "third visa payment"
  },
  "_score": 1
}

With eqivalent data in Mongo Atlas (v5.0), I've tried setting up an index with a lucene.keyword on the "amount" field as a string, but it has no effect on the results (which only pay attention to the description field). Similarly, added a string field type on the amount field doesn't produce any rows: it seems Mongo Atlas Search insists on using number-type queries on numeric fields.

I'm aware that I can use a more complex compound query, combining numeric and string fields, to get the result (example below), but this isn't necessarily convenient for a user, who just wants to chuck terms in a box without worrying about field names. I may wish to search over ALL number fields in a row, rather than just one, and include results where only some of the terms match, potentially fuzzily. (A possible use case here is searching over transaction data, with a question like "when was my last payment for about 200 dollars to Steven?" in mind).

One possibility might be to create an "all text" field in the mongo DB, allowing the numbers to be stored as strings, and similar to what happens (or used to happen) in Elastic Search. This might require a materialized view on the data, or else an additional, duplicative field, which then would be indexed.... is there an easier solution, or one that involves less data duplication? (the table in question is large, so storage costs matter).

The data in mongo look something like this. amount could be a float or an integer (or likely both, in different fields).

{"_id":{"$oid":"61d244595c590a67157d5f82"},"amount":{"$numberInt":"512"},"description":"paypal payment"}
{"_id":{"$oid":"61d244785c590a67157d62b3"},"amount":{"$numberInt":"7500"},"description":"second visa payment"}
{"_id":{"$oid":"61d244785c590a67157d62b4"},"amount":{"$numberInt":"7500"},"description":"third visa payment"}

An example of a search index definition I've tried (among many!) is:

{
  "mappings": {
    "dynamic": false,
    "fields": {
      "amount": {
        "multi": {
          "test": {
            "analyzer": "lucene.keyword",
            "ignoreAbove": null,
            "searchAnalyzer": "lucene.keyword",
            "type": "string"
          }
        },
        "type": "string"
      },
      "description": {
        "type": "string"
      }
    }
  },
  "storedSource": true
}

...and a sample search pipeline is:

[
  {
    "$search": {
      "index": "test",
      "text": {
        "path": {
          "wildcard": "*"
        },
        "query": "7500 second"
      }
    }
  },
  {
    "$project": {
      "_id": 1,
      "description": 1,
      "amount": 1,
      "score": {
        "$meta": "searchScore"
      }
    }
  }
]

This gives only the second row (i.e. the "7500" in the query is effectively ignored, and only the description field matches):

[
  {
    "_id": "61d244785c590a67157d62b3",
    "amount": 7500,
    "description": "second visa payment",
    "score": 0.42414236068725586
  }
]

The following compound query does work, but it's overly complex to produce, especially with many numeric and string fields:

{
  "index": "test",
  "compound": {
    "should": [
      {
        "text": {
          "query": "second",
          "path": "description"
        }
      },
      {
        "near": {
          "path": "amount",
          "origin": 7500,
          "pivot": 1
        }
      }
    ]
  }
}

Documentation on field types and mappings is at https://www.mongodb.com/docs/atlas/atlas-search/define-field-mappings/, operators and collectors at https://www.mongodb.com/docs/atlas/atlas-search/operators-and-collectors/ .

See https://www.elastic.co/guide/en/elasticsearch/reference/current/keyword.html for Elastic's guidance on why and when it can be useful to index numeric fields as keywords.

phhu
  • 1,462
  • 13
  • 33
  • What's happening in Elasticsearch in your example may seem like a positive thing but it probably is not long term. The types here are muddled. – Nice-Guy May 13 '22 at 11:50
  • "probably not in the long term": perhaps, but it's intuitive, and a similar thing works for google, who are doing ok.... e.g. https://www.google.com/search?q=bitcoin+30000 . I want to keep strict types in the source data (hence my reluctance to add say a string field containing the numbers), but I'm happy for a search index to show flexibility in type. – phhu May 13 '22 at 11:54

1 Answers1

2

In Atlas Search, the data type defined in your index definition determines what operators you can use to query the values. In this case, I used the default and the following query options to target the values you are looking for above.

For only the numeric value:

{
  compound: {
    should: [
      {
        range: {
          gt: 7499,
          lte: 7500,
          path: 'amount'
        }
      }
    ]
  }
}

If I want query for both the text and the number, it's also simply a compound query, though an edgeGram autocomplete field type would be desired in an optimal state. It's really important for me to simplify before optimizing:

{
  compound: {
    must: [
      {
        range: {
          gt: 7499,
          lte: 7500,
          path: 'amount'
        }
      },
      {
        wildcard: {
          query: "*",
          path: 'description',
          allowAnalyzedField: true
        }
      }
    ]
  }
}

I hope this is helpful. keyword is only a good analyzer for this field in the case of the description field if you want to do the wildcard. Standard or the language that description is written in would both be better.

Nice-Guy
  • 1,457
  • 11
  • 20
  • 1
    I'm a little unclear on "the data type defined in your index definition determines what operators you can use to query the value": does that mean the data type in the index definition has to be the same as (or similar to: e.g. int32 to float might be ok, but int32 to string might not be) the data type in the index? Then why can I define multiple index types for a given field - as in my example, where I try to index amount (a number in the source documents) as a string (and / or string-like keyword), in order to use it with e.g. a text search operator? Is it just that it doesn't work? – phhu May 13 '22 at 18:29
  • 1
    You need to cast the data to a string in a field of another name like `amount_string` in a trigger. That is the simplest way to be able to use wild card in such a way. – Nice-Guy May 15 '22 at 14:06
  • It's so weird to me that these search engines don't work with numerical values. Like it's not an option to natively have them cast a number to its string equivalent. – evolross May 17 '23 at 14:54
  • So an option would be to `split()` your search query by spaces, then look for successful `parseInt()` calls, and then dynamically add those to `range` compound steps on the database fields that are numbers? That's wild. – evolross May 17 '23 at 14:57