9

Is it possible in ElasticSearch to form a query that would preserve the ordering of the terms?

A simple example would be having these documents indexed using standard analyzer:

  1. You know for search
  2. You know search
  3. Know search for you

I could query for +you +search and this would return all documents, including the third one.

What if I wanted to only retrieve the documents which have the terms in this specific order? Can I form a query that would do that for me?

Considering that this is possible for phrases by simply quoting the text, e.g. "you know" (which retrieves the 1st and 2nd docs), it feels to me like there should be a way of preserving the order of multiple terms that aren't adjacent.

In the above simple example I could use proximity searches, but this doesn't cover more complex cases.

Artur

3 Answers

15

You could use a span_near query; it has an in_order parameter.

{
    "query": {
        "span_near": {
            "clauses": [
                {
                    "span_term": {
                        "field": "you"
                    }
                },
                {
                    "span_term": {
                        "field": "search"
                    }
                }
            ],
            "slop": 2,
            "in_order": true
        }
    }
}
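
For more than two terms, the same pattern extends by adding one span_term clause per term. Here is a minimal sketch in Python that only builds the request body as a dict. The helper name `ordered_terms_query` and the default slop of 100 are assumptions made for illustration, and the body still has to be sent with whichever client you use.

```python
import json


def ordered_terms_query(field, terms, slop=100):
    """Build a span_near body matching `terms` in the given order on `field`.

    `slop` caps how many unmatched positions may sit between the terms;
    the default of 100 is an arbitrary illustration value.
    """
    if len(terms) < 2:
        raise ValueError("need at least two terms to express an ordering")
    # span_term is not analyzed, so terms are lowercased here on the
    # assumption that the field was indexed with the standard analyzer.
    clauses = [{"span_term": {field: term.lower()}} for term in terms]
    return {
        "query": {
            "span_near": {
                "clauses": clauses,
                "slop": slop,
                "in_order": True,
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(ordered_terms_query("title", ["you", "search"]), indent=2))
```

A slop value is still required, so for terms an arbitrary distance apart it has to be set to something comfortably larger than the longest field you expect.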
Dan Tuffery
  • Thought to add some query_string with wildcards, this one must be times better! – Victor Suzdalev Oct 30 '14 at 21:49
  • Thanks a lot. It does work. I wish there was a way to do it without specifying a slop value, though I know I could cheat by setting slop to some high value. – Artur Oct 31 '14 at 10:39
6

Phrase matching doesn't ensure order ;-). If you specify a large enough slop (2, for example), "hello world" will match "world hello". But this is not necessarily a bad thing, because search results are usually more relevant when two terms are close to each other, whatever their order. And I don't think the authors of this feature had in mind matching words that are 1000 positions apart.
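
To see why a slop of 2 is enough to flip two adjacent words, here is a rough Python model of the slop calculation. It is a simplification and not Lucene's actual implementation: it assumes the phrase terms are distinct, uses a crude regex tokenizer in place of the standard analyzer, and the function names `tokenize` and `min_slop` are made up for this sketch. Each document position is shifted back by the term's offset in the phrase, and the spread of the shifted values is taken as the slop needed.

```python
import re
from itertools import product


def tokenize(text):
    """Lowercase and split into word tokens (rough stand-in for the standard analyzer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def min_slop(phrase, document):
    """Smallest slop at which `phrase` would match `document` under this
    simplified model, or None if a phrase term is missing from the document."""
    query_terms = tokenize(phrase)
    doc_tokens = tokenize(document)
    # All document positions of each query term, in query order.
    positions = [
        [i for i, tok in enumerate(doc_tokens) if tok == term]
        for term in query_terms
    ]
    if any(not p for p in positions):
        return None
    best = None
    for combo in product(*positions):
        # Shift each document position back by the term's offset in the phrase.
        # An exact phrase gives identical shifted values, i.e. a width of 0.
        shifted = [pos - offset for offset, pos in enumerate(combo)]
        width = max(shifted) - min(shifted)
        if best is None or width < best:
            best = width
    return best


if __name__ == "__main__":
    for doc in ["hello world", "hello nice world", "world hello"]:
        print(doc, "->", min_slop("hello world", doc))
```

Under this model "hello nice world" needs a slop of 1 and "world hello" needs 2, which is why raising the slop to bridge a gap eventually lets reversed matches through as well.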

The one solution I could find that keeps the order is not simple, though: using scripts. Here's one example:

POST /my_index/my_type/_bulk
{ "index": { "_id": 1 }}
{ "title": "hello world" }
{ "index": { "_id": 2 }}
{ "title": "world hello" }
{ "index": { "_id": 3 }}
{ "title": "hello term1 term2 term3 term4 world" }

POST my_index/_search
{
  "query": {
    "filtered": {
      "query": {
        "match": {
          "title": {
            "query": "hello world",
            "slop": 5,
            "type": "phrase"
          }
        }
      },
      "filter": {
        "script": {
          "script": "term1Pos=0;term2Pos=0;term1Info = _index['title'].get('hello',_POSITIONS);term2Info = _index['title'].get('world',_POSITIONS); for(pos in term1Info){term1Pos=pos.position;}; for(pos in term2Info){term2Pos=pos.position;}; return term1Pos<term2Pos;",
          "params": {}
        }
      }
    }
  }
}

To make the script more readable, here it is again with indentation:

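// Both positions default to 0 and stay 0 if the term is absent.
term1Pos = 0;
term2Pos = 0;
// Fetch the position information of each term in the 'title' field.
term1Info = _index['title'].get('hello',_POSITIONS);
term2Info = _index['title'].get('world',_POSITIONS);
// Each loop overwrites the variable, so it ends up holding the last position.
for(pos in term1Info) {
  term1Pos = pos.position;
};
for(pos in term2Info) {
  term2Pos = pos.position;
};
// Keep the document only if 'hello' comes before 'world'.
return term1Pos < term2Pos;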

The query above searches for "hello world" with a slop of 5, which matches all three of the documents above. The script filter then keeps only the documents where the position of "hello" is lower than the position of "world". This way, however large the slop in the query, comparing the positions enforces the order.
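
The comparison the script performs can be tried out without a cluster. The sketch below is plain Python, not the Groovy that Elasticsearch runs, and the function names and the regex tokenizer are assumptions made for illustration. From reading the script, each loop keeps only the last position it sees, so the check is really "last hello before last world"; the second function shows a looser variant for comparison.

```python
import re


def tokenize(text):
    """Lowercase and split into word tokens (rough stand-in for the standard analyzer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def last_position(tokens, term):
    """Position of the last occurrence of `term`, or 0 if it is absent,
    mirroring the script, which starts at 0 and overwrites it per occurrence."""
    pos = 0
    for i, tok in enumerate(tokens):
        if tok == term:
            pos = i
    return pos


def script_style_in_order(text, first, second):
    """Same comparison as the script: last `first` strictly before last `second`."""
    tokens = tokenize(text)
    return last_position(tokens, first) < last_position(tokens, second)


def any_occurrence_in_order(text, first, second):
    """Looser variant: some occurrence of `first` precedes some occurrence of `second`."""
    tokens = tokenize(text)
    if first not in tokens or second not in tokens:
        return False
    first_of_first = tokens.index(first)
    last_of_second = len(tokens) - 1 - tokens[::-1].index(second)
    return first_of_first < last_of_second


if __name__ == "__main__":
    for doc in ["hello world", "world hello", "hello world hello"]:
        print(doc, script_style_in_order(doc, "hello", "world"),
              any_occurrence_in_order(doc, "hello", "world"))
```

The two checks differ on a text like "hello world hello", where a term repeats, so which one is wanted depends on how repeated terms should be treated.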

This is the section in the documentation that sheds some light on the things used in the script above.

Andrei Stefan
  • +1 for the effort and the solution. I'm still hoping there might be a more elegant/less hacky solution to my problem ;) – Artur Oct 29 '14 at 17:13
  • I'm curious, as well, about some other solution. This was the only approach I could think of. – Andrei Stefan Oct 29 '14 at 17:21
  • Actually having tried running it I'm getting `QueryPhaseExecutionException` that seems to be caused by `GroovyScriptExecutionException[MissingPropertyException[No such property: title for class: Script4]];` I'm running v1.4 beta. – Artur Oct 30 '14 at 10:47
  • `title` is just the name of the field on which you want your `hello` and `world` to find matches in. In my sample the name of the field was `title`. – Andrei Stefan Oct 30 '14 at 11:28
4

This is exactly what a match_phrase query (see here) does.

It checks the position of the terms, on top of their presence.

For example, these documents:

POST test/values
{
  "test": "Hello World"
}

POST test/values
{
  "test": "Hello nice World"
}

POST test/values
{
  "test": "World, I don't say hello"
}

will all be found with the basic match query:

POST test/_search
{
  "query": {
    "match": {
      "test": "Hello World"
    }
  }
}

But using match_phrase, only the first document is returned:

POST test/_search
{
  "query": {
    "match_phrase": {
      "test": "Hello World"
    }
  }
}

{
   ...
   "hits": {
      "total": 1,
      "max_score": 2.3953633,
      "hits": [
         {
            "_index": "test",
            "_type": "values",
            "_id": "qFZAKYOTQh2AuqplLQdHcA",
            "_score": 2.3953633,
            "_source": {
               "test": "Hello World"
            }
         }
      ]
   }
}

In your case, you want to allow some distance between your terms. This can be achieved with the slop parameter, which indicates how far apart the terms are allowed to be:

POST test/_search
{
  "query": {
    "match": {
      "test": {
        "query": "Hello world",
        "slop":1,
        "type": "phrase"
      }
    }
  }
}

With this last request, you find the second document too:

{
   ...
   "hits": {
      "total": 2,
      "max_score": 0.38356602,
      "hits": [
         {
            "_index": "test",
            "_type": "values",
            "_id": "7mhBJgm5QaO2_aXOrTB_BA",
            "_score": 0.38356602,
            "_source": {
               "test": "Hello World"
            }
         },
         {
            "_index": "test",
            "_type": "values",
            "_id": "VKdUJSZFQNCFrxKk_hWz4A",
            "_score": 0.2169777,
            "_source": {
               "test": "Hello nice World"
            }
         }
      ]
   }
}

You can find a whole chapter about this use case in the definitive guide.
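
The requests above use the syntax of the Elasticsearch version current at the time. Later versions dropped the `"type": "phrase"` form of the match query in favour of passing slop to match_phrase directly. A minimal sketch in Python that builds such a body follows; the helper name `phrase_query` is hypothetical and nothing here talks to a cluster.

```python
import json


def phrase_query(field, text, slop=0):
    """Build a match_phrase request body for `text` on `field`.

    slop=0 requires the terms to be adjacent and in order; larger values
    allow gaps and, once large enough, reordered terms too.
    """
    if slop < 0:
        raise ValueError("slop must not be negative")
    return {
        "query": {
            "match_phrase": {
                field: {"query": text, "slop": slop}
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(phrase_query("test", "Hello world", slop=1), indent=2))
```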

ThomasC
  • In the OP's example he would need to use the slop value too. – Dan Tuffery Oct 29 '14 at 15:57
  • What if the words were an arbitrary distance apart? `hello one two... n terms world`? If the slop is high eventually it would match "World, I don't say hello" too. Also slop 1 would match "world hello", right? – Artur Oct 29 '14 at 16:25
  • 4
    Phrase matching doesn't ensure order ;-). If you specify a large enough slop (2, for example), "hello world" will match "world hello". – Andrei Stefan Oct 29 '14 at 16:29
  • Yeah, @AndreiStefan and that sucks. Preserving ordering would be so useful in my use case. – Artur Oct 29 '14 at 16:46
  • Nice answer. I had the same issue. This is really helpful. – Aviro Aug 01 '18 at 00:56