0

I'm using Elasticsearch DSL, I'm trying to use a query result as a parameter for another query like below:

{
  "query": {
    "bool": {
      "must_not": {
        "terms": {
          "request_id": {
            "query": {
              "match": {
                "processing.message": "OUT Followup Synthesis"
              }
            },
            "fields": [
              "request_id"
            ],
            "_source": false
          }
        }
      }
    }
  }
}

As you can see above I'm trying to search for sources that their request_id is not one of the request_idswith processing.message equals to OUT Followup Synthesis.

I'm getting an error with this query:

Error loading data [x_content_parse_exception] [1:1660] [terms_lookup] unknown field [query]

How can I achieve my goal using Elasticsearch DSL?

Joe - GMapsBook.com
  • 15,787
  • 4
  • 23
  • 68
Ismail
  • 2,322
  • 1
  • 12
  • 26
  • I don't follow. What's the relation between `request_id` and `processing.message` such that you cannot do a normal `must_not` query and retrieve just the sources' `request_ids`? – Joe - GMapsBook.com Feb 18 '21 at 17:28
  • @JoeSorocin, the request_id, and processing.message are two fields in the same object. – Ismail Feb 18 '21 at 17:40
  • I understand that but no `request_id` will be equal to `OUT Followup Synthesis`, right? In which case, why mention `request_id` in the query at all when you can directly query `processing.message` and simply return the matching sources' `request_ids`? Or am I missing something here? – Joe - GMapsBook.com Feb 18 '21 at 18:18
  • 1
    As I said, I'm trying to fetch data with processing.message equals to 'IN Followup Sythesis' with their request_id doesn't appear in data with processing.message equals to 'OUT Followup Sythesis'. In SQL language: SELECT d FROM Data d WHERE d.processing.message = 'IN Followup Sythesis' AND d.request_id NOT IN (SELECT request_id FROM Data WHERE processing.message = 'OUT Followup Sythesis'); – Ismail Feb 18 '21 at 20:59
  • I gotcha now! My answer will follow. – Joe - GMapsBook.com Feb 18 '21 at 22:36

1 Answers1

2

Original question extracted from the comments

I'm trying to fetch data with processing.message equals to 'IN Followup Sythesis' with their request_id doesn't appear in data with processing.message equals to 'OUT Followup Sythesis'. In SQL language:

SELECT d FROM   data d
WHERE  d.processing.message = 'IN Followup Sythesis'
       AND d.request_id NOT IN (SELECT request_id FROM data WHERE processing.message = 'OUT Followup Sythesis'); 

Answer: generally speaking, neither application-side joins nor subqueries are supported in Elasticsearch.

So you'll have to run your first query, take the retrieved IDs and put them into a second query — ideally a terms query.


Of course, this limitation can be overcome by "hijacking" a scripted metric aggregation.

Taking these 3 documents as examples:

POST reqs/_doc
{"request_id":"abc","processing":{"message":"OUT Followup Synthesis"}}

POST reqs/_doc
{"request_id":"abc","processing":{"message":"IN Followup Sythesis"}}

POST reqs/_doc
{"request_id":"xyz","processing":{"message":"IN Followup Sythesis"}}

you could run

POST reqs/_search
{
  "size": 0,
  "query": {
    "match": {
      "processing.message": "IN Followup Sythesis"
    }
  },
  "aggs": {
    "subquery_mock": {
      "scripted_metric": {
        "params": {
          "disallowed_msg": "OUT Followup Synthesis"
        }, 
        "init_script": "state.by_request_ids = [:]; state.disallowed_request_ids = [];",
        "map_script": """
          def req_id = params._source.request_id;
          def msg = params._source.processing.message;
          
          if (msg.contains(params.disallowed_msg)) {
            state.disallowed_request_ids.add(req_id);
            // won't need this particular doc so continue looping
            return;
          }
          
          if (state.by_request_ids.containsKey(req_id)) {
            // there may be multiple docs under the same ID
            // so concatenate them
            state.by_request_ids[req_id].add(params._source);
          } else {
            // initialize an appendable arraylist
            state.by_request_ids[req_id] = [params._source];
          }
        """,
        "combine_script": """
          state.by_request_ids.entrySet()
            .removeIf(entry -> state.disallowed_request_ids.contains(entry.getKey()));
          return state.by_request_ids
        """,
        "reduce_script": "return states"
      }
    }
  }
}

which'd return only the correct request:

"aggregations" : {
  "subquery_mock" : {
    "value" : [
      {
        "xyz" : [
          {
            "processing" : { "message" : "IN Followup Sythesis" },
            "request_id" : "xyz"
          }
        ]
      }
    ]
  }
}

⚠️ This is almost guaranteed to be slow and goes against the suggested guidance of not accessing the _source field. But it also goes to show that subqueries can be "emulated".

I'd recommend to test this script on a smaller set of documents before letting it target your whole index — maybe restrict it through a date range query or similar.


FYI Elasticsearch exposes an SQL API, though it's only offered through X-Pack, a paid offering.

Joe - GMapsBook.com
  • 15,787
  • 4
  • 23
  • 68
  • Thank you very much, it works, that exactly what I want. – Ismail Feb 19 '21 at 12:57
  • Nice, you're welcome! Hey BTW I'm launching my elasticsearch handbook very soon and I think you'd find it useful. [Let me know](https://jozefsorocin.typeform.com/to/XeQRxdwV) what you'd like to learn about and I'll let you know as soon as it's out! – Joe - GMapsBook.com Feb 19 '21 at 14:11