13

Issue

I'm trying to save a boolean value to elasticsearch, but it is specifically valid for it to be NULL. A sort of don't-care in this case.

There seem to be several options, but it is not completely clear what the best would be.

We are using ElasticSearch version 5.0.2

Option 1

The trivial one would be to save it as a boolean with NULL values. Those would be seen as 'missing' by ES.

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "my_boolean": { "type": "boolean"}
      }
    }
  }
}

PUT my_index/my_type/1
{"my_boolean": true}

PUT my_index/my_type/2
{"my_boolean": false}

PUT my_index/my_type/3
{"my_boolean": null}

This has several issues, one of them being aggregations. There doesn't seem to be an easy way to get the values true, false and NULL in an aggregation.

The missing feature is known to me, so I know I can do the following:

GET my_index/_search
{
  "size":0,
  "aggregations": {
    "my_boolean": {
      "terms": {
        "field": "my_boolean"
      }
    },
    "missing_fields": {
          "missing" : {
            "field": "my_boolean"
          }
    }
  }
}

But this will result in a bucket with 2 values (true/false) and a separate count for the missing documents. That looks like it will cause problems.

Option 2

Another option is to actually give the NULL a value, as described in the manual. The problem is that the value needs to be the correct type, and there is nothing but true and false as a boolean.

The null_value needs to be the same datatype as the field. For instance, a long field cannot have a string null_value.

This means we can use a different type that supports more then 2 values, e.g. integer, but that would be in my head the same as saying: lets map it as integer, and define 1 as true, 2 as false and 3 as null. This would work, but we'd have an implicit mapping that all should know about. (All producers/consumers/whatyamahaveits).

Option 3

A final version would be to try and script our way out of this problem.

GET my_index/_search
{
  "size":0,
  "aggregations": {

    "my_boolean": {
      "terms": {
       "script" : {
        "inline": "if(doc['my_boolean'].length === 1) { if(doc['my_boolean'].value === true){ return 1;} else {return 2;} } else { return 3;}"
        }
      }
    }
  }
}

Now we do get the right results in somewhat sane buckets.

"aggregations": {
"my_boolean": {
  "doc_count_error_upper_bound": 0,
  "sum_other_doc_count": 0,
  "buckets": [
    {
      "key": "1",
      "doc_count": 1
    },
    {
      "key": "2",
      "doc_count": 1
    },
    {
      "key": "3",
      "doc_count": 1
    }
  ]
}
}

Note that we still have an implicit mapping with the keys here, so this seems to have some of the same issues that mapping it as an integer has. But still, your datatype is what it should be, so that might be something. Note that we cannot have a bucket with 'null' as key. We can call them "true", "false" and "null" (strings) ofcourse, but this is the same situation, but hidden even more.

Question

what is the best way to deal with this null-problem? (Or maybe we should call it a 'tri-state-boolean-problem'?)

To clarify: we fear that later on a 'non-standard' value might cause problems. The first we saw was the bucketing which we might be able to fix with above script solution, but maybe we run into other issues later. So we are looking for the best-practice of saving this type of data, rather then a quick solution for a specific problem.

Nanne
  • 64,065
  • 16
  • 119
  • 163
  • Good research. In your first option, do you agree that the counts of the true/false buckets + the count of the missing bucket will correctly add up to the total document count? If yes, why do you think that it will cause problems? I would advise against scripting as it won't scale well as your document count grows... Another way could be to sort of "encode" the tri-state of your boolean field into another field so that you can keep your boolean field clean and use that other field to perform your aggregations. – Val Dec 19 '16 at 16:02
  • After reviewing some options on how to best represent values in [three-value logic](https://en.wikipedia.org/wiki/Three-valued_logic), it is not uncommon to use integer values to store the three states (my preference going to -1/0/1). – Val Dec 20 '16 at 07:51
  • Another option is to keep the `my_boolean` field as a boolean with two values (true/false) and have a second boolean field `my_null_boolean` that would be `true` if `my_boolean` is `null` and `null` if `my_boolean` has definite values, i.e. either `true` or `false` – Val Dec 20 '16 at 07:53
  • @val The reason I think this might cause problems would be e.g. the simple case where you want to draw a (kibana?) graph of all 3 buckets. But you don't get 3 buckets, you get 2 buckets and some separate data. You might be able to work around it, but the quickest / most trivial way to make a graph isn't available. Now that's just a quick example, I'm just afraid more of that type of issue might come along with for instance, complicated queries, scoring etc. The same might be the case with a second field, though it does look like a valid option. – Nanne Dec 20 '16 at 08:18
  • Ok, I understand that the Kibana aspect can be problematic, indeed. Why would the keyword/string approach with values "true", "false", "null" not work for you? – Val Dec 20 '16 at 08:25
  • it would probably work, it's just that it is in a broader spectrum a suboptimal sollution. I have a system in which actual true/false/null is used, so i need a mapping to the strings to get them in. Also, all consumers of this instance must be aware of this mapping. This can be done and as I look at it now its probably the best way, but having multiple places to map these values is annoying, so that's why I was checking for best practices ;D – Nanne Dec 20 '16 at 08:32
  • Well, why not using the [`missing` setting](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#_missing_value_12) of the `terms` aggregation (i.e. not a separate `missing` aggregation)?, you could keep using your boolean field and get your three buckets with 0, 1 and -1 (for null)? – Val Dec 20 '16 at 08:45
  • yeah , that's a good alternative as well. This, in my view, is equivalent to any 'mapping' option: our current datamodel (apart from how good that is) works with true/false/null, and in ES we'd have to either explicitly cast it to something else (integer,string), or use missing/null, but then cast it during aggregation. I think I'm leaning towards explicitly casting it, so you'd have a somewhat clear datatype / structure in ES, even if slightly different from the producing side of the data. – Nanne Dec 20 '16 at 09:21
  • Ok, glad you're narrowing in on a solution that could work for you. – Val Dec 20 '16 at 09:22
  • Anything else you need? – Val Dec 23 '16 at 05:44
  • @val Not sure there is more to find out about this topic sadly, so not here. But if you feel inclined, you could take a look at this topic :) http://stackoverflow.com/questions/41285790/what-differs-between-post-filter-and-global-aggregation-for-faceted-search – Nanne Dec 23 '16 at 10:25
  • Ok then maybe write an answer with your chosen solution so you get your rep back ;-) – Val Dec 23 '16 at 10:26
  • Haha, yeah. I'm was actually waiting if you might wanted to type something, as not to 'steal' my own bounty back :D if hte 24-hour warnign comes, i'll definately write a self- answer :D – Nanne Dec 23 '16 at 10:31

2 Answers2

5

You could use the missing setting of the terms aggregation (i.e. not a separate missing aggregation).

That way, you could keep using your boolean field and get your three buckets with 0, 1 and -1 (for null)?

{
  "size":0,
  "aggregations": {
    "my_boolean": {
      "terms": {
        "field": "my_boolean",
        "missing": -1                 <--- add this
      }
    }
  }
}

It doesn't have the disadvantage of having to change the field type and encoding it into some other data type (integer / string) and also frees you from leveraging scripting, since that won't scale very well.

Val
  • 207,596
  • 13
  • 358
  • 360
0

In the end we went for mapping the various states to a byte.

A missing value only works if the type is capable of having that value, so we need the mapping anyway, so we add the extra number during insertion.

So instead of a boolean with true, false and null values, or an integer with 1, 2 and null (with missing = -1) values, we use a byte with 1, 2 and 3, meaning (in random order) true, false and null.

Nanne
  • 64,065
  • 16
  • 119
  • 163