4

ElasticSearch enables us to filter a set of documents by regex on any given field, and also to group the resulting documents by the terms in a given (same or different field, using "bucket aggregations". For example, on an index that contains a "Url" field and a "UserAgent" field (some kind of web server log), the following will return the top document counts for terms found in the UserAgent field.

{
    query: { filtered: { filter: { regexp: { Url : ".*interestingpage.*" } } } },
    size: 0,                            
    aggs: { myaggregation: { terms: { field: "UserAgent" } } }                          
}

What I'd like to do is use the power of the regexp filter (which operates on the whole field, not just terms within a field) to manually define my aggregation buckets, so that I can relatively reliably split my documents/counts/hits by "user agent type" data, rather than the arbitrary terms parsed by elastic search in the field.

Basically, I am looking for the equivalent of a CASE statement in a GROUP BY, in SQL terms. The SQL query that would express my intent would be something like:

SELECT Bucket, Count(*)
FROM (
    SELECT CASE 
        WHEN UserAgent LIKE '%android%' OR UserAgent LIKE '%ipad%' OR UserAgent LIKE '%iphone%' OR UserAgent LIKE '%mobile%' THEN 'Mobile'
        WHEN UserAgent LIKE '%msie 7.0%' then 'IE7'
        WHEN UserAgent LIKE '%msie 8.0%' then 'IE8'
        WHEN UserAgent LIKE '%firefox%' then 'FireFox'
        ELSE 'OTHER'
        END Bucket
    FROM pagedata
    WHERE Url LIKE '%interestingpage%'
) Buckets
GROUP BY Bucket

Can this be done in an ElasticSearch query?

Tao
  • 13,457
  • 7
  • 65
  • 76
  • Not sure how I missed this before, but there seems to be an existing similar question/answer around script-based bucketing: http://stackoverflow.com/questions/23256743/elasticsearch-aggregation-using-a-script-to-transform-field-value-being-aggregat – Tao Jun 18 '15 at 10:34

2 Answers2

3

This is an interesting use-case.

Here's a more Elasticsearch-way solution. The idea is to do all this regex matching at indexing time and the search time to be fast (scripts during search time, if there are many documents, are not performing well and will take time). Let me explain:

  1. define a sub-field for your main field, in which the manipulation of terms is customized

  2. this manipulation will be performed so that the only terms that will be kept in the index will be the ones you defined: FireFox, IE8, IE7, Mobile. Each document can have more than one of these fields. Meaning a text like msie 7.0 sucks and ipad rules will generate only two terms: IE7 and Mobile.

All this is made possible by the keep token filter.

  1. there should be another list of token filters that will actually perform the replacement. This will be possible by using the pattern_replace token filter.

  2. because you have two words that should be replaced (msie 7.0 for example), you need a way to capture these two words (msie and 7.0) one beside the other. This will be possible using the shingle token filter.

Let me put all this together and provide the complete solution:

PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_pattern_replace_analyzer": {
          "tokenizer": "whitespace",
          "filter": [
            "filter_shingle",
            "my_pattern_replace1",
            "my_pattern_replace2",
            "my_pattern_replace3",
            "my_pattern_replace4",
            "words_to_be_kept"
          ]
        }
      },
      "filter": {
        "filter_shingle": {
          "type": "shingle",
          "max_shingle_size": 10,
          "min_shingle_size": 2,
          "output_unigrams": true
        },
        "my_pattern_replace1": {
          "type": "pattern_replace",
          "pattern": "android|ipad|iphone|mobile",
          "replacement": "Mobile"
        },
        "my_pattern_replace2": {
          "type": "pattern_replace",
          "pattern": "msie 7.0",
          "replacement": "IE7"
        },
        "my_pattern_replace3": {
          "type": "pattern_replace",
          "pattern": "msie 8.0",
          "replacement": "IE8"
        },
        "my_pattern_replace4": {
          "type": "pattern_replace",
          "pattern": "firefox",
          "replacement": "FireFox"
        },
        "words_to_be_kept": {
          "type": "keep",
          "keep_words": [
            "FireFox", "IE8", "IE7", "Mobile"
          ]
        }
      }
    }
  },
  "mappings": {
    "test": {
      "properties": {
        "UserAgent": {
          "type": "string",
          "fields": {
            "custom": {
              "analyzer": "my_pattern_replace_analyzer",
              "type": "string"
            }
          }
        }
      }
    }
  }
}

Test data:

POST /test/test/_bulk
{"index":{"_id":1}}
{"UserAgent": "android OS is the best firefox"}
{"index":{"_id":2}}
{"UserAgent": "firefox is my favourite browser"}
{"index":{"_id":3}}
{"UserAgent": "msie 7.0 sucks and ipad rules"}

Query:

GET /test/test/_search?search_type=count
{
  "aggs": {
    "myaggregation": {
      "terms": {
        "field": "UserAgent.custom",
        "size": 10
      }
    }
  }
}

Results:

   "hits": {
      "total": 3,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "myaggregation": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "FireFox",
               "doc_count": 2
            },
            {
               "key": "Mobile",
               "doc_count": 2
            },
            {
               "key": "IE7",
               "doc_count": 1
            }
         ]
      }
   }
Andrei Stefan
  • 51,654
  • 6
  • 98
  • 89
  • This is a very interesting and thorough answer, but unfortunately it is not applicable in my case (and does not answer the question as stated) because I do not have any control over the index and indexing process, nor does my need justify the owners making changes to the indexing process. I really need a query-based solution rather than an index-based solution. – Tao Jun 18 '15 at 10:25
  • @Tao this might be the only solution. The Groovy script one - I considered it first, as well - but it will not work all the time because texts that contain both `firefox` and `ipad` will only be found in one bucket, not two buckets, as they should. The Groovy script will return either `FireFox` or `Mobile`, not both. An agent like `Mozilla/5.0 (Android 2.2.2; Linux; Opera Mobi/ADR-1103311355; U; en; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 Opera 11.00` will fall into `Mobile` bucket, and not in `FireFox` as well. – Andrei Stefan Jun 18 '15 at 11:04
  • "texts that contain both firefox and ipad will only be found in one bucket, not two buckets" - this is actually the way I want it to work (and the way that the SQL works) - I see your concern (and am curious as to how a script aggregation can overcome this), but the script aggregation really does look like the right (and only) way to get what I need in the environment that I have. I just need to convince the owners to enable scripting :) – Tao Jun 18 '15 at 18:38
  • If that's the requirement, then cool :-). But my solution is a no-go since you have no way to change the mapping. – Andrei Stefan Jun 18 '15 at 19:11
2

You could use a terms aggregation with a scripted field:

{
  query: { filtered: { filter: { regexp: { Url : ".*interestingpage.*" } } } },
  size: 0,
  aggs: {
    myaggregation: {
      terms: {
        script: "doc['UserAgent'] =~ /.*android.*/ || doc['UserAgent'] =~ /.*ipad.*/ || doc['UserAgent'] =~ /.*iphone.*/ || doc['UserAgent'] =~ /.*mobile.*/ ? 'Mobile' : doc['UserAgent'] =~ /.*msie 7.0.*/ ? 'IE7' : '...you got the idea by now...'"
      }
    }
  }
}

But beware of the performance hit!

Tao
  • 13,457
  • 7
  • 65
  • 76
Shadocko
  • 1,186
  • 9
  • 27
  • This looks like the "Right" answer, but I can't get it to work - I'll have to figure out whether we're running an old version of ES or whether there are restrictions on scripts or something. – Tao Jun 18 '15 at 10:30
  • 1
    Probably you need to [enable dynamic scripting](https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-scripting.html#enable-dynamic-scripting), but beware of [this](http://www.cve.mitre.org/cgi-bin/cvename.cgi?name=2015-1427) and follow the instructions [here](https://www.elastic.co/blog/running-groovy-scripts-without-dynamic-scripting). – Andrei Stefan Jun 18 '15 at 10:49
  • @Tao thanks for the edit and sorry for the fumble! You should also vote up Andrei's answer because it feels like the better one for the general case. Did you get it to work? – Shadocko Jun 18 '15 at 12:07
  • For completeness, let me note here that our administrators had switched from Groovy to Lucene Expressions for dynamic scripting in our ES installation. Lucene expression scripts cannot access non-numeric fields in the index, so this solution cannot be used (in my environment at the moment). I will most likely need to lobby for my ES admins to introduce calculated fields in the index as @AndreiStefan suggested in another answer, even though this scripting answer is closest to what I requested. – Tao Jun 19 '15 at 15:39
  • Well, your administrators' choice was to be expected given CVE-2015-1427... Do you have the possibility to store script files in config/scripts/ ? – Shadocko Jun 22 '15 at 07:15