
I have a field path in my Elasticsearch documents which has entries like this:

/logs/hadoop-yarn/container/application_1451299305289_0120/container_e18_1451299305289_0120_01_011007/stderr
/logs/hadoop-yarn/container/application_1451299305289_0120/container_e18_1451299305289_0120_01_008874/stderr

Note: I want to select all the documents whose **path** field contains the line below:
/logs/hadoop-yarn/container/application_1451299305289_0120/container_e18_1451299305289_0120_01_009257/stderr

I want to run a LIKE-style query on this path field given the following (basically an AND condition on all 3):

  1. I am given the application number 1451299305289_0120
  2. I am also given a task number 009257
  3. The path field should also contain stderr

Given the above criteria, the document whose path field matches the 3rd line above should be selected.

This is what I have tried so far:

http://localhost:9200/logstash-*/_search?q=application_1451299305289_0120 AND path:stderr&size=50

This query satisfies the 3rd criterion but only partially the 1st: if I search for 1451299305289_0120 instead of application_1451299305289_0120, I get 0 results. (What I really need is a LIKE-style search on 1451299305289_0120.)

When I tried this

http://10.30.145.160:9200/logstash-*/_search?q=path:*_1451299305289_0120*008779 AND path:stderr&size=50

I got the result, but using * at the start is a costly operation. Is there another way to achieve this efficiently (for example, using nGram or Elasticsearch's fuzzy search)?

Anurag Sharma
Using nGram will be very costly. What you can do instead is use edgeNGram together with a couple of filters at analysis time. I suggest looking at this article: http://stackoverflow.com/questions/9421358/filename-search-with-elasticsearch# It may be of some help, at least in giving you some direction. – Anirudh Modi Dec 30 '15 at 11:31

1 Answer


This can be achieved with the Pattern Replace Char Filter: you extract only the important bits of information with a regex. This is my setup:

POST log_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "app_analyzer": {
          "char_filter": [
            "app_extractor"
          ],
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        },
        "path_analyzer": {
          "char_filter": [
            "path_extractor"
          ],
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        },
        "task_analyzer": {
          "char_filter": [
            "task_extractor"
          ],
          "tokenizer": "keyword",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      },
      "char_filter": {
        "app_extractor": {
          "type": "pattern_replace",
          "pattern": ".*application_(.*)/container.*",
          "replacement": "$1"
        },
        "path_extractor": {
          "type": "pattern_replace",
          "pattern": ".*/(.*)",
          "replacement": "$1"
        },
        "task_extractor": {
          "type": "pattern_replace",
          "pattern": ".*container.{27}(.*)/.*",
          "replacement": "$1"
        }
      }
    }
  },
  "mappings": {
    "your_type": {
      "properties": {
        "name": {
          "type": "string",
          "analyzer": "keyword",
          "fields": {
            "application_number": {
              "type": "string",
              "analyzer": "app_analyzer"
            },
            "path": {
              "type": "string",
              "analyzer": "path_analyzer"
            },
            "task": {
              "type": "string",
              "analyzer": "task_analyzer"
            }
          }
        }
      }
    }
  }
}

I am extracting the application number, task number and path with regex. Note that the task regex assumes exactly 27 characters between "container" and the task number, so you might want to adjust it if you have other log patterns. Then we can use filters to search. A big advantage of using filters is that they are cached, which makes subsequent calls faster.
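To sanity-check the three regexes before indexing, here is a minimal Python sketch that roughly mimics the analyzers above. It is only an approximation: it uses Python's `re` rather than the Java regex engine Elasticsearch uses (so `$1` becomes `\1`), it skips the asciifolding filter, and the helper name `extract_tokens` is hypothetical.

```python
import re

# Patterns copied from the char_filter definitions in the index settings.
# Elasticsearch writes the replacement as "$1"; Python's equivalent is r"\1".
CHAR_FILTERS = {
    "application_number": r".*application_(.*)/container.*",
    "path": r".*/(.*)",
    "task": r".*container.{27}(.*)/.*",
}


def extract_tokens(name):
    """Return the single token each sub-field would roughly hold for `name`."""
    tokens = {}
    for field, pattern in CHAR_FILTERS.items():
        # char filter first, then the keyword tokenizer keeps the whole
        # string as one token, then the lowercase filter is applied
        tokens[field] = re.sub(pattern, r"\1", name).lower()
    return tokens


sample = (
    "/logs/hadoop-yarn/container/application_1451299305289_0120/"
    "container_e18_1451299305289_0120_01_009257/stderr"
)
print(extract_tokens(sample))
```

For the sample path this yields 1451299305289_0120, stderr and 009257 for the three sub-fields, which are exactly the values the term filters need to match.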

I indexed a sample log like this:

PUT log_index/your_type/1
{
  "name" : "/logs/hadoop-yarn/container/application_1451299305289_0120/container_e18_1451299305289_0120_01_009257/stderr"
}

This query will give you the desired results:

GET log_index/_search
{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            {
              "term": {
                "name.application_number": "1451299305289_0120"
              }
            },
            {
              "term": {
                "name.task": "009257"
              }
            },
            {
              "term": {
                "name.path": "stderr"
              }
            }
          ]
        }
      }
    }
  }
}

On a side note, the filtered query is deprecated in ES 2.x; use a bool query with a filter clause instead. Also, the path hierarchy tokenizer might be useful for some other use cases.
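Since the filtered query is deprecated in ES 2.x, the same search can be expressed as a bool query with filter clauses. Below is a small Python sketch that builds such a request body; the field names come from the mapping above, while the helper name `build_query` and its parameters are hypothetical. Sending the body to `log_index/_search` is left to whichever client you use.

```python
import json


def build_query(application_number, task, stream):
    """Build a bool query whose filter clauses must all match (AND)."""
    return {
        "query": {
            "bool": {
                # filter clauses do not contribute to scoring
                "filter": [
                    {"term": {"name.application_number": application_number}},
                    {"term": {"name.task": task}},
                    {"term": {"name.path": stream}},
                ]
            }
        }
    }


body = build_query("1451299305289_0120", "009257", "stderr")
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted as the body of a `GET log_index/_search` request.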

Hope this helps :)

ChintanShah25