How to combine Elasticsearch function score query and text proximity scoring

Question

I'd like to use a function score query together with weighted text proximity scoring, but the query does not correctly calculate the score of the "match_phrase" clause in "query.function_score.functions".

For example, let's say I'm building a curation media site and want to add a banner link to "Financial articles in 2017".

I'd like to filter and score as follows:

Filter
- Articles must have been created in 2017.
- The category must be "finance".
Scoring
- The more "favorite" counts an article has, the higher its score.
- If the article has a comment within the last month, it gets a higher score.
- If the article has certain tags, it gets a higher score.
  - (the list of tags to match might be more than 100 words)

The data has the following preconditions:

Precondition
- The dataset has more than 2 million documents.
- Articles must have one "category".
- Articles might have one or more "tags".
  - There might be more than 1000 tags in a single article.
- "tags_text" is a string containing the tags in alphabetical order, joined by whitespace.
  - ref: Finding most similar arrays of integers in elasticsearch
- "favorite" is the number of people who marked the article as a "favorite" (e.g. the Facebook Like button).
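
To illustrate the "tags_text" precondition, here is a minimal Python sketch of how that field could be derived from "tags". The helper name build_tags_text is hypothetical and only for illustration; the actual indexing code is not part of this question.

```python
def build_tags_text(tags):
    """Join the tags in alphabetical order, separated by single spaces."""
    return " ".join(sorted(tags))


# Tags taken from articles 1 and 3 in the example data.
print(build_tags_text(["fintech", "uk", "london"]))
print(build_tags_text(["bitcoin", "bubble", "btc", "mtgox", "wizsec"]))
```

For the example articles, this produces the same strings that are stored in "tags_text".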

Example data and query

// create index
curl -XPUT 'http://localhost:9200/blog'

Then index the articles:

// create articles
curl -XPUT http://localhost:9200/blog/article/1 -d '
{
  "article_id": 1,
  "title": "Fintech company list in London",
  "tags": ["fintech", "uk", "london"],
  "tags_text": "fintech london uk",
  "category": "finance",
  "created_at": "2016-12-01T00:00:00Z",
  "last_comment_at": null,
  "favorite": 100
}'

curl -XPUT http://localhost:9200/blog/article/2 -d '
{
  "article_id": 2,
  "title": "World economy",
  "tags": ["world", "economy", "regression", "war"],
  "tags_text": "economy regression war world",
  "category": "finance",
  "created_at": "2017-02-15T00:00:00Z",
  "last_comment_at": "2017-11-01T00:00:00Z",
  "favorite": 20
}'

curl -XPUT http://localhost:9200/blog/article/3 -d '
{
  "article_id": 3,
  "title": "Bitcoin bubble",
  "tags": ["bitcoin", "bubble", "btc", "mtgox", "wizsec"],
  "tags_text": "bitcoin btc bubble mtgox wizsec",
  "category": "finance",
  "created_at": "2017-08-03T00:00:00Z",
  "last_comment_at": null,
  "favorite": 50
}'

curl -XPUT http://localhost:9200/blog/article/4 -d '
{
  "article_id": 4,
  "title": "Virtual currency in China",
  "tags": ["bitcoin", "ico", "china"],
  "tags_text": "bitcoin china ico",
  "category": "finance",
  "created_at": "2017-09-03T00:00:00Z",
  "last_comment_at": null,
  "favorite": 10
}'

curl -XPUT http://localhost:9200/blog/article/5 -d '
{
  "article_id": 5,
  "title": "Average FX rate in 2017-10",
  "tags": ["fx", "currency", "doller"],
  "tags_text": "currency doller fx",
  "category": "finance",
  "created_at": "2017-11-01T00:00:00Z",
  "last_comment_at": null,
  "favorite": 10
}'

curl -XPUT http://localhost:9200/blog/article/6 -d '
{
  "article_id": 6,
  "title": "Cat and Dog",
  "tags": ["pet", "cat", "dog", "family"],
  "tags_text": "cat dog family pet",
  "category": "pet",
  "created_at": "2017-11-02T00:00:00Z",
  "last_comment_at": null,
  "favorite": 500
}'

Then execute the query:

curl -XGET 'http://localhost:9200/blog/article/_search' -d '
{
  "_source": {
    "includes": ["article_id", "title", "tags_text"]
  },
  "query": {
    "function_score": {
      "functions": [
        {
          "field_value_factor": {
            "factor": 1,
            "modifier": "log",
            "field": "favorite"
          },
          "weight": 0.3
        },
        {
          "filter": {
            "range": {
              "last_comment_at": {
                "from": "now-30d",
                "to": null,
                "include_lower": true,
                "include_upper": false
              }
            }
          },
          "weight": 0.3
        },
        {
          "filter": {
            "match_phrase": {
              "tags_text": {
                "query": "bitcoin fintech smartphone",
                "slop": 100
              }
            }
          },
          "weight": 0.4
        }
      ],
      "query": {
        "bool": {
          "filter": [
            {"term": {"category": "finance"} },
            {
              "range": {
                "created_at": {
                  "from": "2017-01-01T00:00:00",
                  "to": "2017-12-31T23:59:59",
                  "include_lower": true,
                  "include_upper": true
                }
              }
            }
          ],
          "must": {
            "match_all": {}
          }
        }
      },
      "score_mode": "sum"
    }
  }
}'

The results are as follows:

{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 0.69030905,
    "hits": [
      {
        "_index": "blog",
        "_type": "article",
        "_id": "2",
        "_score": 0.69030905,
        "_source": {
          "article_id": 2,
          "tags_text": "economy regression war world",
          "title": "World economy"
        }
      },
      {
        "_index": "blog",
        "_type": "article",
        "_id": "3",
        "_score": 0.509691,
        "_source": {
          "article_id": 3,
          "tags_text": "bitcoin btc bubble mtgox wizsec",
          "title": "Bitcoin bubble"
        }
      },
      {
        "_index": "blog",
        "_type": "article",
        "_id": "5",
        "_score": 0.3,
        "_source": {
          "article_id": 5,
          "tags_text": "currency doller fx",
          "title": "Average FX rate in 2017-10"
        }
      },
      {
        "_index": "blog",
        "_type": "article",
        "_id": "4",
        "_score": 0.3,
        "_source": {
          "article_id": 4,
          "tags_text": "bitcoin china ico",
          "title": "Virtual currency in China"
        }
      }
    ]
  }
}

I checked the results with "explain", but the "match_phrase" query on the "tags_text" field does not seem to affect the score at all.

How can I combine weighted similarity scoring with a function score query? (I tested this on ES v2.4.0.)
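
The returned scores can be reproduced by hand, which helps narrow down which functions contribute. The Python sketch below is only an illustration under stated assumptions: the function name expected_score is hypothetical, the "match_all" query is assumed to score 1.0 with the default multiplicative boost mode, and article 2 is assumed to fall inside the "now-30d" window at query time. Note that a "filter" inside "functions" only decides whether its fixed "weight" is applied; it does not produce a proximity score of its own.

```python
import math

# Weights copied from the three entries in "functions".
W_FAVORITE = 0.3
W_RECENT_COMMENT = 0.3
W_TAGS = 0.4


def expected_score(favorite, recent_comment, tags_filter_matched):
    """Sum of the function scores ("score_mode": "sum") for one article.

    Assumes the base query score is 1.0, so the sum is the final score.
    """
    # field_value_factor with modifier "log" and factor 1: log10(1 * favorite).
    score = W_FAVORITE * math.log10(1 * favorite)
    # A function with a "filter" adds its fixed weight only when the filter matches.
    if recent_comment:
        score += W_RECENT_COMMENT
    if tags_filter_matched:
        score += W_TAGS
    return score


# (article_id, favorite, commented within the last 30 days) for the four hits.
hits = [(2, 20, True), (3, 50, False), (5, 10, False), (4, 10, False)]
for article_id, favorite, recent in hits:
    # The tags filter is treated as not matched for every hit.
    print(article_id, round(expected_score(favorite, recent, False), 6))
```

With the tags filter treated as unmatched, these values agree with the "_score" values in the results. That would be consistent with the third function never being applied: "match_phrase" requires every query term to be present, even with a large "slop", and none of the example articles contains all of "bitcoin", "fintech", and "smartphone".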

Unfortunately, no. I still want to do it, so I'll repost this in ES community forum. — evalphobia, Jan 09 '19 at 02:42
@HugoLaplace Reposted here https://discuss.elastic.co/t/how-to-combine-elasticsearch-function-score-query-and-text-proximity-scoring-with-weight/163651 — evalphobia, Jan 10 '19 at 03:40
I ended using fuzzy Levenshtein distance on elasticsearch part combined with regexp & keras neural network on app part to solve my particular problem. — Hugo Laplace, Jan 10 '19 at 10:39
