I'm currently implementing elasticsearch in my Symfony2 application via the FOQElasticaBundle and so far it's been working great based on boosts applied to various fields of my "Story" entity. Here is the config:

foq_elastica:
    clients:
        default: { host: localhost, port: 9200 }

    indexes:
        website:
            client: default
            types:
                story:
                    mappings:
                        title: { boost: 8 }
                        summary: { boost: 5 }
                        text: { boost: 3 }
                        author:
                    persistence:
                        driver: orm # orm, mongodb, propel are available
                        model: Acme\Bundle\StoryBundle\Entity\Story
                        provider:
                            query_builder_method: createIsActiveQueryBuilder
                        listener:
                            service: acme_story.search_index_listener
                        finder:

However I'd like to also apply a boost based on the "published_at" date of the story, so that a story published yesterday would appear in the results before a story published 6 months ago - even if the older story had a slightly better score (obviously this will need a bit of tweaking). Is this possible?

If anyone could let me know how to achieve this using FOQElasticaBundle that would be great, but otherwise I'd appreciate it if you could let me know how to achieve this directly in elasticsearch so I can try and implement the behaviour myself and contribute to the bundle if needs be.

Thanks.

RobMasters
  • Why don't you modify your query so that you orderBy published_at? – Mick Aug 23 '12 at 12:52
  • @Patt Thanks, but which query do you mean? I just ordered the results of the query_builder_method but that had no effect. Then I tried altering the query sent to elasticsearch to be the following: { "query" : { "query_string" : { "query" : "something" } }, "sort" : [ { "publishedAt" : "desc" } ] }. This resulted in the error: " Parse Failure [No mapping found for [publishedAt] in order to sort on". Adding "publishedAt:" under my mappings config didn't help. – RobMasters Aug 23 '12 at 13:30
  • Ok, ignore my previous comment... I was able to get the sort working after adding the 'publishedAt' mapping to my config and running "app/console foq:elastica:populate" to apply the change to elasticsearch. However, this isn't really what I was after: it means that a very partial match to the search term will appear before an exact match that may have been published only seconds beforehand. I definitely think some kind of boost is required based on the 'Recency' of stories, but this seems to be missing from all documentation... – RobMasters Aug 23 '12 at 14:16

3 Answers

Whew, after much experimentation and hours of trawling the Interweb I finally managed to get the desired behavior! (Full credit goes to Clinton Gormley.)

Mapping configuration:

mappings:
    title: { boost: 8 }
    summary: { boost: 5 }
    text: { boost: 3 }
    author:
    publishedAt: { type: date }

Here is the code using the PHP client, Elastica, to dynamically build the query to boost using the original mapping AND the published date:

$query = new \Elastica_Query_Bool();
$query->addMust(new \Elastica_Query_QueryString($queryString));

$ranges = array();
for ($i = 1; $i <= 5; $i++) {
    $date = new \DateTime("-$i month");

    $currentRange = new \Elastica_Query_Range();
    $currentRange->addField('publishedAt', array(
        'boost' => (6 - $i),
        'gte' => $date->getTimestamp()
    ));

    $ranges[] = $currentRange->toArray();
}

$query->addShould($ranges);

/** @var $pagerfanta Pagerfanta */
$pagerfanta = $this->getFinder()->findPaginated($query);

And for those of you more interested in the raw elasticsearch query (only with 3 date ranges for brevity)...

curl -XPOST 'http://localhost:9200/website/story/_search?pretty=true' -d '
{
  "query" : {
    "bool" : {
      "must" : {
        "query_string": {
          "query": "<search term(s)>"
        }
      },
      "should" : [
        {
          "range" : {
            "publishedAt" : {
              "boost" : 5,
              "gte" : "<1 month ago>"
            }
          }
        },
        {
          "range" : {
            "publishedAt" : {
              "boost" : 4,
              "gte" : "<2 months ago>"
            }
          }
        },
        {
          "range" : {
            "publishedAt" : {
              "boost" : 3,
              "gte" : "<3 months ago>"
            }
          }
        }
      ]
    }
  }
}'
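
For comparison, here is a minimal sketch of how the same request body could be generated for any number of ranges. It is written in Python purely for illustration and is not part of Elastica or the bundle. The function name `recency_query` is hypothetical, and the sketch assumes an Elasticsearch version that accepts date math such as `now-1M` in range queries, which avoids computing timestamps by hand. The boost of `months + 1 - i` mirrors the `6 - $i` used in the PHP loop.

```python
import json


def recency_query(search_terms, field="publishedAt", months=5):
    """Build a bool query: the search terms must match, each recency range should match."""
    should = [
        # One range clause per month; newer ranges get a higher boost.
        # Assumption: the server understands date math like "now-1M".
        {"range": {field: {"gte": f"now-{i}M", "boost": months + 1 - i}}}
        for i in range(1, months + 1)
    ]
    return {
        "query": {
            "bool": {
                "must": {"query_string": {"query": search_terms}},
                "should": should,
            }
        }
    }


body = recency_query("something")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of the `_search` request shown above.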
RobMasters
  • Well done @RobMasters. This is perfection! I hope I haven't misguided you too much... The solution you are bringing is really solid stuff! – Mick Sep 06 '12 at 13:22
  • Sub-question: why use "gte" rather than a "from"-"to" range for the dates? – g.annunziata Mar 12 '14 at 14:58
  • @g.annunziata - the reason was that the boosts should be cumulative in order to correctly prioritize the most recent items. i.e. Something added 6 weeks ago will receive the boosts for being greater than or equal to the dates of <2 months ago> and <3 months ago>, whereas something from 10 weeks ago will only receive the gte <3 months ago> boost. This is only an example of what worked for me - you may achieve better results by tweaking the boosting criteria. – RobMasters Mar 13 '14 at 16:11
  • What about performance? I am afraid this will apply the scoring to all your documents. – Thomas Decaux Apr 28 '14 at 09:15
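
To make the cumulative effect discussed in the comments concrete, here is a small sketch of the arithmetic (Python, illustration only). It uses the three ranges and boosts from the raw query, approximates a month as 30 days (an assumption), and simply adds up the boosts of the matching clauses. The real Lucene score also involves normalization, so treat the numbers as relative rather than exact.

```python
# Range clauses from the raw query: (maximum age in days, boost).
# Assumption: one month is approximated as 30 days.
RANGE_BOOSTS = [(30, 5), (60, 4), (90, 3)]


def recency_boost(age_days):
    """Sum the boosts of every "gte" range clause that a document of this age matches."""
    return sum(boost for max_age, boost in RANGE_BOOSTS if age_days <= max_age)


for age in (14, 42, 70, 120):
    print(age, recency_boost(age))
# 14 days  -> 12 (matches all three ranges)
# 42 days  -> 7  (matches the 2-month and 3-month ranges)
# 70 days  -> 3  (matches only the 3-month range)
# 120 days -> 0  (matches none)
```

Because every range is open-ended ("gte" only), a newer story always collects at least as much boost as an older one, which is the point of not using closed "from"-"to" ranges.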

You can use a decay scoring function to lower a document's score as it gets older (the date field is named "pubdate" in this example):

{
  "query": {
    "function_score": {
      "functions": [
        {
          "linear": {
            "pubdate": {
              "origin": 1398673886,
              "scale": "1h",
              "offset": 0,
              "decay": 0.1
            }
          }
        }
      ]
    }
  }
}
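
As a rough guide to what these parameters do, the linear decay can be reproduced by hand. The sketch below (Python, illustration only; `linear_decay` is a hypothetical helper, not part of any client library) follows the formula given in the function_score documentation: the factor is 1 at the origin, drops to `decay` at a distance of `scale`, and reaches 0 at `scale / (1 - decay)`.

```python
def linear_decay(distance, scale, decay, offset=0.0):
    """Linear decay factor; distance, scale and offset must use the same unit."""
    s = scale / (1.0 - decay)             # distance at which the factor reaches 0
    d = max(0.0, abs(distance) - offset)  # distances within `offset` count as 0
    return max((s - d) / s, 0.0)


HOUR = 3600  # seconds
for minutes in (0, 30, 60, 90):
    # Same parameters as the query above: scale 1h, decay 0.1, offset 0.
    print(minutes, round(linear_decay(minutes * 60, HOUR, 0.1), 3))
```

With the values from the example, the factor is 0.1 after one hour and 0 from roughly 67 minutes on, so for stories that stay relevant for months a much larger `scale` (days or weeks) is probably more appropriate.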
Thomas Decaux

A full elasticsearch 5 example based on function_score. See this blogpost and function_score docs for more info.

It boosts more recent entries across multiple date ranges, with varying strengths, on a Gaussian curve rather than with "hard cutoffs".

{
    "query": {
        "function_score": {

            "score_mode": "sum", // All functions outputs get summed
            "boost_mode": "multiply", // The documents relevance is multiplied with the sum

            "functions": [
                {
                    // The relevancy of old posts is multiplied by at least one.
                    // Remove if you want to exclude old posts
                    "weight": 1
                },
                {
                    // Published this month get a big boost
                    "weight": 5,
                    "gauss": {
                        "date": { // <- Change to your date field name
                            "origin": "2017-04-07", // Change to current date
                            "scale": "31d",
                            "decay": 0.5
                        }
                    }
                },
                {
                    // Posts published this year get a boost
                    "weight": 2,
                    "gauss": {
                        "date": { // <- Change to your date field name
                            "origin": "2017-04-07", // Change to current date
                            "scale": "365d",
                            "decay": 0.5
                        }
                    }
                }
            ],

            "query": {
                // The rest of your search here, change to something relevant
                "match": { "title": "< your search string >" }
            }
        }
    }
}
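
To get a feel for how the three functions combine, the multiplier can be worked out by hand. The sketch below (Python, illustration only; the helper names are hypothetical) uses the Gaussian decay formula from the function_score documentation, measures age in days from the origin, and assumes a 365-day scale for the yearly function.

```python
import math


def gauss_decay(age_days, scale_days, decay=0.5):
    """Gaussian decay factor: 1.0 at the origin, exactly `decay` at a distance of `scale_days`."""
    sigma_sq = -scale_days ** 2 / (2.0 * math.log(decay))
    return math.exp(-age_days ** 2 / (2.0 * sigma_sq))


def recency_multiplier(age_days, year_scale_days=365):
    """score_mode "sum": constant weight 1 + monthly term (weight 5) + yearly term (weight 2)."""
    return 1 + 5 * gauss_decay(age_days, 31) + 2 * gauss_decay(age_days, year_scale_days)


def final_score(relevance, age_days):
    """boost_mode "multiply": the query's relevance score times the summed functions."""
    return relevance * recency_multiplier(age_days)


for age in (0, 31, 365, 3650):
    print(age, round(recency_multiplier(age), 3))
```

A brand-new post is multiplied by 8 (1 + 5 + 2), while a very old one tends towards 1, so old posts keep their plain relevance score instead of being excluded.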
Simon Epskamp