The closer the better approach in Lucene

Question

I'm pretty new to Lucene, so forgive me in advance if some of my terminology is wrong.

Lucene offers different types of fields (keyword, text, unstored, unindexed), but it seems it also supports Numeric field, Int field and Float field.

Now, I'm wondering if "the closer the better" functionality exists or is easy to implement in Lucene:

I want the creation_date of a document stored as a Unix timestamp in a float field. Then I want to be able to compare the Unix time given in a query with the indexed Unix time of the documents.

Instead of a range query (which checks whether the value lies between particular bounds) or a boolean query (which checks whether the values are the same), I want to return a sense of similarity based on the time between the two Unix times. If the timespan is small, the document should end up with a higher score than if the timespan is large. Preferably the score should not fall off linearly but, for example, exponentially. So, as the title of this question says: the closer, the better.

I've noticed that Elasticsearch, which uses Lucene at its core, offers decay function scores. Is this the behaviour I'm looking for, and is it available in Lucene?

Lastly, I'm wondering: can one combine this type of scoring with the default tf-idf scoring that is used to query the body of the documents, so that the final score reflects both the timespan between the documents and the textual similarity of their bodies?

score 1 · Accepted Answer · answered Sep 22 '15 at 13:31

I don't think you get it out of the box the way you do in Elasticsearch. You could always add it yourself as a module; decay algorithms of this kind are widely documented online.
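As a rough illustration of what such a decay function might look like, here is a minimal sketch in plain Java with no Lucene dependency. The class and parameter names (TimeDecay, exponential, gaussian, scaleSeconds, decay) are made up for this example and are not part of Lucene. Both functions return 1.0 when the document time equals the query time and fall to `decay` at a distance of `scaleSeconds`. Timestamps are kept as long values here, because a 32-bit float cannot represent present-day Unix times to the second.

```java
final class TimeDecay {
    private TimeDecay() {}

    /** Exponential decay: 1.0 at zero distance, `decay` at a distance of `scaleSeconds`. */
    static double exponential(long docTime, long queryTime, long scaleSeconds, double decay) {
        double distance = Math.abs((double) docTime - (double) queryTime);
        // log(decay) is negative for 0 < decay < 1, so the score shrinks with distance.
        return Math.exp(Math.log(decay) * distance / scaleSeconds);
    }

    /** Gaussian decay: flatter near the query time, then drops off faster. */
    static double gaussian(long docTime, long queryTime, long scaleSeconds, double decay) {
        double distance = Math.abs((double) docTime - (double) queryTime);
        // Choose the variance so that the score equals `decay` at distance == scaleSeconds.
        double sigmaSquared = -((double) scaleSeconds * scaleSeconds) / (2.0 * Math.log(decay));
        return Math.exp(-(distance * distance) / (2.0 * sigmaSquared));
    }

    public static void main(String[] args) {
        long queryTime = 1_442_928_660L; // an example Unix timestamp in seconds
        long day = 86_400L;
        long scale = 7 * day;            // assumed: score halves after one week
        for (long daysAway : new long[] {0, 1, 7, 30}) {
            long docTime = queryTime - daysAway * day;
            System.out.printf("%2d days: exp=%.4f gauss=%.4f%n",
                    daysAway,
                    exponential(docTime, queryTime, scale, 0.5),
                    gaussian(docTime, queryTime, scale, 0.5));
        }
    }
}
```

Which curve and which scale suit you depends on your data; the one-week scale above is only a placeholder.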

You could also use Lucene's boosting and negative boosting in combination with the existing ranking system, and experiment to see whether that gives you the sort of results you want. I am doing that on Apache Solr and it's working like a charm :)

On your last point: a tf-idf module is available in Solr. If it is not already in Lucene, copy it from Solr, add it as a module in Lucene, and combine your own module with the tf-idf module to get a combined result.
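To make the idea of a combined result concrete, here is a small sketch in plain Java that re-ranks hits using a text score and a recency factor. Everything in it is hypothetical: Hit, recency, combine and rerank are names invented for this example, the text score is assumed to have been computed already (for instance by tf-idf), and recencyWeight is an assumed tuning knob. Inside Lucene you would normally do this within the query or scorer rather than after the fact; that part is not shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class CombinedScoring {
    private CombinedScoring() {}

    /** Hypothetical holder for one search hit: id, text relevance score, creation time. */
    static final class Hit {
        final String id;
        final double textScore;
        final long createdAt; // Unix time in seconds

        Hit(String id, double textScore, long createdAt) {
            this.id = id;
            this.textScore = textScore;
            this.createdAt = createdAt;
        }
    }

    /** Half-life decay: 1.0 at the query time, 0.5 one half-life away, and so on. */
    static double recency(long docTime, long queryTime, long halfLifeSeconds) {
        double distance = Math.abs((double) docTime - (double) queryTime);
        return Math.pow(0.5, distance / halfLifeSeconds);
    }

    /** Scales the text score by a blend of 1.0 and the recency factor. */
    static double combine(double textScore, double recency, double recencyWeight) {
        // recencyWeight = 0 ignores time entirely; 1 multiplies the text score by recency.
        return textScore * ((1.0 - recencyWeight) + recencyWeight * recency);
    }

    /** Returns a copy of the hits sorted by combined score, best first. */
    static List<Hit> rerank(List<Hit> hits, long queryTime, long halfLifeSeconds, double recencyWeight) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparingDouble((Hit h) ->
                combine(h.textScore, recency(h.createdAt, queryTime, halfLifeSeconds), recencyWeight))
                .reversed());
        return sorted;
    }

    public static void main(String[] args) {
        long now = 1_442_928_660L;
        long day = 86_400L;
        List<Hit> hits = Arrays.asList(
                new Hit("old-but-relevant", 2.0, now - 90 * day),
                new Hit("recent", 1.5, now - day));
        for (Hit h : rerank(hits, now, 7 * day, 0.5)) {
            System.out.println(h.id);
        }
    }
}
```

Multiplying rather than adding avoids having to normalise the text scores first, which is why it is used in this sketch; a weighted sum is an equally valid choice if your scores are on a known scale.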

answered Sep 22 '15 at 13:31

Mark Stroeven

Could you point me to such a module? I've been struggling to find one. However, I'm not sure how I could use the existing ranking system for this. Wouldn't it by default compare the float value to the other float values and match only those that have the same value? Regarding the tf-idf module, I guess you are referring to the tf-idf similarity class (https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html)? – DJanssens Sep 22 '15 at 13:39
Yes, on the tf-idf part, that is. The wonderful thing about Lucene is that you can basically reconfigure it the way you like. I implemented a support vector machine to achieve this result, but you could search for standard decay algorithms. It really depends on your situation; an example: http://stackoverflow.com/questions/11653545/hot-content-algorithm-score-with-time-decay – Mark Stroeven Sep 22 '15 at 13:43
I believe it is, in a way, what I need. Sadly, there is little decent information on how to implement things like this in Lucene. I have no idea where to start with an algorithm like that: should one create a new Similarity class for it? Would switching to Elasticsearch/Solr solve this for me? Lastly, did you incorporate the SVM while using Solr or Lucene? – DJanssens Sep 22 '15 at 14:31
