We are considering a schema with two multi-valued fields. Search is performed on the first field, but sorting should be done on the second field, using the corresponding value. E.g. if documents match because of the n-th value in the first field (where n may be different for each match), then they should be returned sorted by the n-th value in the second field.

Is that possible?

Background: each document has a list of similar documents (IDs) and a corresponding list of similarity scores (value between 0 and 1). Given ID 42, we need to return all similar documents (e.g. documents with 42 in the first field), sorted by their similarity to document 42.

Other schemas we are considering are:

  1. Dynamic fields for each ID, so we can sort by the field Similarity_ID42 when searching for documents similar to 42. This does not seem to scale: at 800K+ documents, CPU usage goes to 100% during indexing.
  2. A single multi-valued field storing "ID.score" as a decimal (e.g. 42.563), then searching for all documents that have a value > 42 AND < 43 and sorting by that value (I'm not even sure this is possible).
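To make option 2 concrete, here is a minimal Python sketch of the "ID.score" encoding on its own. It does not talk to Solr, and the helper names (`encode`, `score_for`) and the sample documents are hypothetical. It only shows what the range match and the per-match sort value would have to look like.

```python
from decimal import Decimal

def encode(similar_id: int, score: str) -> Decimal:
    """Pack an ID and a score into one decimal, e.g. 42 and "0.563" -> 42.563."""
    s = Decimal(score)
    # A score of exactly 1 would become 43.0 and collide with the next ID.
    if not (Decimal(0) <= s < Decimal(1)):
        raise ValueError("score must be in [0, 1)")
    return Decimal(similar_id) + s

def score_for(values, similar_id: int):
    """Return the score stored for similar_id in a multi-valued field, or None."""
    for v in values:
        if similar_id <= v < similar_id + 1:
            return v - similar_id
    return None

# Hypothetical documents, each with one multi-valued field of encoded values.
docs = {
    "A": [encode(42, "0.563"), encode(7, "0.9")],
    "B": [encode(42, "0.81"), encode(99, "0.2")],
    "C": [encode(7, "0.4")],
}

# Range match on ID 42, then sort by the one value that matched.
matches = {doc: score_for(values, 42) for doc, values in docs.items()}
ranked = sorted(
    (doc for doc, score in matches.items() if score is not None),
    key=lambda doc: matches[doc],
    reverse=True,
)
```

The range match is the easy part. The sort needs the one value inside the range to be picked out of each document's multi-valued field, and that is the step in doubt on the Solr side. Note also that a strict `> 42` would miss a score of exactly 0, and a score of exactly 1 cannot be encoded this way.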
Michiel van Oosterhout
  • About the alternatives you are considering, I am not sure if they will help you. Would you shed some light on what you want to present to the user? Will you have a list of documents and alongside each document another list of similar documents? – cheffe Dec 30 '13 at 10:44
  • I simply need to retrieve from Solr the list of documents similar to a document (with ID 42 in my examples), sorted by their similarity score. I can't seem to find a way to do the sorting in Solr, as dynamic fields don't seem to scale beyond a certain point. – Michiel van Oosterhout Dec 30 '13 at 13:34
  • As it got lengthy, I added an alternative approach to my answer. – cheffe Dec 30 '13 at 13:53

1 Answer

This approach will not succeed: you can search on a multivalued field, but you cannot sort by one. This is pointed out in "Sorting with Multivalued Field in Solr" and stated in Solr's Wiki:

Sorting can be done on the "score" of the document, or on any multiValued="false" indexed="true" field provided that field is either non-tokenized (ie: has no Analyzer) or uses an Analyzer that only produces a single Term (ie: uses the KeywordTokenizer)

Update

Regarding the alternatives: since you need to find similar documents for one given ID, why not create a second core with a schema like this?

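<fields>
    <!-- ID of the document this entry belongs to -->
    <field name="doc_id" type="int" indexed="true" stored="true" />
    <!-- ID of the document it is similar to -->
    <field name="similar_to_id" type="int" indexed="true" stored="true" />
    <!-- Numeric type, so that sorting is numeric rather than lexicographic -->
    <field name="similarity" type="float" indexed="true" stored="true" />
</fields>

<types>
    <fieldType name="int" class="solr.TrieIntField"/>
    <fieldType name="float" class="solr.TrieFloatField"/>
</types>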

Then you could run a second query after performing the actual search:

q=similar_to_id:42&sort=similarity desc
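As a rough illustration, the second query could be built from client code as in the Python sketch below. The base URL, the core name `similarities`, and the helper `similar_docs_url` are assumptions made for the example. The parameters `q`, `sort`, `fl`, `rows` and `wt` are standard Solr query parameters. Solr's field query syntax is `field:value`, and `sort` needs an explicit direction.

```python
from urllib.parse import urlencode

def similar_docs_url(base_url: str, doc_id: int, rows: int = 10) -> str:
    """Build a select URL for documents similar to doc_id, best match first."""
    params = {
        "q": f"similar_to_id:{doc_id}",  # field:value syntax
        "sort": "similarity desc",       # highest similarity first
        "fl": "doc_id,similarity",       # return only the fields we need
        "rows": rows,
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"

# Hypothetical host and core name.
url = similar_docs_url("http://localhost:8983/solr/similarities", 42)
print(url)
```

The sort is only numeric if `similarity` is indexed as a numeric type; on a string field, the values are compared lexicographically.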

cheffe
  • Good suggestion, although I forgot to mention that we need the ability to further filter similar documents in the original core (which has all the regular document fields). With a multi-core, multi-query solution, it becomes quite complex, and you will end up sending lots of IDs in one of the queries. – Michiel van Oosterhout Dec 30 '13 at 14:46
  • Hm, and I bet there is a good reason that you do not use Solr's built-in [more like this](https://cwiki.apache.org/confluence/display/solr/MoreLikeThis)? – cheffe Dec 31 '13 at 00:57
  • Yes, similarity is based on collaborative filtering. The main problem is how to enable sorting by similarity score in the schema. Anyway, not possible is a valid answer, and – Michiel van Oosterhout Dec 31 '13 at 17:12