
We are trying to design a recommender system for documents that are constantly being updated. More precisely, the documents are streams to which text usually gets appended.

Initially we planned to use Lucene + Solr, but that works well mostly for static documents. Lucene updates a document by deleting it first and then reindexing it in full. So if documents are updated frequently, indexing gets slower as the corpus size and the average document size increase.
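To make the cost concrete, here is a minimal sketch in plain Python (not Lucene code). `ToyIndex` and `work_for_appends` are hypothetical names, and the toy index only imitates the delete-then-re-add behaviour described above. It counts how many tokens get analyzed when a single stream is appended to repeatedly.

```python
from collections import defaultdict


class ToyIndex:
    """Toy inverted index; update() imitates delete-then-re-add semantics."""

    def __init__(self):
        self.postings = defaultdict(set)   # token -> set of doc ids
        self.docs = {}                     # doc id -> full text
        self.tokens_analyzed = 0           # running count of indexing work

    def update(self, doc_id, text):
        # Step 1: remove every posting of the old version.
        for token in self.docs.get(doc_id, "").split():
            self.postings[token].discard(doc_id)
        # Step 2: re-analyze and re-add the whole new version.
        for token in text.split():
            self.postings[token].add(doc_id)
            self.tokens_analyzed += 1
        self.docs[doc_id] = text


def work_for_appends(n_appends, tokens_per_append=10):
    """Return the tokens analyzed when one stream is appended to n_appends times."""
    index = ToyIndex()
    text = ""
    for i in range(n_appends):
        text += " ".join(f"w{i}_{j}" for j in range(tokens_per_append)) + " "
        index.update("stream-1", text)
    return index.tokens_analyzed
```

With 10 appends of 10 tokens each, only 100 tokens of new text arrive, but `work_for_appends(10)` reports 550 tokens analyzed, because every update re-processes everything appended so far. The total grows quadratically with the number of appends.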

We were also tempted to build our own solution, but gave up after prototyping because we were drifting towards re-inventing information retrieval functionality that is already implemented quite well in Lucene. Does anyone have experience building this kind of system by integrating open-source search and machine-learning tools?

khrist safalhai

1 Answer


To update the value of a field without re-indexing the whole document, you can use updatable DocValues. They are described in the following blog post: http://shaierera.blogspot.com/2014/04/updatable-docvalues-under-hood.html

Ivan Mamontov
  • In our case it's the single field "content" that makes up most of the document, and it is this field that text is constantly being appended to. Is there any way to get the newly appended data into Lucene's data structures without causing the whole "content" field to be reindexed? – khrist safalhai Mar 24 '15 at 16:16
  • Do you really need Lucene? Lucene does not support this feature because a term (field + value) is the smallest unit of work. As a workaround you can logically divide your field into several fields, or you can use a dynamic field for every update and search in a copyField http://stackoverflow.com/questions/6213184/solr-search-query-for-dynamic-fields-indexed but I don't think this solution will be effective. – Ivan Mamontov Mar 24 '15 at 17:12
  • Yes, a term is the smallest unit of work, but as I understand it a term consists of field + token, not field + value, since the value is a collection of tokens that get indexed. If we set Lucene aside, could you suggest another approach to building a recommendation engine for constantly appended documents? – khrist safalhai Mar 25 '15 at 17:15
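
The workaround from the comments (logically dividing the content so that an append only indexes the new text) might look roughly like the sketch below. It is plain Python rather than Lucene or Solr code, and `ChunkedIndex`, `append` and `search` are hypothetical names chosen for illustration. Each append is stored as its own small chunk, and chunk-level hits are collapsed back to stream ids at query time.

```python
from collections import defaultdict


class ChunkedIndex:
    """Toy index where each append is stored as its own small chunk."""

    def __init__(self):
        self.postings = defaultdict(set)      # token -> {(stream_id, chunk_no)}
        self.chunk_counts = defaultdict(int)  # stream_id -> chunks so far
        self.tokens_analyzed = 0              # running count of indexing work

    def append(self, stream_id, new_text):
        # Only the newly appended text is analyzed; old chunks are untouched.
        chunk_no = self.chunk_counts[stream_id]
        self.chunk_counts[stream_id] += 1
        for token in new_text.lower().split():
            self.postings[token].add((stream_id, chunk_no))
            self.tokens_analyzed += 1

    def search(self, token):
        # Collapse chunk-level hits to stream ids, ranked by matching chunks.
        hits = defaultdict(int)
        for stream_id, _ in self.postings.get(token.lower(), ()):
            hits[stream_id] += 1
        return sorted(hits, key=lambda s: (-hits[s], s))


index = ChunkedIndex()
index.append("a", "lucene reindexes whole documents")
index.append("a", "appends to lucene streams")
index.append("b", "lucene search")
print(index.search("lucene"))  # ['a', 'b']
```

Indexing work here is proportional to the appended text only. The trade-offs are the ones hinted at in the comments: the number of chunks grows with every append, and a phrase that straddles two appends will not match inside any single chunk, so some periodic merging of chunks would probably be needed.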