Apache Solr 5 - deduplicating data within a field

Question

Here is my question (pardon the wordiness): I have millions of documents and all of them are unique.

However, every document contains a 'description' field, and its text has only a few distinct variations across all 10 million documents. This field is large-ish, 400-800 words or so.

What is the most appropriate way to eliminate this repetition of data in the 'description' field?

Let me elaborate. Here is an example schema that has been simplified:

Doc_id           <-- this is unique
Title                <-- always unique as well
Description    <-- contains mostly dupe data

I search over both the title and description but only return the title itself.

I'm fairly new to Solr but have been unable to find any information on how to tackle a scenario like this. In case it matters, I'm running Solr 5 on Ubuntu.

Thanks for any help!

Have you set indexed=true on all three fields? – Swaraj Apr 04 '15 at 08:02
@swaraj - yes, but what does that have to do with anything? – Jeremy Apr 04 '15 at 14:28

score 0 · Answer 1 · edited May 23 '17 at 12:06

I will try to provide some strategies to tackle your problem.

You say that you search over title and description, so both fields need indexed=true in your schema.xml. Only title is returned, so only title needs stored=true; description can be set to stored=false, which means its original text is not kept for retrieval. See this posting for more information on stored vs. indexed: Solr index vs stored
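As a rough sketch of the indexed/stored settings described above, the field definitions could look like the following. The field names are taken from the question (lowercased); the "string" and "text_general" field types are assumptions borrowed from the stock example schema, so substitute whatever types your schema actually defines.

```xml
<!-- schema.xml (sketch): field names follow the question, types are assumed -->

<!-- unique key: indexed so it can be looked up, stored so it can be returned -->
<field name="doc_id" type="string" indexed="true" stored="true" required="true"/>

<!-- searched and returned -->
<field name="title" type="text_general" indexed="true" stored="true"/>

<!-- searched only: indexed, but the original text is not stored -->
<field name="description" type="text_general" indexed="true" stored="false"/>
```

Changing stored or indexed on an existing field only affects documents indexed afterwards, so a full reindex would be needed for the change to apply to everything.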
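Another option, if you do need to store a field, is field compression. The schema documentation describes gzip compression for certain field types, such as TextField and StrField; see https://wiki.apache.org/solr/SchemaXml for more info. As far as I know, though, that per-field compressed attribute belongs to older releases, and since Solr 4.1 stored fields are compressed by default, so on Solr 5 there may be nothing extra to configure.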
Lastly, deduplication is supported in Solr, see: https://wiki.apache.org/solr/Deduplication. I have not tried this feature, but from the sounds of it, you can either prevent (near-)duplicate documents from being indexed or tag the duplicates. Note that it works at the document level rather than inside a single field. Maybe its goal "Allow for both duplicate collapsing in search results as well as deduplication on adding a document." is what you are looking for?
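For reference, the deduplication feature is configured as an update processor chain in solrconfig.xml. The following is only a sketch of how it might be set up to tag documents with a hash of their description, not something I have run against this data. The chain name "dedupe" and the field name "description_sig" are hypothetical; the signature field would also have to be declared in schema.xml as an indexed string field.

```xml
<!-- solrconfig.xml (sketch): chain name and signature field name are hypothetical -->
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <!-- field that receives the computed hash -->
    <str name="signatureField">description_sig</str>
    <!-- false: only tag documents; true would delete earlier documents
         that have the same signature -->
    <bool name="overwriteDupes">false</bool>
    <!-- fields the hash is computed from -->
    <str name="fields">description</str>
    <!-- exact-match hash; solr.processor.TextProfileSignature is the fuzzy variant -->
    <str name="signatureClass">solr.processor.Lookup3Signature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The chain is selected per update request with the update.chain=dedupe parameter. Because every document in the question is unique apart from its description, overwriteDupes should stay false here: with true, documents sharing a description would replace one another.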
