This can be accomplished in several ways. Below I outline two possible approaches:
1) If you don't mind generating new `_id` values and reindexing all of the documents into a new index, then you can use Logstash and the fingerprint filter to generate a unique fingerprint (hash) from the fields that you are trying to de-duplicate, and use this fingerprint as the `_id` for documents as they are written into the new index. Since the `_id` field must be unique, any documents that share the same fingerprint will be written to the same `_id` and therefore deduplicated. A sketch of this idea is shown after this list.
2) You can write a custom script that scrolls over your index. As each document is read, create a hash from the fields that you consider to define a unique document (in your case, the `content` field). Then use this hash as the key in a dictionary (aka hash table). The value associated with this key would be a list of the `_id`s of all documents that produce this same hash. Once you have all of the hashes and their associated lists of `_id`s, you can execute a delete operation on all but one of the `_id`s associated with each identical hash. Note that this second approach does not require writing documents to a new index in order to de-duplicate, as you delete documents directly from the original index. A sketch of such a script is also shown below.
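To make the first approach more concrete: the blog post linked below uses Logstash's fingerprint filter for this, but the same fingerprint-as-`_id` idea can be sketched with the official Python client. The connection details, index names (`original_index`, `deduplicated_index`), and field list below are placeholder assumptions, not anything prescribed by the approach itself:

```python
import hashlib

from elasticsearch import Elasticsearch, helpers

# Placeholder connection, index, and field names -- adjust for your setup.
es = Elasticsearch(["http://localhost:9200"])
SOURCE_INDEX = "original_index"
DEDUP_INDEX = "deduplicated_index"
UNIQUE_FIELDS = ["content"]  # fields that define a "unique" document


def fingerprint(source):
    """Hash the de-duplication fields to produce a deterministic _id."""
    hasher = hashlib.sha256()
    for field in UNIQUE_FIELDS:
        hasher.update(str(source.get(field, "")).encode("utf-8"))
    return hasher.hexdigest()


def reindex_actions():
    # helpers.scan wraps the scroll API and yields every document.
    for doc in helpers.scan(es, index=SOURCE_INDEX,
                            query={"query": {"match_all": {}}}):
        yield {
            "_op_type": "index",
            "_index": DEDUP_INDEX,
            "_id": fingerprint(doc["_source"]),  # identical docs share an _id
            "_source": doc["_source"],
        }


# Documents with the same fingerprint overwrite each other, so the new
# index ends up with exactly one document per unique field combination.
helpers.bulk(es, reindex_actions())
```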
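For the second approach, here is a minimal sketch of the scroll-and-delete script, again assuming placeholder index and field names and using SHA-256 as the hash. It keeps the first `_id` seen for each hash and deletes the rest in a single bulk request:

```python
import hashlib

from elasticsearch import Elasticsearch, helpers

# Placeholder connection, index, and field names -- adjust for your cluster.
es = Elasticsearch(["http://localhost:9200"])
INDEX = "my_index"
UNIQUE_FIELDS = ["content"]  # fields that define a "unique" document

# Map each hash to the list of _ids whose de-duplication fields produce it.
ids_by_hash = {}

# Scroll over every document in the index (helpers.scan wraps the scroll API).
for doc in helpers.scan(es, index=INDEX, query={"query": {"match_all": {}}}):
    hasher = hashlib.sha256()
    for field in UNIQUE_FIELDS:
        hasher.update(str(doc["_source"].get(field, "")).encode("utf-8"))
    ids_by_hash.setdefault(hasher.hexdigest(), []).append(doc["_id"])

# Keep the first _id for each hash and delete the remaining duplicates
# directly from the original index.
delete_actions = (
    {"_op_type": "delete", "_index": INDEX, "_id": duplicate_id}
    for ids in ids_by_hash.values()
    for duplicate_id in ids[1:]
)
helpers.bulk(es, delete_actions)
```

Note that this sketch holds the hash-to-`_id` mapping in memory, so for a very large index you may want to process the documents in smaller ranges.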
I have written a blog post and code that demonstrate both of these approaches at the following URL: https://alexmarquardt.com/2018/07/23/deduplicating-documents-in-elasticsearch/
Disclaimer: I am a Consulting Engineer at Elastic.