
I noticed an unusually large amount of data written to Cosmos DB by a Stream Analytics job on a particular day. The job was not supposed to write that many documents in a day, so I need to check whether duplicate documents were created on that day.

Is there any query, or any other way, to find duplicate records in Cosmos DB?

Antony

2 Answers


It is possible if you know which properties define uniqueness. We had a nasty production issue that caused many duplicate records as well. When we contacted Microsoft Support for help identifying the duplicate documents, they gave us the following query.

Bear in mind: in our case, properties A and B together define uniqueness, so two documents with the same values for A and B are duplicates. You can then use the output of this query to, for example, delete the older duplicates and keep the most recent one (based on `_ts`).

SELECT d.A, d.B FROM
   (SELECT c.A, c.B, COUNT(c._ts) AS counts FROM c
    GROUP BY c.A, c.B) AS d
WHERE d.counts > 1
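The query above returns only the duplicated key values, not the document ids. Once you have fetched the candidate documents, you still need to decide which copies to delete. A minimal Python sketch of that step, assuming illustrative property names `A`/`B` and an in-memory list of documents: group by the uniqueness key, keep the copy with the newest `_ts`, and collect the ids of the rest.

```python
# Sketch: group documents by the properties that define uniqueness
# ("A" and "B" here are illustrative), keep the newest copy per group
# (highest _ts), and return the ids of the older duplicates.
from collections import defaultdict

def find_duplicate_ids(docs, keys=("A", "B")):
    groups = defaultdict(list)
    for doc in docs:
        groups[tuple(doc[k] for k in keys)].append(doc)

    to_delete = []
    for dupes in groups.values():
        if len(dupes) > 1:
            # Sort oldest-first; everything except the last (newest) is a duplicate.
            dupes.sort(key=lambda d: d["_ts"])
            to_delete.extend(d["id"] for d in dupes[:-1])
    return to_delete

docs = [
    {"id": "1", "A": "x", "B": "y", "_ts": 100},
    {"id": "2", "A": "x", "B": "y", "_ts": 200},  # newer copy of the same (A, B)
    {"id": "3", "A": "p", "B": "q", "_ts": 150},
]
print(find_duplicate_ids(docs))  # ['1']
```

The ids returned this way can then be fed to whatever deletion mechanism you use (point deletes, a stored procedure, or a bulk library).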
Mortana
  • How would you list the unique ids of these documents, if you wanted to find the duplicates and then delete them? – Oliver Nilsen Apr 04 '23 at 17:11

Is there any query/any way to find out duplicate records in cosmos DB?

The quick answer is yes. Use the DISTINCT keyword in a Cosmos DB SQL query, and filter on `_ts` (the system-generated Unix timestamp: https://learn.microsoft.com/en-us/azure/cosmos-db/databases-containers-items#properties-of-an-item).

Something like:

SELECT DISTINCT c.X, c.Y, c.Z ... (all properties you want to check) FROM c WHERE c._ts >= <start of the day> AND c._ts < <end of the day>
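Note that `_ts` is a Unix timestamp in seconds, so an equality comparison against "a particular day" will not match anything; you need a range covering the day's start and end. A quick Python sketch of computing those bounds (the date and the property names in the query string are illustrative):

```python
# Sketch: compute the Unix-second bounds of one UTC day for filtering
# on the Cosmos DB system _ts property (the date here is illustrative).
from datetime import datetime, timedelta, timezone

day = datetime(2019, 12, 9, tzinfo=timezone.utc)  # the suspect day
start = int(day.timestamp())
end = int((day + timedelta(days=1)).timestamp())

query = (
    "SELECT DISTINCT c.X, c.Y FROM c "
    f"WHERE c._ts >= {start} AND c._ts < {end}"
)
print(start, end)  # 1575849600 1575936000
```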

Then you could delete the duplicate data using this bulk delete library: https://github.com/Azure/azure-cosmosdb-bulkexecutor-dotnet-getting-started/tree/master/BulkDeleteSample

Jay Gong
  • Suppose I have 3000 records written to Cosmos DB in one day, and 2000 of them are duplicates. If I run the above DISTINCT query, will I get only the 1000 distinct records? The 2000 duplicates will not appear in the output? @Jay Gong – Antony Dec 10 '19 at 09:07
  • @Antony I'm afraid the output is limited by the RU settings, so only 1000 records will show up. The remaining records can be retrieved with the continuation token. Don't increase the RU settings, because that will increase your cost. Could you process the data in two or three parts? Since `_ts` records the specific time when the data streamed into the db, the data can be divided into several parts. – Jay Gong Dec 10 '19 at 09:13
  • But the DISTINCT keyword will retrieve only the unique (non-duplicate) records. How will I find the duplicate records from that? @Jay Gong – Antony Dec 10 '19 at 09:18
  • @Antony, did you manage to find the query? I am currently facing the same issue... – Gaelle Sou Apr 07 '20 at 13:37
  • @GaelleSou I've posted an answer that will help you to identify the duplicates – Mortana Apr 09 '20 at 08:09
  • @Mortana Really appreciate your sharing, will check that. – Jay Gong Apr 09 '20 at 08:12