
I have a DB with news articles, and I am trying to do a little DB cleaning. I want to find all duplicate documents, and the best way, I think, to accomplish this is by using the url field. My documents are structured as follows:

{
    _id:
    author:
    title:
    description:
    url:
    urlToImage:
    publishedAt:
    content:
    summarization:
    source_id:
}

Any help is greatly appreciated


1 Answer


Assume a collection documents with a name field (using name instead of url) containing duplicate values. I have two aggregations which return output that can be used for further processing. I hope you will find this useful.

{ _id: 1, name: "jack" },
{ _id: 2, name: "john" },
{ _id: 3, name: "jim" },
{ _id: 4, name: "john" },
{ _id: 5, name: "john" },
{ _id: 6, name: "jim" }

Note that "john" has 3 occurrences and "jim" has 2.

(1) This aggregation returns the names which have duplicates (more than one occurrence):

db.collection.aggregate( [
  { 
      $group: { 
          _id: "$name", 
          count: { $sum: 1 }
      } 
  },
  { 
      $group: { 
          _id: "duplicate_names", 
          // $DUMMY is a nonexistent field, so names with a single occurrence
          // resolve to a missing value and are not pushed
          names: { $push: { $cond: [ { $gt: [ "$count", 1 ] }, "$_id", "$DUMMY" ] } }
      }
  }
] )

The output:

{ "_id" : "duplicate_names", "names" : [ "john", "jim" ] }


(2) The following aggregation returns only the _id values of the extra duplicate documents, i.e., all but the first _id in each group. For example, the name "jim" has _id values 3 and 6; only 6 appears in the output.

db.collection.aggregate( [
  { 
      $group: { 
          _id: "$name", 
          count: { $sum: 1 }, 
          ids: { $push: "$_id" } 
      } 
  },
  { 
      $group: { 
          _id: "duplicate_ids",  
          // $slice drops the first _id of each group and keeps the extras
          ids: { $push: { $slice: [ "$ids", 1, 9999 ] } } 
      } 
  },
  { 
      $project: { 
          // flatten the array of arrays into a single array of ids
          ids: { 
              $reduce: {
                  input: "$ids", 
                  initialValue: [ ],
                  in: { $concatArrays: [ "$$this", "$$value" ] }
              }
          }
      } 
  }
] )

The output:

{ "_id" : duplicate_ids", "ids" : [ 6, 4, 5 ] }
  • NOTE: The above method is also useful when there are duplicates based upon _multiple fields_. – prasad_ Feb 01 '21 at 10:30
  • I found that this works on small collections. With large collections I added {"allowDiskUse": true} after the first argument, e.g. db.collection.aggregate( [ ], { "allowDiskUse": true } ). – kriscondev Apr 19 '22 at 07:30
  • @KristCont The stages in the pipeline have limited memory allocated during processing. If the memory required exceeds the limit, you will need to use the option `{ "allowDiskUse": true }`. This can happen with large data sets, as in your case. Some related info from the docs: [Aggregation Pipeline Limits](https://www.mongodb.com/docs/manual/core/aggregation-pipeline-limits/). – prasad_ Apr 19 '22 at 08:04