1

How can I count duplicate field values across documents in MongoDB using Java? I have a collection like this. Collection example:

{
    "_id": {
        "$oid": "5fc8eb07d473e148192fbecd"
    },
    "ip_address": "192.168.0.1",
    "mac_address": "00:A0:C9:14:C8:29",
    "url": "https://people.richland.edu/dkirby/141macaddress.htm",
    "datetimes": {
        "$date": "2021-02-13T02:02:00.000Z"
    }
}
{
    "_id": {
        "$oid": "5ff539269a10d529d88d19f4"
    },
    "ip_address": "192.168.0.7",
    "mac_address": "00:A0:C9:14:C8:30",
    "url": "https://people.richland.edu/dkirby/141macaddress.htm",
    "datetimes": {
        "$date": "2021-02-12T19:00:00.000Z"
    }
}
{
    "_id": {
        "$oid": "60083d9a1cad2b613cd0c0a2"
    },
    "ip_address": "192.168.1.5",
    "mac_address": "00:0A:05:C7:C8:31",
    "url": "www.facebook.com",
    "datetimes": {
        "$date": "2021-01-24T17:00:00.000Z"
    }
}

Example query:

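            // Imports needed at the top of the file:
            // import com.mongodb.BasicDBObject;
            // import com.mongodb.DBCursor;
            // import com.mongodb.DBObject;
            // import java.util.Date;

            // An empty query object matches every document in the collection.
            BasicDBObject whereQuery = new BasicDBObject();
            DBCursor cursor = table1.find(whereQuery);
            while (cursor.hasNext()) {
                DBObject obj = cursor.next();
                String ip_address = (String) obj.get("ip_address");
                String mac_address = (String) obj.get("mac_address");
                Date datetimes = (Date) obj.get("datetimes");
                String url = (String) obj.get("url");
                // println accepts a single argument, so join the fields into one string.
                System.out.println(ip_address + ", " + mac_address + ", " + datetimes + ", " + url);
            }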

In Java, how can I find out which "url" values are duplicated, and how many times each one occurs?

prasad_
Kamijou
  • See this similar post: [Find duplicate urls in MongoDB](https://stackoverflow.com/questions/61062508/find-duplicate-urls-in-mongodb). – prasad_ Feb 15 '21 at 08:59

2 Answers

0

If I understand your question correctly, you're trying to find the number of duplicate entries for the field url. You could iterate over all your documents and add each url value to a Set. A Set stores only unique values, so a value that is already in the Set will not be added again. The difference between the number of documents and the number of entries in the Set is therefore the number of duplicate entries for that field.

If you want to know which URLs are non-unique, check the return value of Set.add(Object): it returns false when the given value was already in the Set. When that happens, you have found a duplicate.
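
The Set approach described above could look roughly like the following. This is only a minimal sketch that works on a plain list of URL strings; the class name `UrlDuplicates` and the methods `findDuplicateUrls` and `countRedundant` are hypothetical names chosen for illustration, not part of any MongoDB API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class, for illustration only.
class UrlDuplicates {

    // Returns the URLs that occur more than once, in the order they first repeat.
    static Set<String> findDuplicateUrls(List<String> urls) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new LinkedHashSet<>();
        for (String url : urls) {
            // Set.add returns false when the value is already present.
            if (!seen.add(url)) {
                duplicates.add(url);
            }
        }
        return duplicates;
    }

    // Number of entries that repeat an already-seen URL:
    // total entries minus distinct entries.
    static int countRedundant(List<String> urls) {
        return urls.size() - new HashSet<>(urls).size();
    }

    public static void main(String[] args) {
        // The three url values from the example collection in the question.
        List<String> urls = Arrays.asList(
                "https://people.richland.edu/dkirby/141macaddress.htm",
                "https://people.richland.edu/dkirby/141macaddress.htm",
                "www.facebook.com");
        System.out.println(countRedundant(urls));    // prints 1
        System.out.println(findDuplicateUrls(urls)); // prints [https://people.richland.edu/dkirby/141macaddress.htm]
    }
}
```

With the cursor loop from the question, the list would presumably be filled by adding `(String) obj.get("url")` for each document before calling these methods.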

MaxRumford
0

In MongoDB you can solve this problem with an aggregation pipeline, which you then need to implement with the MongoDB Java Driver. The pipeline below returns only the duplicated URLs, each with its duplicate count.

db.getCollection('table1').aggregate([
    {
        "$group": {
            // group by url and calculate count of duplicates by url 
            "_id": "$url",
            "url": {
                "$first": "$url"
            },
            "duplicates_count": {
                "$sum": 1
            },
            "duplicates": {
                "$push": {
                    "_id": "$_id",
                    "ip_address": "$ip_address",
                    "mac_address": "$mac_address",
                    "url": "$url",
                    "datetimes": "$datetimes"
                }
            }
        }
    },
    {   // keep only the groups whose duplicates count is greater than 1
        "$match": {
            "duplicates_count": {
                "$gt": 1
            }
        }
    },
    {
        "$project": {
            "_id": 0
        }
    }
]);

Output:

{
    "url" : "https://people.richland.edu/dkirby/141macaddress.htm",
    "duplicates_count" : 2.0,
    "duplicates" : [ 
        {
            "_id" : ObjectId("5fc8eb07d473e148192fbecd"),
            "ip_address" : "192.168.0.1",
            "mac_address" : "00:A0:C9:14:C8:29",
            "url" : "https://people.richland.edu/dkirby/141macaddress.htm",
            "datetimes" : {
                "$date" : "2021-02-13T02:02:00.000Z"
            }
        }, 
        {
            "_id" : ObjectId("5ff539269a10d529d88d19f4"),
            "ip_address" : "192.168.0.7",
            "mac_address" : "00:A0:C9:14:C8:30",
            "url" : "https://people.richland.edu/dkirby/141macaddress.htm",
            "datetimes" : {
                "$date" : "2021-02-12T19:00:00.000Z"
            }
        }
    ]
}
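
To check the grouping logic without a database, the `$group` and `$match` stages can be mirrored in plain Java. Note that this is an in-memory sketch only, not the MongoDB Java Driver call the answer refers to; the class `UrlDuplicateCounts` and the method `duplicateCounts` are hypothetical names, and the input is assumed to be the list of url values already read from the collection.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, for illustration only.
class UrlDuplicateCounts {

    // Counts occurrences per URL and keeps only URLs seen more than once.
    static Map<String, Integer> duplicateCounts(List<String> urls) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String url : urls) {
            // Plays the role of "$group" with "$sum": 1.
            counts.merge(url, 1, Integer::sum);
        }
        // Plays the role of "$match" with "$gt": 1.
        counts.values().removeIf(count -> count <= 1);
        return counts;
    }

    public static void main(String[] args) {
        // The three url values from the example collection in the question.
        List<String> urls = Arrays.asList(
                "https://people.richland.edu/dkirby/141macaddress.htm",
                "https://people.richland.edu/dkirby/141macaddress.htm",
                "www.facebook.com");
        // prints {https://people.richland.edu/dkirby/141macaddress.htm=2}
        System.out.println(duplicateCounts(urls));
    }
}
```

This loads every url value into application memory, so for large collections the server-side pipeline is likely the better fit.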