2

I need to find duplicate entries on 2 columns out of 5 on a table containing 1 billion rows.

In detail:

Duplicate entries on 2 columns means: column a can have repeated entries and column b can have repeated entries, but both columns considered together cannot have repeated entries.

Reason for this:

I need to find out which duplicate records were erroneously inserted during a backfill of data, as the backfill was done without the table having a primary key.

The table has the following columns:

id, warehouse, quantity, date, updated by

The duplicate entries need to be found on id + warehouse.

I tried using

select id, warehouse
from my_table
group by id, warehouse
having count(*) > 1

This does not give me the duplicates on the combination of id and warehouse.

I am afraid of doing a self join on the table as the operation would take too long on such a big table.

Please help me figure out the fastest way of getting duplicates.

Also, as an added challenge, I need to delete the duplicate entries (keeping only one record of each set of duplicates in the table). Is there a fast way of doing that on such a huge table?

When I try to set a primary key, the query gets stuck on the "copy to tmp table" step for about 48 hours, with a metadata lock on the table preventing any inserts.

Details on the database:

Engine: InnoDB

Server: MySQL

RAM: 7.5 GB

  • Did you even try joining the tables? – Sushant Aug 26 '15 at 07:31
  • Your group-by query seems right, are you sure it doesn't work? – František Žiačik Aug 26 '15 at 07:31
  • so col A can have "AAA AAA" and col B can have "AAA AAA" considered separately, but at the row level it is considered a duplicate ? – Drew Aug 26 '15 at 07:38
  • or do you mean because of the backfill problem without PK, that you need to do a count across multiple rows? It is ambiguous to me at least – Drew Aug 26 '15 at 07:39
  • @Sushant: Please read what I wrote. I am not doing a self join because a simple select count(*) is taking so long. So, do you have a way I can join such that the query does not take too long? – akshitBhatia Aug 26 '15 at 09:16
  • @FrantišekŽiačik: The group by query has been running for 4 hours and is in the "sending data" state. Is there a faster method? Should I add an index on id and warehouse? – akshitBhatia Aug 26 '15 at 09:18
  • An index covering both id and warehouse would massively improve performance. But adding that index might take quite a while. – Kickstart Aug 26 '15 at 09:20
  • @Drew: I need to count across multiple rows. The data feed file I am backfilling from has id+warehouse as the primary key. But since I was backfilling into a table without the primary key (to increase insert speed), some of the backfill jobs overlapped and duplicate data was inserted. I need to find this data by looking at unique combinations of id+warehouse in my db. Did that clear it up? – akshitBhatia Aug 26 '15 at 09:20
  • @Kickstart: Will adding an index on these columns increase the speed of the following command: alter ignore table my_table add primary key pk1 (id,warehouse)? This query takes too long and holds a metadata lock on the table, preventing any other operations. But if it can be done quickly, it is perfect for removing duplicates. – akshitBhatia Aug 26 '15 at 09:23
  • Honestly do not know as I have little knowledge on the processing details of MySQL for adding a key / index. – Kickstart Aug 26 '15 at 09:28
  • I think your best bet is to add the indexes for now. – František Žiačik Aug 26 '15 at 09:33
  • The grouping query you have should identify the offending groups, but it won't solve the problem, so once you have the groups identified I assume you will want to remove the duplications - however you haven't uniquely identified which row or rows within each group to remove, so now you have a follow-on problem (and more processing). What happens if/when `quantity, date, updated by` differ for groups that contain duplicates? Which will you retain, which will you remove? I'd want a better plan – Paul Maxwell Aug 26 '15 at 13:34

3 Answers

2

I have created a solution in MSSQL, but it should run even better in MySQL because MySQL's LIMIT/OFFSET does not require an ORDER BY clause.

It is the slowest solution, but you can run it overnight, stop it in the morning, and on restart it will continue from where it left off. It is also not memory-intensive.

-- Tracks how many times each (id, warehouse) pair has been seen so far.
CREATE TABLE findduplicates(id NVARCHAR(50), warehouse NVARCHAR(50), occurence INT);
CREATE UNIQUE INDEX uniqueIndex ON findduplicates(id, warehouse);

-- Persists the current row offset so the script can be stopped and resumed.
CREATE TABLE integerValue(id INT);
INSERT INTO integerValue VALUES(0);

DECLARE @i INT;
SELECT @i = id FROM integerValue;

DECLARE @id NVARCHAR(50);
DECLARE @warehouse NVARCHAR(50);

-- Fetch the single row at the current offset. duplicatest is the source table
-- (my_table in the question); ordering by id keeps the offset stable across restarts.
SELECT
    @id = id,
    @warehouse = warehouse
FROM
    duplicatest
ORDER BY
    id
OFFSET @i ROWS
FETCH NEXT 1 ROWS ONLY;

WHILE (@id IS NOT NULL)
BEGIN
    -- Record this occurrence of the (id, warehouse) pair.
    IF EXISTS (SELECT TOP 1 1 FROM findduplicates WHERE id = @id AND warehouse = @warehouse)
    BEGIN
        UPDATE findduplicates SET occurence = occurence + 1 WHERE id = @id AND warehouse = @warehouse;
    END
    ELSE
    BEGIN
        INSERT INTO findduplicates VALUES(@id, @warehouse, 1);
    END

    -- Advance the offset, persist it, then fetch the next row.
    SET @id = NULL;
    SET @i = @i + 1;
    UPDATE integerValue SET id = @i;

    SELECT
        @id = id,
        @warehouse = warehouse
    FROM
        duplicatest
    ORDER BY
        id
    OFFSET @i ROWS
    FETCH NEXT 1 ROWS ONLY;
END
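
A rough, untested MySQL-flavoured sketch of the same resumable loop, written as a stored procedure, might look like the following. It assumes MySQL 5.5 or later (where local variables may be used in LIMIT inside stored programs), that findduplicates and integerValue have been created as above, and that my_table stands in for the source table; the procedure name and variable types are illustrative.

DELIMITER //
CREATE PROCEDURE find_duplicates_resumable()
BEGIN
    DECLARE v_id VARCHAR(50);
    DECLARE v_warehouse VARCHAR(50);
    DECLARE v_i INT;

    -- Resume from the last persisted offset.
    SELECT id INTO v_i FROM integerValue;

    scan_loop: LOOP
        SET v_id = NULL;
        -- LIMIT offset, row_count fetches the single row at position v_i.
        SELECT id, warehouse INTO v_id, v_warehouse
        FROM my_table
        LIMIT v_i, 1;

        IF v_id IS NULL THEN
            LEAVE scan_loop;
        END IF;

        -- Relies on the unique index on findduplicates(id, warehouse).
        INSERT INTO findduplicates (id, warehouse, occurence)
        VALUES (v_id, v_warehouse, 1)
        ON DUPLICATE KEY UPDATE occurence = occurence + 1;

        -- Persist the offset after every row so a restart picks up close to where it stopped.
        SET v_i = v_i + 1;
        UPDATE integerValue SET id = v_i;
    END LOOP;
END//
DELIMITER ;

You would run CALL find_duplicates_resumable(); overnight, kill it in the morning, and re-run it later; because the offset is persisted after every row, the scan continues from roughly where it stopped.
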
1

As I understand it, you need to find the duplicate entries and then do something with them. As the commenters have already noted, if you can add an index and run the self join outside of production hours, then I would definitely go ahead and do that.
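
For reference, a minimal sketch of that route, using the grouping query from the question rather than a literal self join (the index name is illustrative, and adding the index on a table of this size will itself take a while, as noted in the comments):

ALTER TABLE my_table ADD INDEX idx_id_warehouse (id, warehouse);

SELECT id, warehouse, COUNT(*) AS occurrences
FROM my_table
GROUP BY id, warehouse
HAVING COUNT(*) > 1;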

If, on the other hand, you can't, then you're effectively facing a bucketing problem. The issue with self joins without indexes is that (in SQL Server at least) a nested loop join will probably be used to produce the result set; I'm assuming here that MySQL will use the same logic. Nested loops are O(n^2), which for a dataset of your size will be very expensive.

If we assume that the cardinality of the warehouse column is relatively small, then the best approach is to split your data into subsets by warehouse and perform the checks on each of these.

In the worst case a select statement will be O(n), i.e. a full table scan, while if there's an index on the warehouse column at all, the select should approximate O(log(n)).

There are then two approaches to solving your issue. 1: Create a new table for each of your warehouses and insert all the data from your main table into these. The time for this should be O(log(n)) × the number of warehouses for the selects and O(1) for each insert into the new tables. Once the new tables are created, create your indexes on each. This would be my preferred approach; a rough sketch follows.
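
A rough MySQL sketch of this, assuming a warehouse value of 'WH1' (the staging table name and warehouse value are illustrative); the same pattern would be repeated for each warehouse:

CREATE TABLE my_table_wh1 LIKE my_table;

INSERT INTO my_table_wh1
SELECT * FROM my_table WHERE warehouse = 'WH1';

ALTER TABLE my_table_wh1 ADD INDEX idx_id (id);

-- duplicates within this one warehouse
SELECT id, COUNT(*) AS occurrences
FROM my_table_wh1
GROUP BY id
HAVING COUNT(*) > 1;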

2: If this is not possible for whatever reason, you could use a code-based approach. The key here would be to use hash sets, which have O(1) time for their operations. I've provided some sample code below in C# to show how this could be done.

    using System.Collections.Generic;
    using System.Linq;

    namespace ConsoleApplication9
    {
        class Program
        {
            /// <summary>
            /// Represents the totality of all records in the database.
            /// </summary>
            static List<WareHouse> wareHouseItemsList = new List<WareHouse>();

            static void Main(string[] args)
            {
                AddWareHouseValues(); // simulate the full table by populating some values

                for (var i = 1; i <= 2; i++)
                {
                    // Simulate retrieving one warehouse's rows from the db
                    // - this is O(log(n)) assuming an index on warehouse id.
                    var individualWarehouseItems = from item in wareHouseItemsList
                        where item.WarehouseID == i
                        select item.ItemID;

                    // Hashset operations are O(1); any id that fails to add is a duplicate.
                    var integerSet = new HashSet<int>();
                    var duplicates = integerSet.AddRange(individualWarehouseItems).ToList();
                    // do something with duplicates ....
                }
            }

            static void AddWareHouseValues()
            {
                wareHouseItemsList.Add(new WareHouse { WarehouseID = 1, ItemID = 1 });  // create a duplicate for WH 1
                wareHouseItemsList.Add(new WareHouse { WarehouseID = 2, ItemID = 11 }); // create a duplicate for WH 2

                for (var j = 1; j < 3; j++)
                {
                    for (var i = 1; i < 20; i++)
                    {
                        wareHouseItemsList.Add(new WareHouse { WarehouseID = j, ItemID = i });
                    }
                }
            }
        }

        public class WareHouse
        {
            public int WarehouseID { get; set; }
            public int ItemID { get; set; }
        }

        public static class Extensions
        {
            /// <summary>
            /// Tries to add a range of integers to a hashset and yields any that were already present.
            /// </summary>
            /// <param name="this">hashset</param>
            /// <param name="items">collection of integers</param>
            /// <returns>the items that could not be added, i.e. the duplicates</returns>
            public static IEnumerable<int> AddRange(this HashSet<int> @this, IEnumerable<int> items)
            {
                foreach (var item in items) // This is O(n), but n here is much smaller than the full dataset
                {
                    if (!@this.Add(item)) // Add() is O(1) and returns false for an item already in the set
                    {
                        yield return item;
                    }
                }
            }
        }
    }
Johnv2020
0

Would doing the following work for you:

INSERT INTO newtable(id,warehouse)
SELECT DISTINCT id,warehouse
FROM my_table

Confirm newtable is correct

DROP TABLE my_table;
RENAME TABLE newtable TO my_table;

It's not pretty, but it is probably quicker than most other solutions.
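
If the other columns need to be carried over as well, a grouped variant along these lines would keep one row per id + warehouse pair. Note that MIN() is applied per column, so the surviving values can mix data from different duplicate rows; the column names follow the question, and updated_by is an assumed spelling:

INSERT INTO newtable (id, warehouse, quantity, `date`, updated_by)
SELECT id, warehouse, MIN(quantity), MIN(`date`), MIN(updated_by)
FROM my_table
GROUP BY id, warehouse;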

Thomas Steven
  • This is extremely inefficient when it comes to memory. I will have to increase the disk space of my DB instance to duplicate the table. 1 billion rows would have approximately 10% duplicate records. So, 90% of the records will be copied. This copy will take too long and take a lot of memory. – akshitBhatia Aug 26 '15 at 11:53
  • and what is the plan for the other columns? – Paul Maxwell Aug 26 '15 at 13:36