I have a fairly large table (~20 million rows) with a primary key column and another column holding large text strings (~250 characters). The primary key values are all unique, but the text strings have repeats. I want to remove all redundant text strings and "distinctify" the table, or create a new table to the same effect: only rows with distinct text strings should remain, and I don't care which of the several primary keys associated with each URL is thrown away. Even running a select distinct on only the string column results in a memory overflow. What other options are there?
- You can try a `group by` instead . . . and then try an index on the column . . . otherwise, I think you might need a staged process of some sort. – Gordon Linoff Jul 15 '15 at 21:12
- Do you have any idea what proportion of the table are duplicates? – Martin Smith Jul 15 '15 at 21:20
- @MartinSmith I'm not sure, but it's more than half – Milo Hou Jul 15 '15 at 21:51
- Really, a select distinct on the other column results in a memory overflow? Please post the actual message. – paparazzo Jul 15 '15 at 22:17
4 Answers
Your memory overflow is likely due to returning such a large result set back to SSMS. If you instead write the results straight into a table with
Select min(ID), TextColumn Into TableXYZ Group By TextColumn
you should get around this issue. Once the 'distinctified' results are in a separate table, you can then go about deleting/archiving the records in the original table.
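A minimal sketch of the full flow, assuming the source table is named OriginalTable (that name is not given in this answer; the FROM clause is filled in here as an assumption):

-- Write one row per distinct text value into a new table, keeping the lowest ID
SELECT MIN(ID) AS ID, TextColumn
INTO TableXYZ
FROM OriginalTable
GROUP BY TextColumn;

-- Then remove every original row whose ID was not kept in the deduplicated copy
DELETE o
FROM OriginalTable AS o
WHERE NOT EXISTS (SELECT 1 FROM TableXYZ AS d WHERE d.ID = o.ID);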

At one level this is the same question as that asked in "How can I remove duplicate rows?". You could use the ROW_NUMBER approach, but this will require sorting the 20 million rows. Or you could use the GROUP BY approach, which may be less memory demanding if it uses a hash aggregate.
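A minimal sketch of the ROW_NUMBER variant, using the Id / CharColumn / OriginalTable names from the code below rather than anything given in the question:

-- Number the rows within each group of identical strings, keeping the lowest Id,
-- then delete every other row in the group
;WITH Numbered AS
(
    SELECT Id,
           ROW_NUMBER() OVER (PARTITION BY CharColumn ORDER BY Id) AS rn
    FROM OriginalTable
)
DELETE FROM Numbered
WHERE rn > 1;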
Another more offbeat approach that might be considered here, as you are anticipating less than 50% of the table will be retained, would be to create a new table as below
CREATE TABLE Deduped
(
    Id INT,
    CharColumn VARCHAR(255) PRIMARY KEY CLUSTERED WITH (IGNORE_DUP_KEY = ON)
)
and insert all rows into it
DECLARE @I INT = 2000000000;

INSERT INTO Deduped
SELECT TOP(@I) Id, CharColumn
FROM OriginalTable
OPTION (OPTIMIZE FOR (@I = 0));
This may well avoid any memory-consuming operators at all, as duplicates are discarded using the B-tree being created.

SELECT column2_text, COUNT(*)
FROM yourtable
GROUP BY column2_text
HAVING COUNT(*) > 1

After this you will get the text values that occur more than once; then you can run:

DELETE FROM yourtable
WHERE column1_primary NOT IN
(
    SELECT MIN(column1_primary)
    FROM yourtable
    GROUP BY column2_text
)

This will remove all the duplicate text from your table, and the table left behind will contain only distinct strings; alternatively, you can use an insert select to copy the kept rows into a new table. I have tested this on a 7 million record table and it takes around 1 minute to select the rows.
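A minimal sketch of the insert select alternative mentioned above, keeping one row per string; the target table newtable is a hypothetical name and must already exist with the same two columns:

-- Copy one row per distinct text value, keeping the lowest primary key
INSERT INTO newtable (column1_primary, column2_text)
SELECT MIN(column1_primary), column2_text
FROM yourtable
GROUP BY column2_text;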

Better than finding and deleting could be inserting into a new table: you could directly SELECT your distinct result into a new table. If something like COUNT() OVER or GROUP BY won't work, you could still use a CURSOR and do this one row at a time. Use a unique key and BEGIN TRY...CATCH.
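A minimal sketch of the cursor fallback, assuming the Id / CharColumn / OriginalTable names used elsewhere on this page and a hypothetical target table DedupTarget with a unique key on CharColumn that raises an error for repeated strings (all of these names are assumptions):

-- Walk the source table one row at a time; the unique key on the target
-- rejects repeated strings, and TRY...CATCH absorbs those violations
DECLARE @Id INT, @Text VARCHAR(255), @Skipped INT = 0;

DECLARE dedup_cursor CURSOR FAST_FORWARD FOR
    SELECT Id, CharColumn FROM OriginalTable;

OPEN dedup_cursor;
FETCH NEXT FROM dedup_cursor INTO @Id, @Text;

WHILE @@FETCH_STATUS = 0
BEGIN
    BEGIN TRY
        INSERT INTO DedupTarget (Id, CharColumn) VALUES (@Id, @Text);
    END TRY
    BEGIN CATCH
        -- duplicate key violation: count it and move on
        SET @Skipped = @Skipped + 1;
    END CATCH;

    FETCH NEXT FROM dedup_cursor INTO @Id, @Text;
END;

CLOSE dedup_cursor;
DEALLOCATE dedup_cursor;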
