How to delete duplicate items in an efficient way

Question

The dataset is structured like this:

   | text1 | text2|    
   | 23    |  43  |   
   | 44    |  23  |  
   | 23    |  44  |

After the deletion, the remaining rows should be:

   | text1 | text2|   
   | 23    |  43  |   
   | 23    |  44  |

If a.text1 == b.text2 and a.text2 == b.text1, the two rows are duplicates and one of them should be deleted.

As I have around one million rows, is there an efficient way to do this? I can use Python and a MySQL database if needed.

For tables without unique ID the only universal solution is to copy unique rows to a temporary table, truncate the original and insert from the temp. — PM 77-1, May 31 '14 at 19:18
@dkurbz Thanks guys. I have some methods, but they would take around 10 hours. I am here to ask whether there is a more efficient method. — Ding Ding, May 31 '14 at 19:22
For tables with unique IDs you already got answers in [your other question](http://stackoverflow.com/questions/22811865/how-to-remove-duplicate-items-in-mysql-with-a-dataset-of-20-million-rows?rq=1). — PM 77-1, May 31 '14 at 19:22
Well... Unless you plan to run MySQL on your old laptop it wouldn't take that long. — PM 77-1, May 31 '14 at 19:25

score 2 · Accepted Answer · answered May 31 '14 at 19:26

The fastest way to do this type of deletion is often to truncate and re-insert: copy the distinct rows to a temporary table, empty the original table, then insert the rows back. Something like:

-- store each pair as (smaller, larger) so that (44, 23) and (23, 44) collapse into one row
create temporary table t as
    select least(text1, text2) as text1, greatest(text1, text2) as text2
    from dataset
    group by least(text1, text2), greatest(text1, text2);

truncate table dataset;

insert into dataset(text1, text2)
    select text1, text2
    from t;
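
Since the question allows Python, the same idea (store each pair as (smaller, larger) and keep one row per such pair) can also be sketched in memory. This is only an illustration and not part of the original answer: the function name dedupe_unordered_pairs is hypothetical, and it assumes all rows fit in memory, which seems plausible for around one million pairs.

```python
def dedupe_unordered_pairs(rows):
    """Keep one row per unordered pair, returned as (smaller, larger)."""
    seen = set()
    result = []
    for text1, text2 in rows:
        # Canonical form: smaller value first, mirroring least()/greatest() in the SQL.
        key = (text1, text2) if text1 <= text2 else (text2, text1)
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result


if __name__ == "__main__":
    # Sample rows from the question.
    rows = [(23, 43), (44, 23), (23, 44)]
    print(dedupe_unordered_pairs(rows))
```

For the sample data in the question this keeps (23, 43) and (23, 44). Like the SQL version, it normalizes the column order, so a surviving row may come back as (23, 44) even if only (44, 23) was stored.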

answered by Gordon Linoff

+1 and for an efficient way, instead of using a table for words, use `Trie`, where Nodes have a count – Khaled.K Jun 01 '14 at 06:29
