-2

The dataset structure like this:

   | text1 | text2|    
   | 23    |  43  |   
   | 44    |  23  |  
   | 23    |  44  |

After the deleting, the remain part should be:

   | text1 | text2|   
   | 23    |  43  |   
   | 23    |  44  |

If a.Text1 == b.text2 and a.text2 == b.text1, then delete one of both.

As I have around one million items, is there any efficient way to do this? I can use the Python and MySQL database if needed.

Ding Ding
  • 295
  • 3
  • 7

1 Answers1

2

The fastest way to do this type of deletion is often to do truncate and insert. Something like:

create temporary table t as
    select least(text1, text2) as text1, greatest(text1, text2) as text2
    from dataset t
    group by least(text1, text2), greatest(text1, text2);

truncate table dataset;

insert into dataset(text1, text2)
    select text1, text2
    from t;
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • +1 and for an efficient way, instead of using a table for words, use `Trie`, where Nodes have a count – Khaled.K Jun 01 '14 at 06:29