
I have a table called metadata which has a column called data of type TEXT. There are currently about 10 million rows in the metadata table.

I want to insert additional rows into the metadata table, but I need to ensure that no two rows have the same content in the data field (no duplicate data). Each data field holds a string that is about 100 to 1000 lines long, i.e. extremely large strings.

What is the most performant way to insert a new row while ensuring that there are no duplicates?

Raqib
  • add a unique constraint on that column – mike.k Jun 21 '18 at 01:01
  • See [this](https://stackoverflow.com/a/469553/4982088) answer. – Xorifelse Jun 21 '18 at 01:04
  • I'm afraid you won't be happy with the performance of such validation. If you have no other field (e.g. id), you have to create a unique constraint on the column `data`, as suggested by @mike.k – Jim Jones Jun 21 '18 at 07:22
  • @JimJones, what would be some ways to improve the performance? Could you please explain why the performance would be affected, or point me to a useful resource? :) – Raqib Jun 28 '18 at 20:49
  • @raqib could you add some samples of this metadata table to your question? If there is no other way to uniquely identify the records, you have no other choice than to create a PK using the `data` field. The inserts will be slow because the database has to look for a string match on the whole table in order to tell whether the record already exists or not. A numerical comparison, for example, is much more efficient, as sketched below.. that's why I asked if there was any way to add an id to the records. – Jim Jones Jun 29 '18 at 16:29
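
The following is a minimal sketch of what the comments describe, assuming PostgreSQL (the question does not name the engine; `md5()` and `ON CONFLICT` as used here are PostgreSQL-specific, the latter requiring 9.5+). The constraint and index names are illustrative, and the two options are alternatives, not meant to be combined:

```sql
-- Option A: a unique constraint directly on the text column, as mike.k suggests.
-- Simple, but every insert must compare potentially huge strings, and very long
-- values may not even fit in a btree index entry (roughly one third of an 8 kB page).
ALTER TABLE metadata ADD CONSTRAINT metadata_data_key UNIQUE (data);

-- Option B: enforce uniqueness on a fixed-size digest of the text instead.
-- The unique index then stores 32-character hex hashes, so lookups and comparisons
-- stay cheap no matter how large the data values get.
CREATE UNIQUE INDEX metadata_data_md5_idx ON metadata (md5(data));

-- An insert that silently skips rows whose content already exists; any violated
-- unique constraint or index (including the expression index above) triggers the skip.
INSERT INTO metadata (data)
VALUES ('...large text...')
ON CONFLICT DO NOTHING;
```

The digest route trades a theoretical hash-collision risk for small, fixed-size index entries; if that risk matters, the conflicting row's data can still be compared explicitly before trusting the skip.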

0 Answers