4

I want to make the value of a TEXT field unique in my MySQL table.

After a little research I discovered that everybody discourages using a UNIQUE INDEX on TEXT fields because of performance issues. What I want to do now is:

1) create another field to contain hash of TEXT value (md5(text_value))

2) put a UNIQUE index on this hash field

3) use INSERT IGNORE in queries
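
A minimal sketch of the three steps above, assuming a hypothetical table `documents` with columns `body` and `body_md5` (these names are illustrative and not part of the original setup):

```sql
-- Hypothetical table and column names, chosen only for illustration.
CREATE TABLE documents (
    id       INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    body     TEXT NOT NULL,
    body_md5 CHAR(32) NOT NULL,          -- step 1: hex MD5 of body
    UNIQUE KEY uq_body_md5 (body_md5)    -- step 2: uniqueness enforced on the hash
);

-- Step 3: the second statement hits the UNIQUE index and is skipped
-- instead of raising a duplicate-key error.
INSERT IGNORE INTO documents (body, body_md5)
VALUES ('This is some text.', MD5('This is some text.'));
INSERT IGNORE INTO documents (body, body_md5)
VALUES ('This is some text.', MD5('This is some text.'));

SELECT COUNT(*) FROM documents;  -- 1
```

Note that in this sketch the application is responsible for keeping `body_md5` in sync with `body` on every INSERT and UPDATE.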

Is this solution complete, secure and optimal? (found it on SO)

Is there a better way of achieving this goal?

Gury Max
  • Use `VARCHAR(32)` or `CHAR(32)`. See this other topic: http://stackoverflow.com/questions/247304/mysql-what-data-type-to-use-for-hashed-password-field-and-what-length You can make this field UNIQUE and whatever else you want. – JoDev Mar 08 '13 at 13:24
  • Seems a good task for a trigger. See http://dev.mysql.com/doc/refman/5.0/en/create-trigger.html – Ghigo Mar 08 '13 at 13:27
  • If you did what you suggest above, you would only be doing manually what MySQL would do if you created a UNIQUE index that uses hashes. And most probably it wouldn't be as fast. – 0xCAFEBABE Mar 08 '13 at 14:25
  • @0xCAFEBABE Thanks for pointing it out. Do you have any other idea how to solve this? – Gury Max Mar 08 '13 at 14:43
  • Bear in mind that two TEXT values which differ only by the number of trailing spaces will generate different hashes, but a unique index on the TEXT field itself would throw duplicate-key errors. – APC Mar 09 '13 at 20:55
  • @APC I'm not sure I follow. You mean that I shouldn't create this hash field with UNIQUE index? – Gury Max Mar 10 '13 at 11:42
  • What I mean is, these two strings will generate different hashes: "This is some text." and "This is some text. ". I don't think that fits any reasonable definition of *unique text*. However, only you know the actual business rules you're trying to enforce, so maybe it will be okay. – APC Mar 10 '13 at 11:47
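
The trailing-space point raised in the comments can be checked directly. A small sketch using only the two example strings quoted above:

```sql
-- The two strings differ only by one trailing space, yet their hashes differ.
SELECT MD5('This is some text.')  AS without_space,
       MD5('This is some text. ') AS with_space,
       MD5('This is some text.') = MD5('This is some text. ') AS same_hash;  -- same_hash is 0
```

If trailing whitespace should not count as a difference under your business rules, one option is to normalise the value before hashing, for example with `MD5(RTRIM(text_value))`. Whether that is appropriate depends on what "unique" is supposed to mean for your data.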

2 Answers

3

As I was asked in the comments how I would solve this, I'll write it as a response.

Being in such a situation suggests mistakes in the application design. Consider what that means.

You have a text whose length you cannot specify in advance, which can be extremely long (up to 64 KB), and which you want to keep unique. Imagine that amount of data split into separate keys and joined into a composite index to enforce uniqueness. This is what you're trying to do. In terms of integers, this would be a composite index over roughly 16,000 integers.

Consider further that character type fields (CHAR, VARCHAR, TEXT) are subject to interpretation through their character encoding, which further complicates the issue.

I'd highly recommend splitting the data up somehow. This not only frees the DBMS from incorporating variable length character blocks, but also might give some possibility of generating composite keys over parts of the data. Maybe you could even find a better storage solution for your data.

If you have questions, I'd suggest posting the table and/or database structure and explaining what logical data the TEXT field contains, and why you think it would need to be unique.

0xCAFEBABE
2

It’s almost complete. There is a chance (Birthday Paradox) that there will be a collision with a hash so a UNIQUE index alone isn’t enough.
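
To put a rough number on that chance, here is the standard birthday-bound approximation, assuming MD5 outputs behave like uniformly random 128-bit values and nobody is crafting inputs on purpose (both are assumptions; deliberately constructed MD5 collisions are known to be practical):

```latex
% n = number of stored texts, N = 2^{128} possible MD5 values
P(\text{collision}) \approx 1 - e^{-n(n-1)/(2N)} \approx \frac{n^2}{2N} = \frac{n^2}{2^{129}}

% Example: n = 10^9 rows gives
P(\text{collision}) \approx \frac{10^{18}}{2^{129}} \approx 1.5 \times 10^{-21}
```

So for accidental duplicates the risk is negligible. The comparison check mainly matters if the text can come from untrusted sources.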

You’re better off using a hash along with a comparison check to be completely safe.

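SELECT COUNT(*) FROM my_table        -- my_table: placeholder table name
WHERE md5hash = MD5(@new_text)       -- indexed hash lookup narrows the candidates
AND textvalue = @new_text;           -- full comparison rules out a hash collision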

This could be wrapped into an INSERT or UPDATE trigger, or maybe even a stored procedure for easy checking.
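
A rough sketch of what the INSERT trigger variant might look like. It assumes a hypothetical table `documents` with a `body` TEXT column and a `body_md5` CHAR(32) column that carries a UNIQUE index; none of these names come from the question. It also assumes MySQL 5.5 or later, since it uses `SIGNAL`.

```sql
-- Assumed schema (hypothetical): documents(body TEXT, body_md5 CHAR(32) UNIQUE)
-- DELIMITER is a mysql command-line client directive, not server-side SQL.
DELIMITER //

CREATE TRIGGER documents_before_insert
BEFORE INSERT ON documents
FOR EACH ROW
BEGIN
    -- Compute the hash in the database so callers cannot forget it.
    SET NEW.body_md5 = MD5(NEW.body);

    -- Same hash but different text means a genuine collision: fail loudly.
    IF EXISTS (SELECT 1
               FROM documents
               WHERE body_md5 = NEW.body_md5
                 AND body <> NEW.body) THEN
        SIGNAL SQLSTATE '45000'
            SET MESSAGE_TEXT = 'MD5 collision: different text with the same hash';
    END IF;
END//

DELIMITER ;
```

With this in place, exact duplicates are still rejected by the UNIQUE index on the hash, while the trigger only steps in for the rare case where two different texts share a hash. A matching BEFORE UPDATE trigger would be needed to cover changes to existing rows.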

Have a look at this Stack Overflow question for an example of a hash collision.

Community
Steve
  • Bear in mind that if the strings are meaningful text following some restrictive rules, such as those that define a natural language, the probability of a hash collision becomes vanishingly small. – eggyal Mar 08 '13 at 14:16
  • @eggyal I totally agree, very very very small ... but not impossible. – Steve Mar 08 '13 at 14:18