0

I need to obfuscate a MySQL table's IDs using a MySQL-only solution (no application code like PHP), and generate UNIQUE values.

Here's what I have managed to get working (using a MySQL trigger to update the row that was inserted with a comment_hash value. The problem is I get a "Duplicate Entry" error. I know hashes are not unique but how can I map a UNIQUE _comments.id value to a UNIQUE 4-character hash?)

DROP TABLE IF EXISTS _comments;
CREATE TABLE IF NOT EXISTS _comments (
    `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
    `comment_hash` VARCHAR(255) NOT NULL,
    `user_id` int(10) unsigned NOT NULL,
    PRIMARY KEY (`id`),
    UNIQUE KEY `comment_user_hash_parentid_unique` (`user_id`,`comment_hash`)
)  ENGINE=INNODB;


DROP TRIGGER IF EXISTS `test_add_comment_hash_on_insert_into_comments_table`;
DELIMITER //
CREATE TRIGGER test_add_comment_hash_on_insert_into_comments_table
BEFORE INSERT ON _comments
FOR EACH ROW
BEGIN
    SET NEW.comment_hash = right(md5(LAST_INSERT_ID()+1),4);
END
//
DELIMITER ;

And then in my seeder (I am using Laravel for the convenience of multiple inserts in a PHP for loop), but you can use anything for fast inserts.


<?php

use Illuminate\Database\Seeder;

use Faker\Generator as Faker;

// Call using command: php artisan db:seed --class=CommentsFakeDataSeeder
class WtfFakeDataSeeder extends Seeder
{
    /**
     * Seed the application's database.
     *
     * @return void
     */
    public function run(Faker $faker)
    {
        echo "Seeding _comments... \n";

        DB::statement('SET FOREIGN_KEY_CHECKS=0;');

        $fake_comments = [];

        for ($i = 0; $i < 10000; $i++) {
            $fake_comments[] = [
                'user_id'   => 1,
            ];
        }
        DB::table('_comments')->insert($fake_comments);

        DB::statement('SET FOREIGN_KEY_CHECKS=1;');
    }
}

I invoke the seeder with: php artisan db:seed --class=WtfFakeDataSeeder

But I always end up with an integrity constraint violation

Doctrine\DBAL\Driver\PDOException::("SQLSTATE[23000]: Integrity constraint violation: 1062 Duplicate entry '1-849b' for key 'comment_user_hash_parentid_unique'")

With lowercase-only alphanumeric characters allowed in the 4-character comment_hash, we should have 36^4 = 1679616 values. Yet, collisions occur with < 10K inserts as you can see below. md5() isn't the right solution here..do you know one that is? id's are always unique...whether or not I combine them with userID (to introduce more entropy), how can I map them to UNIQUE 4-character hash values?

kp123
  • 1,250
  • 1
  • 14
  • 24
  • Why are you using only 4 characters of the hash instead of the whole hash? If you use the whole hash, duplicates will probably never happen. – Barmar Sep 25 '19 at 23:47
  • 1
    MD5 returns a hex string, so there are only 16 different characters, and 16^4 is only 65536 values, and collisions are likely more often due to the birthday paradox. – Barmar Sep 25 '19 at 23:49
  • @Barmar I agree, I dont think md5 is the right approach. I'm looking for a better way so that we never encounter the birthday problem. I want a hash that utilizes at least 36 unique characters (a-z0-9) -> 36^4 = 1679616 values and most importantly, returns unique values. – kp123 Sep 26 '19 at 00:07
  • @Barmar I need to use 4 (max 5) characters to keep the URL length small. – kp123 Sep 26 '19 at 00:08
  • @Barmar this is incorrectly marked as a duplicate of https://stackoverflow.com/questions/5438760/generate-random-5-characters-string I am not looking for a PHP solution. This has to be in MySQL – kp123 Sep 26 '19 at 19:22
  • Sorry about that, try this: https://stackoverflow.com/questions/28962320/generate-a-random-string-in-mysql – Barmar Sep 26 '19 at 19:24
  • @Barmar There are several ways of generating random strings in MySQL....the uniqueness problem is what I am trying to solve for here. – kp123 Sep 26 '19 at 19:28
  • Think of it like this..I am trying to map (approximately) 1.6M UNIQUE numbers (0-1,600,000) each to a UNIQUE 4-character string in MySQL, with each character having at least 36 possible values (a-z0-9 if lowercase only, a-zA-Z0-9 if uppercase, which isn't the default "collation" in MySQL). – kp123 Sep 26 '19 at 19:50
  • The only way to guarantee uniqueness is to check if the string is already in the table, and repeat until you get one that isn't. If you make the string long enough you can just rely on the chance of duplication being infinitessimal. – Barmar Sep 26 '19 at 19:51
  • @Barmar I'm sure there's a way to "map" values in a unique way instead of having to use a sub-optimal algorithm like md5 and do a check for duplicates. I know this is not an easy problem to solve...but I don't see why this wouldn't be (theoretically) possible. – kp123 Sep 26 '19 at 20:21
  • You could write a procedure that fills a table with all 1.6M combinations, then reorders it randomly. – Barmar Sep 26 '19 at 20:22
  • @Barmar even with a procedure, there would need to be an algorithm that creates that sequence of 4-character strings in MySQL, with each character having at least 36 possible values. The "randomization" algo could be made deterministic so that a temporary table isn't needed. Can you think of a way to do that? – kp123 Sep 26 '19 at 22:28
  • 1
    You don't generate the strings randomly, you generate all combinations in order. Then you use a random shuffle to associate each of them with a different ID. – Barmar Sep 27 '19 at 16:31

0 Answers0