
I am going to store filenames and other details in a table, and I am planning to use the SHA-1 hash of the filename as the PK.

  • Q1. A SHA-1 PK will not be a sequentially increasing/decreasing number. So, will it be more resource-consuming for the database to maintain and search an index on that key, if I decide to keep it in the database as a 40-char value?

  • Q2. I read here: https://stackoverflow.com/a/614483/986818 about storing the data as a binary(20) field. Can someone advise me in this regard:

  • a) do I have to create this column as: TYPE=integer, LENGTH=20,
    COLLATION=binary, ATTRIBUTES=binary?
  • b) how do I convert the SHA-1 value in MySQL or Perl to store it in the table?
  • c) is there a danger of duplicates for this 20-byte value?

---------UPDATE-------------

The requirement is to search the table by filename: the user supplies a filename, I search the table, and if the filename is not there I add it. So I either index the varchar(100) filename field, or generate a column with the SHA-1 of the filename, hoping it would be easier for MySQL to index than a varchar field. I can then search against the SHA-1 column using the SHA-1 value computed in my program. What say? Primary key or just an indexed key: I chose PK because DBIx likes using a PK, and I assumed a PK and an INDEX+UNIQUE would be the same amount of overhead for the system.
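The intended flow above can be sketched as follows. This is an illustrative Python sketch only (the question uses Perl and MySQL; sqlite3 stands in here because a MySQL driver isn't in the standard library): hash the filename, look the hash up, and insert the row only if it is absent.

```python
import hashlib
import sqlite3

# Illustrative stand-in schema: 20-byte SHA-1 digest as the key.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (sha1 BLOB PRIMARY KEY, filename TEXT)")

def add_if_missing(name: str) -> bool:
    """Return True if the filename was newly inserted, False if present."""
    key = hashlib.sha1(name.encode("utf-8")).digest()  # 20 raw bytes
    if conn.execute("SELECT 1 FROM files WHERE sha1 = ?", (key,)).fetchone():
        return False
    conn.execute("INSERT INTO files VALUES (?, ?)", (key, name))
    return True

assert add_if_missing("a.txt") is True
assert add_if_missing("a.txt") is False   # second attempt finds it
```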

rajeev
    Why use the hash as a PK and not just a unique column with a conventional PK (Guid/int)? – Jared Aug 19 '12 at 20:05
  • Also, do keep in mind that hashes are NOT unique. An infinite number of data inputs map to the same hash value (although I don't know whether, for a specific hash algorithm, this would be the case of every possible hash value. There might be some with just a finite number of data inputs for certain hash values). – Christian Stieber Aug 19 '12 at 20:12
  • Am I missing something here? I assume your file names have a location attached to them? As name/location should be unique, just use an ordinary auto-increment PK with a separate unique key on file name and location. – Ben Aug 19 '12 at 20:16

5 Answers


There is no reason to use a cryptographically secure hash here. Instead, if you do this, use an ordinary hash. See here: https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed

The hash is NOT a 40-char value! It's a 160-bit number, and you should store it that way, as a 20-byte binary field. Edit: I see you mentioned that in comment 2. Yes, you should definitely do that, but I can't tell you how since I don't know what programming language you are using. Edit 2: I see it's Perl. Sorry, I don't know the exact conversion in Perl, but look for the "pack" functions.
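To make the hex-vs-binary distinction concrete, here is a short Python sketch (the question uses Perl, where Digest::SHA's sha1() already returns the raw bytes and pack('H*', $hex) converts a hex digest): the 40-character hex string and the 20-byte raw digest are two encodings of the same 160-bit number.

```python
import hashlib

filename = "example.txt"  # hypothetical filename for illustration

hex_digest = hashlib.sha1(filename.encode("utf-8")).hexdigest()  # 40 hex chars
raw_digest = bytes.fromhex(hex_digest)                           # same value, 20 raw bytes

assert len(hex_digest) == 40
assert len(raw_digest) == 20
assert raw_digest.hex() == hex_digest
```

Storing the 20-byte form halves the index width compared to the hex string.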

No, do not create it as type integer. MySQL's largest integer type, BIGINT, is only 64 bits, which doesn't hold the entire 160-bit value. Although you could really just truncate the hash to 64 bits without real harm.
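A minimal sketch of that truncation idea in Python (hypothetical filename; in Perl you would unpack the first bytes of the digest): take the first 8 bytes of the digest as an unsigned 64-bit integer, which fits a BIGINT UNSIGNED column.

```python
import hashlib

# Truncate the 160-bit digest to 64 bits so it fits a 64-bit integer column.
digest = hashlib.sha1(b"example.txt").digest()   # 20 bytes
key64 = int.from_bytes(digest[:8], "big")        # first 8 bytes -> 0 .. 2**64-1

assert 0 <= key64 < 2**64
```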

It's better to use a simpler hash anyway. You could risk it and ignore collisions, but if you do it properly you kind of have to handle them.

Ariel
  • There is no reason to use a hash at all. Hashes have collisions. You don't want to have collisions in your primary key. Not even if collisions are "really rare". – Tomalak Aug 19 '12 at 20:12

If I decide to keep it in database as 40 char value.

Using a 40-character string as a key will degrade performance: the index entries are ten times larger than a 4-byte integer's, so fewer fit per B-tree page and each comparison costs more.

Also, a PK is supposed to be unique. Although you will probably be unlikely to end up with collisions, theoretically using a hash function to create the PK seems inappropriate.
Additionally, anyone knowing the filename and the hash you use would know all your database IDs. I am not sure whether that matters to you, but it is something to consider.

Cratylus

Q1: Yes, it will need to build up a B-tree whose nodes contain not just one 4-byte integer but a CHAR(40). Lookup speed would be approximately the same as long as the index is kept in memory; but since the entries are about ten times bigger, you need about ten times more memory to keep it there. BUT: you probably want to look up by the hash anyway, so you'll need it either as the primary key OR as an index.

Q2: Just create a table field like CREATE TABLE test (ID BINARY(20), ...); later you can use INSERT INTO test (ID, ...) VALUES (UNHEX(SHA1('filename')), ...); — UNHEX() turns the 40-hex-char digest into the 20 raw bytes.
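The same idea with parameter binding instead of UNHEX(), as an illustrative Python sketch (sqlite3 stands in for MySQL, since a MySQL driver isn't in the standard library; with MySQL you would bind the 20 bytes to a BINARY(20) column the same way):

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (id BLOB PRIMARY KEY, filename TEXT)")

name = "example.txt"  # hypothetical filename
digest = hashlib.sha1(name.encode("utf-8")).digest()  # 20 raw bytes, no hex round-trip

# Bind the raw bytes directly; the driver handles the binary encoding.
conn.execute("INSERT INTO test (id, filename) VALUES (?, ?)", (digest, name))

row = conn.execute("SELECT filename FROM test WHERE id = ?", (digest,)).fetchone()
assert row == ("example.txt",)
```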

-- Regarding: Is there a danger of duplicates for this 20-byte value?

The chance of two random values colliding is 1 in 2^(8*20) = 2^160, i.e. about 1 in 1.46 × 10^48. So a collision is very, very improbable.

SDwarfs
  • Ty. Hi, I wanted to know the complete details of the values supplied to build the column (I am using phpMyAdmin to work on my database). – rajeev Aug 19 '12 at 20:25
  • You'll need to create a column of type BINARY(20); you either can select this as the column type (maybe not available in the UI)... or you need to write your own SQL and use the EXECUTE SQL command of phpMyAdmin. – SDwarfs Aug 19 '12 at 20:37
  • In your source code you can use UNHEX('....SHA1 HASH.....') to convert the data from hex (containing only 0123456789abcdef) into binary (which is more compact: half the size). – SDwarfs Aug 19 '12 at 20:41
  • It's probably easier to use md5() for hashing if you just need to find duplicates. If you use a normal INT as primary key, you only need the slow hash lookup for uploads; for downloads you can use the ID supplied in the link itself. If you don't want people to just change the ID from 10002 to 10003 and get the next file, you can put the hash into the link too and check whether it matches the file entry. For simplicity you can store the hash as-is in a string column (e.g. VARCHAR(40), or VARCHAR(32) for md5). – SDwarfs Aug 19 '12 at 20:47

I would stick with the standard auto-incrementing integer for the primary key. If uniqueness of file names is important (which it sounds like it is), then you can add a UNIQUE constraint on the file name itself or some derived, canonical version of the file name. Most languages/frameworks have some sort of method for getting a canonical version of a path (relative to absolute, standardized case, etc).

If you implement my suggestion or pursue your original plan, then you should be aware that multiple strings can map to the same file: two different spellings of a path will have different hashes and both pass the uniqueness constraint, yet refer to the same file on disk. Whether this happens depends on the operating system, and it may or may not be a problem for you. Just something to keep in mind.
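The canonicalization step mentioned above might look like this in Python (an illustrative sketch; the helper name is hypothetical): collapse "." and ".." segments and normalize case before hashing or storing, so that different spellings of the same path compare equal.

```python
import os.path

def canonical(path: str) -> str:
    # normpath collapses "." and ".." segments; normcase lowercases on Windows.
    return os.path.normcase(os.path.normpath(path))

# Two spellings of the same relative path normalize identically.
assert canonical("a/b/../c/./d.txt") == canonical("a/c/d.txt")
```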

colithium

Ok, then use a very -short- hash on the filename and accept collisions. Use an integer type for it (that's much faster!). E.g. you can use md5(filename), take the first 8 characters, and convert them to an integer. The SQL could look like this:

CREATE TABLE files (
  id INT auto_increment,
  hash INT unsigned,
  filename VARCHAR(100),

  PRIMARY KEY(id),
  INDEX(hash)
);

Then you can use:

SELECT id FROM files WHERE hash=<hash> AND filename='<filename>';

The hash sorts out most other files (normally all of them), and the filename then selects the right entry among the few hash collisions.

For generating an integer hash key in Perl I suggest using md5() and pack().
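For illustration, here is the same short-hash scheme sketched in Python rather than Perl (the helper name is hypothetical): the first 8 hex characters of the MD5 digest, parsed as an integer, always fit the INT UNSIGNED column from the schema above.

```python
import hashlib

def short_hash(filename: str) -> int:
    """First 8 hex chars of md5(filename) as an unsigned 32-bit integer."""
    return int(hashlib.md5(filename.encode("utf-8")).hexdigest()[:8], 16)

h = short_hash("hello")
assert h == 0x5D41402A   # md5("hello") starts with 5d41402a...
assert 0 <= h < 2**32    # always fits INT UNSIGNED
```

The query then becomes `WHERE hash=short_hash(name) AND filename=name`, exactly as in the SELECT above.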

SDwarfs
  • Though I agree with all of the above, and appreciate everyone's time and effort, I think this suggestion addresses the real issue: it minimizes resource usage and programming steps when looking up a filename in a table. – rajeev Aug 20 '12 at 15:07
  • Note that matching filenames by hash is case-sensitive. If you want it case-insensitive convert the string to lowercase (or uppercase) before hashing. – SDwarfs Aug 20 '12 at 15:37