This question doesn't need any code; it's a conceptual question about MD5 hashing.

My app manages a community of users.

I use MD5 hashing to reduce a user nickname of arbitrary length to a hash. I expect the MD5 of every nick to be different, because MD5(nick) will effectively be the user ID for each user.

Is this always true? I suspect I'm missing something and that collisions are possible in the long term (millions of users means millions of different nicks of different lengths).

devnull
rupps
  • why don't you just use the nick as userid? – Jens Schauder Dec 22 '13 at 09:39
  • Yes, there can be collisions. It's unlikely though. Plus MD5 is broken. – Mitch Wheat Dec 22 '13 at 09:39
  • @JensSchauder a subsystem uses this mapping to store files on a server. We're just wondering whether collisions are something to be concerned about; for server efficiency it's really convenient that all folders are just hex numbers – rupps Dec 22 '13 at 09:40
  • You can combine 2 hashes into 1, e.g. the first 16 chars from md5 and the first 16 chars from sha1. It is very unlikely that two inputs would produce the same combined hash in this situation... – Krzysiek Dec 22 '13 at 09:42
  • @MitchWheat That it's broken I think is not a concern for us because it's not used for security, but just for mapping usernames to a directory structure – rupps Dec 22 '13 at 09:42
  • @Krzysiek Good idea, but I was trying to avoid that because the md5's are generated directly in the database, which is very quick, and I suppose computing a SHA will be much slower ... the optimization on hex numbers will be killed by the SHA overhead :( – rupps Dec 22 '13 at 09:45
  • I have a database of images and I have 2 columns, md5 and sha1, for checking uniqueness. In the db you can do something like this (SQL string positions start at 1): `CONCAT(SUBSTRING(MD5('Alice'),1,16),SUBSTRING(SHA1('Alice'),1,16))` – Krzysiek Dec 22 '13 at 09:51
  • just like me then!! gonna try the solution. Thanks for your help – rupps Dec 22 '13 at 10:17

1 Answer

MD5 collisions for random data (e.g. usernames) are rare enough that you'd probably never see one. The problem is that MD5's collision resistance has been broken, so an attacker could deliberately generate a pair of usernames that have the same hash, with whatever security and/or functionality implications that would have for your design.
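To put a rough number on "rare enough", here is a minimal sketch of the standard birthday approximation. It assumes MD5 output behaves like a uniformly random 128-bit value for non-adversarial input; the function name `collision_probability` is only illustrative.

```python
import math


def collision_probability(n_items: int, hash_bits: int = 128) -> float:
    """Approximate chance that at least two of n_items random inputs share a hash.

    Assumption: hash outputs are uniformly distributed over 2**hash_bits values.
    """
    space = 2.0 ** hash_bits
    # Birthday approximation: p ~= 1 - exp(-n(n-1) / (2N)).
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-n_items * (n_items - 1) / (2.0 * space))


if __name__ == "__main__":
    for n in (10**6, 10**9, 10**12):
        print(f"{n:>17,} nicks -> p ~= {collision_probability(n):.3e}")
```

This only covers accidental collisions. It says nothing about collisions an attacker constructs on purpose, which is the case described above.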

The usual way to generate a short identifier in your situation is to simply associate each username with a sequentially-generated number in the account database. The application uses the number internally, and only references the username when it needs to display something to a user.
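As a rough sketch of that approach, assuming an SQLite database and a hypothetical `users` table (neither comes from the question), the database can hand out the numbers and the application can still render them as fixed-width hex folder names:

```python
import sqlite3

# In-memory database for illustration only; the table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users ("
    " id INTEGER PRIMARY KEY AUTOINCREMENT,"
    " nick TEXT NOT NULL UNIQUE)"
)


def user_id_for(conn: sqlite3.Connection, nick: str) -> int:
    """Return the numeric ID for nick, creating the row the first time it is seen."""
    conn.execute("INSERT OR IGNORE INTO users (nick) VALUES (?)", (nick,))
    row = conn.execute("SELECT id FROM users WHERE nick = ?", (nick,)).fetchone()
    return row[0]


def folder_name(user_id: int) -> str:
    """Render the ID as an 8-digit hex string, e.g. for a directory name."""
    return f"{user_id:08x}"


if __name__ == "__main__":
    for nick in ("alice", "bob", "alice"):
        uid = user_id_for(conn, nick)
        print(nick, uid, folder_name(uid))
```

The `UNIQUE` constraint on `nick` is what guarantees one ID per username; unlike a hash, the ID cannot be derived on the client without a lookup, which is the trade-off raised in the comments.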

Mark
  • Don't you think that if the usernames can only contain a reduced set of chars and have a max length, it would be nearly impossible to find a collision that also matches the criteria? The thing is, in my situation, if I can rely on this md5 trick I get a lot of desirable benefits (like knowing the ID for any username on the client side without ever asking the db) – rupps Dec 22 '13 at 09:55
  • The reduced set of characters won't help much if you're using an 8-bit encoding such as ISO-8859-1 or Windows-1252 (maybe 15% of characters are invalid); it'll help even less for UTF-16 (maybe 5% invalid). – Mark Dec 22 '13 at 10:02
  • I may add a SHA1 part as @Krzysiek suggests – rupps Dec 22 '13 at 10:21