
I need to design a way to provide a hash for every document stored in my application.

Existing hash libraries (BCrypt, etc.) and even BSON ObjectId generate a nice "hash" or "key", but the result is quite long.

I also understand that the only way to get a short hash is to hash fewer strings (if I'm not mistaken), for example by hashing Long IDs starting from 0, 1, 2, 3 and so on.

Although this is easy to think of, it is fairly hard to implement in the Google App Engine (GAE) Datastore, or maybe I just haven't come across this need until now.

The GAE Datastore stores entities across servers and even across datacenters, and auto-incrementing IDs are not really what it is designed for.

What could be the strategy to achieve this?

Dan McGrath
quarks
  • 1. Some hash algorithms like MD5 and SHA-1 always produce output of the same length. 2. Why do you want a short hash? – marcadian Apr 09 '13 at 19:46
  • MD5 would be ok if GAE datastore is good at doing ID auto-increment –  Apr 09 '13 at 20:25
  • You are mistaken. A message digest algorithm takes an arbitrary-length byte array as input and returns a fixed-length byte array as output (usually around 16 or 20 bytes) – JB Nizet Apr 10 '13 at 21:14

1 Answer


As far as I understand, you are looking for a way to generate short, unique, alphanumeric identifiers for your documents. This is the kind of thing URL shorteners do (see the questions Making a short URL similar to TinyURL.com, What's the best way to create a short hash, similiar to what tiny Url does?, How to make unique short URL with Python?, etc.). My answer is based on this assumption.

The datastore generates unique auto-incremented IDs, so you can rely on that. Multiple data centers are not a problem: your IDs will be unique and short (at least initially), and there are no collisions. This is probably how TinyURL and similar services accomplish it.

You can even request one or more unique IDs before you persist your new document in the datastore by using the DatastoreService.allocateIds(), for example:

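// Obtain the datastore service; "MyDocumentModel" is just an example kind name
DatastoreService dataService = DatastoreServiceFactory.getDatastoreService();
KeyRange keyRange = dataService.allocateIds("MyDocumentModel", 1);
long uniqueId = keyRange.getStart().getId();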

You can then "hash" this ID or you could get an even shorter alphanumeric ID by simply transcoding the integer ID to Base64 (or Base36 or some other base where you define your own characters, e.g., omitting vowels can help you avoid generating obvious swear words accidentally).
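To illustrate the transcoding idea, here is a minimal sketch in plain Java. The class name `ShortId` and the alphabet (digits plus lowercase consonants, 31 characters) are my own assumptions for illustration and are not part of the GAE SDK; pick whatever character set suits you.

```java
/** Hypothetical helper (not part of the GAE SDK): converts a numeric ID to a short string and back. */
final class ShortId {
    // Assumed alphabet: digits plus lowercase consonants, so no vowels can appear.
    private static final String ALPHABET = "0123456789bcdfghjklmnpqrstvwxyz";
    private static final int BASE = ALPHABET.length();

    private ShortId() {}

    /** Encodes a non-negative ID using the custom alphabet. */
    static String encode(long id) {
        if (id < 0) {
            throw new IllegalArgumentException("id must be non-negative");
        }
        if (id == 0) {
            return String.valueOf(ALPHABET.charAt(0));
        }
        StringBuilder sb = new StringBuilder();
        while (id > 0) {
            // Least significant digit first; the string is reversed at the end.
            sb.append(ALPHABET.charAt((int) (id % BASE)));
            id /= BASE;
        }
        return sb.reverse().toString();
    }

    /** Decodes a string produced by encode(); does not guard against overflow on very long input. */
    static long decode(String s) {
        long id = 0;
        for (int i = 0; i < s.length(); i++) {
            int digit = ALPHABET.indexOf(s.charAt(i));
            if (digit < 0) {
                throw new IllegalArgumentException("invalid character: " + s.charAt(i));
            }
            id = id * BASE + digit;
        }
        return id;
    }
}
```

You would call `ShortId.encode(uniqueId)` on the allocated ID and `ShortId.decode(...)` to get the ID back when looking the document up. If plain Base36 is good enough, the JDK's `Long.toString(id, 36)` and `Long.parseLong(s, 36)` do the same job without a custom alphabet.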

If predictability is an issue, you can prefix or suffix this alphanumeric ID with some random characters.
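A rough sketch of the random-suffix idea, again in plain Java. The class `RandomSuffix`, the method name and the alphabet are hypothetical names chosen for this example; only `java.security.SecureRandom` comes from the JDK.

```java
import java.security.SecureRandom;

/** Hypothetical helper: appends a fixed number of random characters to a short ID. */
final class RandomSuffix {
    // Assumed alphabet: digits plus lowercase consonants (same idea as omitting vowels).
    private static final String ALPHABET = "0123456789bcdfghjklmnpqrstvwxyz";
    private static final SecureRandom RANDOM = new SecureRandom();

    private RandomSuffix() {}

    /** Returns shortId followed by randomChars characters drawn from ALPHABET. */
    static String withSuffix(String shortId, int randomChars) {
        StringBuilder sb = new StringBuilder(shortId);
        for (int i = 0; i < randomChars; i++) {
            sb.append(ALPHABET.charAt(RANDOM.nextInt(ALPHABET.length())));
        }
        return sb.toString();
    }
}
```

With a fixed suffix length you can strip the random part before decoding the ID. For the suffix to actually make guessing harder, it would have to be stored with the document and checked on lookup.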

zengabor
  • Yes, exactly this is what I am actually trying to do. However when you say auto-increment? Are you saying when using JDO or Objectify? Because creating raw entities you need to put a kind + key combination – quarks Apr 11 '13 at 14:27
  • E.g., `AdminDatastoreService.allocateIds("MyKind", 5)` gives you the next 5 unique keys generated by the datastore for the `MyKind` model. Inside these keys there are the unique integer IDs. – zengabor Apr 11 '13 at 14:34
  • Sorry, here is the correct link: https://developers.google.com/appengine/docs/java/javadoc/com/google/appengine/api/datastore/DatastoreService#allocateIds%28java.lang.String,%20long%29 – zengabor Apr 11 '13 at 14:45
  • You said unique integer IDs; however, the IDs alone are not unique enough, right? The real unique key is the Datastore key itself, am I right? – quarks May 04 '13 at 17:23
  • The datastore is generating auto-incremented unique IDs for each model. This integer ID is guaranteed to be unique for that model for your application across all instances. So, if you always use the integer ID for this specific model (e.g., `MyDocumentModel`) then the entire key will be unique. I am using the same technique in my own application. – zengabor May 05 '13 at 05:14
  • The generated hash must be also unique, of course. So a simple transcode to Base64 is probably a good idea. – zengabor May 05 '13 at 05:18