3

I'm fairly new to Matlab, although not to programming. I'm trying to hash a string and get back a single value that acts as a unique ID for that string. I'm using the DataHash function from the File Exchange, which returns the hash as an integer vector. So far, the best solution I've found for converting this to a single numeric value is:

hash_opts.Format = 'uint8';
hash_vector = DataHash(string, hash_opts);
hash_string = num2str(hash_vector);
% Use a simple regex to remove all whitespace from the string,
% takes it from '1 2 3 4' to '1234'
hash_string = regexprep(hash_string, '[\s]', '');
hashcode = str2double(hash_string);

A reproducible example that doesn't depend on DataHash:

hash_vector = [1, 23, 4, 567];
hash_string = num2str(hash_vector);
% Use a simple regex to remove all whitespace from the string,
% takes it from '1 2 3 4' to '1234'
hash_string = regexprep(hash_string, '[\s]', '');
hashcode = str2double(hash_string); % Output: 1234567

Are there more efficient ways of achieving this, without resorting to a regex?

Marius

2 Answers

7

Yes, Matlab's regex implementation isn't particularly fast. I suggest that you use strrep:

hashcode = str2double(strrep(hash_string,' ',''));

Alternatively, you can use a string creation method that doesn't insert spaces in the first place:

hash_vector = [1, 23, 4, 567];
hashcode = str2double(sprintf('%d',hash_vector))

Just make sure that your hash number is less than 2^53, the point beyond which a double can no longer represent every integer exactly; otherwise the conversion might not be exact.
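To illustrate the 2^53 limit, here is a minimal sketch. It assumes a release that provides `flintmax` (plain `2^53` works otherwise), and the 8-byte vector is made up for illustration rather than taken from DataHash.

```matlab
% Largest value up to which every integer is exactly representable in a double
limit = flintmax;                 % equals 2^53

% Above the limit, adjacent integers are no longer distinguishable
collides = (limit + 1 == limit);  % true: 2^53 + 1 rounds back to 2^53

% Illustrative 8-byte vector: its concatenated decimal digits exceed 2^53
hash_vector = [200, 150, 100, 250, 30, 77, 128, 255];
digits = sprintf('%d', hash_vector);
hashcode = str2double(digits);
too_big = hashcode > limit;       % true, so hashcode is only an approximation
```

Even a short vector overshoots the limit quickly, so a full-length hash will generally not survive this conversion exactly.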

horchler
3

I've seen there's already an answer, though it loses precision because it omits leading 0s. I'm not sure whether that will cause you trouble, but I wouldn't want to rely on it.
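The concern about leading zeros can be made concrete with a small sketch. The two vectors are made up for illustration, and the zero-padded variant is just one possible workaround, not something from the other answer.

```matlab
% Two different byte vectors (illustrative values)
a = [1, 23];
b = [12, 3];

% Unpadded decimal concatenation yields '123' for both
same_unpadded = strcmp(sprintf('%d', a), sprintf('%d', b));    % true

% Zero-padding each value to 3 decimal digits keeps the strings distinct:
% '001023' versus '012003'
same_padded = strcmp(sprintf('%03d', a), sprintf('%03d', b));  % false
```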

As you output as uint8, why not use hex values instead? This will give you exactly the same number, and converting back is also easy using dec2hex.

hash_vector = [1, 23, 4, 253]
hash_str=sprintf('%02x',hash_vector); % pad so that every byte uses exactly 2 hex digits
hash_dig=hex2dec(hash_str)
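As a sketch of the reverse direction, the number can be turned back into the byte vector with `dec2hex`, provided the number of bytes is known (here 4, so 8 hex digits). Like `str2double`, `hex2dec` returns a double, so the round trip is only exact for values below 2^53, which means at most 6 bytes. The names `n_bytes`, `hex_again` and `recovered` are illustrative.

```matlab
hash_vector = [1, 23, 4, 253];
hash_str = sprintf('%02x', hash_vector);     % '011704fd'
hash_dig = hex2dec(hash_str);                % 18285821

% Back to hex, padded to 2 digits per byte so leading zeros are restored
n_bytes = numel(hash_vector);
hex_again = dec2hex(hash_dig, 2 * n_bytes);  % '011704FD' (upper case)

% Split into 2-character rows and convert each row to a byte value
recovered = hex2dec(reshape(hex_again, 2, [])')';
```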

By the way, your sample hash contains 567, which is an impossible value for uint8.


Having looked at DataHash, the question would also be why not use base64 or hex in the first place.

bdecaf
  • Thanks, I tried to include some context because I knew there were multiple points where I might have gone wrong. It's probably more sensible to use the hex values right from the start. – Marius May 27 '13 at 08:27