
I have a fairly large set of data coming from an external source (via Excel or CSV). The records have no unique key. Each row is unique only through a combination of 3-4 of its columns. I'm parsing this data and inserting it into a database.

What would be the best way to generate a hash code or some key based on these unique columns? I need it to be unique based on these columns because I need to compare it to another set of data from yet another source.

I could just concatenate the data and use that as the key, but I'd prefer a smaller generated hash code (SHA-1, MD5, whatever) to use as the key in the database when I'm loading the data.

I'm leaning towards using the Apache Commons DigestUtils and passing it a String of the concatenated columns to generate a SHA-1 digest, but I'm wondering if that's overkill.
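A minimal sketch of that idea, for illustration only. It uses the JDK's own `MessageDigest` rather than DigestUtils so that it has no external dependency, and it feeds each column to the digest with a length prefix so that adjacent columns cannot run together. The class name `RowKey` and the method `sha1Key` are hypothetical names, not part of any library.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class RowKey {
    // Hypothetical helper: derives a SHA-1 hex key from the columns that identify a row.
    static String sha1Key(String... columns) {
        MessageDigest sha1;
        try {
            sha1 = MessageDigest.getInstance("SHA-1");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
        for (String column : columns) {
            byte[] bytes = column.getBytes(StandardCharsets.UTF_8);
            // The 4-byte length prefix keeps ("12", "34") distinct from ("1", "234").
            sha1.update(ByteBuffer.allocate(4).putInt(bytes.length).array());
            sha1.update(bytes);
        }
        return toHex(sha1.digest());
    }

    // Renders the digest bytes as lowercase hexadecimal.
    private static String toHex(byte[] digest) {
        StringBuilder hex = new StringBuilder(digest.length * 2);
        for (byte b : digest) {
            hex.append(Character.forDigit((b >> 4) & 0xF, 16));
            hex.append(Character.forDigit(b & 0xF, 16));
        }
        return hex.toString();
    }
}
```

Calling `RowKey.sha1Key("12", "34", "56")` returns a 40-character hex string that could be stored as the key column. A hash cannot strictly guarantee uniqueness, although accidental SHA-1 collisions are extremely unlikely for data like this.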

Any suggestions? I'm not looking for super crypto secure - just something that will be unique to compare against.

Chromag
  • Are the strings the same length? You should consult this answer: http://stackoverflow.com/questions/2479348/is-it-possible-to-get-identical-sha1-hash – Grady G Cooper May 09 '15 at 03:55
  • Concatenating is a bad idea: 12, 34, 56 will generate the same hash as 1, 234, 56. Why not generate a sequential ID as a primary key, but compare the 3 columns when comparing with the other set of data? You don't need a cryptographic hash to compare 3 strings against 3 other strings. – JB Nizet May 09 '15 at 06:32
  • The strings that are the unique columns are the same length and should be exactly the same. I could just compare the three strings against each other. There is a LOT of data involved here so I was curious if there was a faster way (like hashing the columns) of comparing the data from each source. – Chromag May 09 '15 at 15:44
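The points raised in the comments can be sketched as follows: plain concatenation makes different rows look identical, while keeping the columns as separate values in a composite key does not, and a hash-based set still gives fast lookups when matching the two sources. This is only an illustrative sketch; `CompositeKeyDemo`, `key`, `concatKey`, and `matches` are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class CompositeKeyDemo {
    // Hypothetical helper: the identifying columns of one row, kept as separate values.
    // List.equals and List.hashCode compare element by element, so no ambiguity arises.
    static List<String> key(String... columns) {
        return Arrays.asList(columns.clone());
    }

    // Naive key: plain concatenation with no separator (shown only to demonstrate the problem).
    static String concatKey(String... columns) {
        return String.join("", columns);
    }

    // Keys from source A that also occur in source B, using a hash set for the lookup.
    static Set<List<String>> matches(List<List<String>> sourceA, List<List<String>> sourceB) {
        Set<List<String>> index = new HashSet<>(sourceB);
        Set<List<String>> common = new HashSet<>(sourceA);
        common.retainAll(index);
        return common;
    }
}
```

With this sketch, `concatKey("12", "34", "56")` and `concatKey("1", "234", "56")` both produce `"123456"`, whereas `key("12", "34", "56")` and `key("1", "234", "56")` are not equal.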
