1

I'm helping a colleague who has been asked to generate a key ID for two different groups of data coming in. I've completed this step but it's not very user friendly so I'm looking for suggestions on how to make it more readable. Each group has its own ID that appears to be a hexadecimal value. The concatenation of them appears to be a unique key in its own right. In this case, the Household table and the Account table are being brought together and she has been asked to generate at Household-Account value (a household can have many accounts, an account can span households).

Our data is stored on SQL server but we do most of our manipulations using SAS, hence, PROC SQL below.

My initial thought was that the most obvious key is to run the two key fields together and use a delimiter. You'll see that in the top portion of my code. However this makes a very long field so I was asked to shorten it. My second thought, and their initial ask, was to just do an integer field. You can see that with the Monotonic but they felt that since it has warnings about it around the internet they don't trust it. My third thought was to run the existing, concatenated field through some kind of one-way function but when I do that (see MD5 below) I get something that looks like wingdings took over.

/*  creating a table of just the "key" columns */
PROC SQL;
    CREATE TABLE work.ConcatonatedKey AS
        SELECT DISTINCT 
           CATX("G", HouseholdKey,FinancialKey) as Concatonated
        FROM work.OriginalData
    ;
QUIT;

/*  Populate HHFinancialKey */
/* Monotonic documentation */
/*  http://support.sas.com/techsup/notes/v8/15/138.html  */
PROC SQL;
    CREATE TABLE work.ContrivedKeys AS
        SELECT 
              Monotonic() AS HHFinID
            , Concatonated
            , MD5(Concatonated) As foo
        FROM work.ConcatonatedKey
    ;
QUIT;

So, the real question here is, if you had something that could uniquely ID a row but wanted to make it more user friendly, using SAS, how would you go about it. ?

  • Are you *sure* this is SQL Server? `MD5` is not a T-SQL Function, and neither is `CATX` or `Monotonic` and `QUIT` is not a T-SQL operator. Either way, delimited values in an RDBMS is a bad idea; if you want 2 values to be the primary key use 2 columns and a composite key. – Thom A Jun 03 '19 at 17:03
  • SQL Server is where the data is stored. What you're seeing above is PROC SQL as done in SAS. Going to edit to clarify the tags. – Michael Amos Jun 03 '19 at 17:07

2 Answers2

2

The SAS UUIDGEN function can return either human readable character string or a denser binary string. Per docs:

The UUIDGEN function returns a UUID (a unique value) for each cell. The default result is 36 characters long and it looks like:

5ab6fa40-426b-4375-bb22-2d0291f43319.

A binary result is 16 bytes long.

Example:

select
  ... 
  uuidgen() as myGroupId length=36
...
Richard
  • 25,390
  • 3
  • 25
  • 38
1

MD5 is probably the simplest solution. The MD5 function returns a 16 byte string as a result, but to make it human readable you can just format it using the $hex32. format. It's also very fast and widely supported.

data _null_;
  x = put(md5("some_string_here"),$hex32.);
  put x;
run;

Result:

BB28824D60AE6706F812CC940CAAAF1B

Just be careful that md5() is sensitive to case differences, and leading/trailing spaces. So you may want/need to upppercase everything and trim spaces prior to running it through the function to get consistent results across different platforms.

The risk of collisions is close to zero:

How many random elements before MD5 produces collisions?

Should also note that, knowing the two unhashed keys used to create the hash, you can recreate the hash from the keys, something that isn't possible with the uuidgen solution selected as the answer. Depending on your requirements this may or may not be a requirement.

Robert Penridge
  • 8,424
  • 2
  • 34
  • 55
  • 1
    @Lee from the question it seems unlikely that the asker tried applying the `$hex32.` format, so this answer still adds value. – user667489 Jun 04 '19 at 09:42