SQL sampling Natural key vs Surrogate key

Question

I have a table with the natural key with VARCHAR datatype. The incoming feed could be a VARCHAR or BIGINT natural key but stored as VARCHAR datatype. We generate the surrogate keys using the natural keys. The incoming daily feed has gaps and islands on the natural keys but eventually filled over the course of catchup. However, the surrogates generated on top of the natural keys during the loading process doesn't guarantee the same ordinality due to intermittent gaps in the numbers.

Now, I need to include a sampling mechanism on the dataset to load only 5% of the volume and cascading it to the daily catchups as well. Since the natural key is a VARCHAR column I cannot provide a sampling using MOD function. But, if I use the Surrogate key column for the sampling, then over the period of time, the dataset grows beyond the expected 5%.

For example,

Day    NK       SK  
1      1        1
1      3        2
2      2        3
3      4        4
3      5        5
3      7        6
4      6        7
.      .        .
.      .        .
.      .        .

Is there a cleaner way to circumvent this?

Is there a reason you can't cast the NK as a BIGINT, then use MOD on that? — mc110, Feb 20 '18 at 09:37
@mc110 There are cases where it could really turn out to be a VARCHAR. So casting will break at some point. — Srini V, Feb 20 '18 at 09:59
I'm not sure how much inaccuracy you get with the SK, but it sounds like you want to sample based on the NK. You could try using one of the hash functions from the hashes plugin against the NK, then using MOD on that - worth trying all the possible hashes in case some perform better against your range of NK values than others, in terms of generating a fair 5% sample. — mc110, Feb 22 '18 at 09:09

SQL sampling Natural key vs Surrogate key

0 Answers0