I have a table with the natural key with VARCHAR
datatype. The incoming feed could be a VARCHAR
or BIGINT
natural key but stored as VARCHAR
datatype. We generate the surrogate keys using the natural keys. The incoming daily feed has gaps and islands on the natural keys but eventually filled over the course of catchup. However, the surrogates generated on top of the natural keys during the loading process doesn't guarantee the same ordinality due to intermittent gaps in the numbers.
Now, I need to include a sampling mechanism on the dataset to load only 5% of the volume and cascading it to the daily catchups as well. Since the natural key is a VARCHAR column I cannot provide a sampling using MOD function. But, if I use the Surrogate key column for the sampling, then over the period of time, the dataset grows beyond the expected 5%.
For example,
Day NK SK
1 1 1
1 3 2
2 2 3
3 4 4
3 5 5
3 7 6
4 6 7
. . .
. . .
. . .
Is there a cleaner way to circumvent this?