I have a PROC SURVEYSELECT example with four groups of five IDs each. I want to draw a random sample in which no ID is selected in more than one stratum (i.e. group). How can I accomplish this? Note that IDs 1 and 2 appear in every group, while the other three IDs in each group are unique to that group.

Example code:

data survey;
input group $ id;
datalines;
a 1
a 2
a 3
a 4
a 5
b 1
b 2
b 6
b 7
b 8
c 1
c 2
c 9
c 10
c 11
d 1
d 2
d 12
d 13
d 14
;


proc surveyselect data=survey
    method=srs n=3
    out=MyStratExample;
  strata group;
run;

proc print data=MyStratExample;
run;

Current output:

group id SelectionProb SamplingWeight
a   1   0.6 1.6666666667
a   3   0.6 1.6666666667
a   4   0.6 1.6666666667
b   1   0.6 1.6666666667
b   2   0.6 1.6666666667
b   7   0.6 1.6666666667
c   1   0.6 1.6666666667
c   2   0.6 1.6666666667
c   11  0.6 1.6666666667
d   1   0.6 1.6666666667
d   2   0.6 1.6666666667
d   13  0.6 1.6666666667

As the output shows, SAS selects the same ID in more than one group: ID 1 is sampled in all four groups and ID 2 in three of them.
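The overlap can also be checked programmatically rather than by eye. The query below is a minimal sketch that assumes the output data set is still named MyStratExample, as in the code above; the column alias n_groups is made up for illustration.

```sas
/* List every id that was selected in more than one group.        */
/* n_groups is an illustrative alias, not a SURVEYSELECT variable. */
proc sql;
  select id, count(*) as n_groups
  from MyStratExample
  group by id
  having count(*) > 1;
quit;
```

For the sample listed above, this should report id 1 with 4 groups and id 2 with 3 groups. An empty result would mean the strata do not overlap.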

  • SAS sees the strata as separate groups, and isn't aware that you consider 'a 1' and 'b 1' to be the same record. Stratified sampling means it randomly selects from each stratum (`group` in your data), and it treats every record in that group as eligible for selection. If you consider `a 1` and `b 1` to be the same record, then you're probably going to have to roll your own sampling here; that's just not how SAS thinks of things (or how I'd think of things, either). I think that's too large of a topic for this site. – Joe May 23 '18 at 14:38
  • @Joe Is asking how to take multiple samples from a population without repeating a sampled unit really that large of a topic? I didn't know; is there another part of this forum for that? I would have thought this would be relatively easy. I know how it could be done with a bit of programming, but I assumed my approach was too complicated and that there had to be an easier way. – DukeLuke May 23 '18 at 21:07
  • It's that it would take a lot of work to do. Taking multiple samples is not a problem - you're asking for a very specific implementation that, as far as I know, is not implemented in the software directly. If someone knows a way to do it directly then they're welcome to answer, but I think you're asking for a significant amount of code and explanation. – Joe May 23 '18 at 21:31
  • Okay, that's a good enough answer actually. I was thinking of using a loop: take an initial sample, check for repeated IDs, take another sample with an adjusted size (n=), and so on until there are enough unique IDs to meet the original n= requirement. – DukeLuke May 23 '18 at 21:40
  • Yep, basically you'd have to do something along those lines; I did something not totally different six or so years ago, and it was irritating. It's particularly hard if you're trying to be careful about probability of selection - most of the 'easy' solutions make the probability of selection not equal. – Joe May 23 '18 at 21:55
  • This appears to be a duplicate of an existing question that I answered some time ago: https://stackoverflow.com/questions/46243850/random-sampling-without-replacement-in-longitudinal-data – user667489 May 23 '18 at 21:59
  • The solution I posted for the other question is probably one of the 'easy' ones Joe is referring to, and even that one was quite a bit of code. You could use similar techniques to implement a more sophisticated sampling algorithm, though, so it might be worth a look. – user667489 May 23 '18 at 22:04
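
One of the 'easy' approaches mentioned in the comments can be sketched as follows. It is not the solution from the linked question: it simply assigns each ID to one randomly chosen group first, and then runs the same stratified sample on the de-duplicated data. The data set names keyed and survey_unique, the variable u, and the seed value 2018 are all made up for illustration. As Joe warns, this does not give every ID the same probability of selection, and the SelectionProb that SAS reports is conditional on the random group assignment.

```sas
/* Step 1: attach a random key to every record (seed is arbitrary). */
data keyed;
  set survey;
  if _n_ = 1 then call streaminit(2018);
  u = rand('uniform');
run;

/* Step 2: for each id keep only the record with the smallest key,  */
/* i.e. one randomly chosen group per id.                           */
proc sort data=keyed;
  by id u;
run;

data survey_unique;
  set keyed;
  by id;
  if first.id;
  drop u;
run;

/* Step 3: the STRATA statement expects input sorted by group. */
proc sort data=survey_unique;
  by group;
run;

/* Step 4: same stratified SRS as before, now on unique ids.        */
/* SELECTALL takes every unit when a stratum has fewer than n left. */
proc surveyselect data=survey_unique
    method=srs n=3 seed=2018 selectall
    out=MyStratExample;
  strata group;
run;
```

With the example data every group keeps at least its three unique IDs after step 2, so each stratum can still supply n=3; on other data a stratum may come up short, which is why SELECTALL is included here.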
