
If we do not know the distribution (or size/probability) of each subpopulation (stratum), and also do not know the total population size, is it possible to do stratified sampling by reading the file only once? Thanks.

https://en.wikipedia.org/wiki/Stratified_sampling

regards, Lin

Lin Ma
  • If you know the total number of samples k you will take from all strata (but not the number in each stratum), could you do reservoir sampling with k items in each stratum (the maximum), then work out how many you need from each stratum and randomly sample each reservoir? – samgak Jun 07 '16 at 05:44
  • @samgak, smart idea. Voted up. Any thoughts on how we could improve this further by not keeping k elements from each stratum in the first pass? – Lin Ma Jun 07 '16 at 17:34

2 Answers


Assuming that each record in the file can be identified as belonging to a particular sub-population, and that you know ahead of time what size of random sample you want from that sub-population, you could hold, for each sub-population, a data structure that lets you do reservoir sampling for that sub-population (https://en.wikipedia.org/wiki/Reservoir_sampling#Algorithm_R).

So repeatedly:

  1. Read a record.

  2. Find out which sub-population it is in and get the data structure representing the reservoir sample for that sub-population, creating it if necessary.

  3. Use that data structure and the record just read to update the reservoir sample for that sub-population.

At the end you will have, for each sub-population seen, a reservoir-sampling data structure containing a random sample from that sub-population.
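
A minimal sketch of this per-stratum reservoir sampling in Python (the answer does not prescribe a language; the record iterator, the `get_stratum` function and the per-stratum sample size `k` are placeholders for whatever your data provides):

```python
import random
from collections import defaultdict

def stratified_reservoir_sample(records, get_stratum, k):
    """One pass over `records`, keeping an Algorithm-R reservoir of up to
    k items per stratum. `get_stratum` maps a record to its stratum key."""
    reservoirs = defaultdict(list)   # stratum -> list of kept records
    counts = defaultdict(int)        # stratum -> number of records seen

    for record in records:
        s = get_stratum(record)
        counts[s] += 1
        if len(reservoirs[s]) < k:
            reservoirs[s].append(record)
        else:
            # Replace an existing element with probability k / counts[s]
            j = random.randrange(counts[s])
            if j < k:
                reservoirs[s][j] = record

    return reservoirs, counts
```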

For the case when you wish to end up with k of N samples forming a stratified sample over the different classes of records, I don't think you can do much better than keeping k of each class and then downsampling from this. Suppose you can, and I give you an initial block of records organised so that the stratified sample will keep fewer than k/2 records of some class. Now I follow that block with a huge number of records, all of this class, which is now clearly under-represented. In this case the final random sample should contain much more than k/2 from this class, and (if it is really random) there should be a very small but non-zero probability that more than k/2 of those randomly chosen records came from the first block. But the fact that we never keep more than k/2 of these records from the first block means that under this sampling scheme that probability is exactly zero, so keeping fewer than k of each class won't work in the worst case.

Here is a cheat method. Suppose that instead of reading the records sequentially we can read them in any order we choose. If you look through Stack Overflow you will find (rather contrived) methods based on cryptography for generating a random permutation of N items without holding N items in memory at any one time, so you could do this. Now keep a pool of k records such that at any time the proportions of the items in the pool form a stratified sample, only adding or removing items from the pool when you are forced to do so to keep the proportions correct. I think you can do this, because you need to add an item of class X to keep the proportions correct exactly when you have just observed another item of class X.

Because you went through the records in a random order, I claim that you end up with a random stratified sample. Clearly you have a stratified sample, so the only departure from randomness can be in the items selected for a particular class. But consider the permutations which select items not of that class in the same order as the permutation actually chosen, yet select items of that class in different orders. If there is bias in the way items of that class are selected (as there probably is), the bias will affect different items of that class in different ways depending on which permutation is selected, so the result of the random choice between all of these permutations is that the total effect is unbiased.
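
A minimal illustration of the permutation trick referenced above, assuming records can be fetched by index (e.g. fixed-length records or a separate index). A keyed hash inside a small Feistel network stands in for the block cipher; the key, round count and hash choice are arbitrary assumptions, not something the linked posts prescribe:

```python
import hashlib

def feistel_permute(index, n_bits, key, rounds=4):
    """Bijectively map an index in [0, 2**n_bits) to another index in the
    same range using a balanced Feistel network keyed by `key`."""
    half = n_bits // 2
    mask = (1 << half) - 1
    left, right = index >> half, index & mask
    for r in range(rounds):
        digest = hashlib.sha256(f"{key}:{r}:{right}".encode()).digest()
        f = int.from_bytes(digest[:8], "big") & mask
        left, right = right, left ^ f
    return (left << half) | right

def pseudorandom_order(n, key):
    """Yield the indices 0..n-1 in a pseudorandom order without ever
    holding all n indices in memory: permute a power-of-two superset
    of the index range and skip values that fall outside it."""
    n_bits = max(2, (n - 1).bit_length())
    if n_bits % 2:
        n_bits += 1          # balanced Feistel wants an even bit count
    for i in range(1 << n_bits):
        j = feistel_permute(i, n_bits, key)
        if j < n:
            yield j
```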

mcdowella
  • Thanks mcdowella for the reply, voted up. I know reservoir sampling, and I think it only works (i.e. reads the file only once) if we know how many records we need to sample from each particular sub-population. If we do not know that in advance, is there a solution which reads the file only once for stratified sampling? Thanks. – Lin Ma Jun 07 '16 at 04:19
  • If you plan to make each stratified sample of a size, for instance, proportional to its standard deviation, I don't see how you can do this before you see all the samples, because the standard deviation could be increased to almost any value by outliers appearing in the last few records. If you can put an upper bound on the size of each set of samples, you could use my answer above to select samples of this size and then sample within those samples if you wish to reduce these sizes after seeing all the data. – mcdowella Jun 07 '16 at 17:49
  • Thanks mcdowella, voted up for your comments. I'm confused by what you mean by "proportional to its standard deviation"; in my understanding, stratified sampling just randomly samples inside each stratum, so it should be proportional to the stratum size rather than the stratum standard deviation? You can refer to this example: https://en.wikipedia.org/wiki/Stratified_sampling#Strata_size_calculation – Lin Ma Jun 07 '16 at 18:23
  • When I have read about stratified sampling, the sizes of the subsamples were determined long before the samples were taken, using information from previous samples or educated guesses. In this case you would be able to use reservoir sampling. However, I noticed that in the article the paragraph numbered 2, starting "Optimum allocation (or disproportionate allocation)", talks about calculating sample sizes based on standard deviation. But now I'm not sure that this makes sense unless you use a standard deviation based not on the current information but on some previous survey. – mcdowella Jun 07 '16 at 19:26
  • Thanks mcdowella, good point about the method of standard-deviation-based allocation. Actually I am asking about the first method (stratum-size-based allocation); I should have made that clear. Regarding standard-deviation-based allocation, my question is: how do you decide how many samples to allocate to each stratum according to the standard deviation value? Standard deviation 2 vs. standard deviation 1 is quite different from standard deviation 200 vs. standard deviation 100, even though both are a 2:1 ratio numerically. – Lin Ma Jun 08 '16 at 06:02
  • (cont'd) In my example above, would you always sample twice as much in the stratum which has standard deviation 2/200 vs. the stratum which has standard deviation 1/100? Thanks. – Lin Ma Jun 08 '16 at 06:03
  • For optimal allocation, the percentage of samples taken within each stratum should be proportional to the standard deviation, assuming the cost of sampling is the same everywhere. You can work this out by applying Lagrange optimization to the formula for the variance of the final estimate. See slide/page 49 of http://www4.ncsu.edu/~pollock/pdfs/ST%20432%20Stratified%20Random%20Sampling.pdf - or try a web search: I searched on "optimum stratified sample size". – mcdowella Jun 08 '16 at 17:48
  • Thanks mcdowella, voted up. I wonder whether method 1 (stratum-population-size-based sampling) or method 2 (optimum allocation) is used more often? I could be wrong, but when I see people doing stratified sampling, they almost always use method 1. Please feel free to correct me. :) – Lin Ma Jun 08 '16 at 21:53
  • I don't have practical experience of this. I was recommended "Survey Sampling" by Kish as a textbook. It suggests (section 3.5) that optimum sampling is normally only used when there are substantial differences between the population variances, because otherwise the extra complexity is not worth it. It also notes that the variance achieved near the optimum is very similar to the optimum variance, so it is possible to round to convenient proportions near the optimum or to use only a few different proportions (some of this advice predates cheap and ubiquitous computer support). – mcdowella Jun 09 '16 at 04:16
  • Thanks mcdowella, voted up. Let us come back to the original question: suppose we are doing stratum-size-based stratified sampling; how can we read the file only once if we do not know the whole file size and also do not know each stratum's size/proportion in advance? Suppose we want to sample a total of k elements from the N records in the file. The only method I have so far is to keep a fixed k elements for each stratum and then remove the unnecessary elements from each stratum once the proportions are known after the single pass (a sketch of this two-phase approach appears after these comments); your advice is appreciated. – Lin Ma Jun 09 '16 at 05:16
  • (cont'd) Please share if you have any smarter ideas that avoid keeping a fixed k elements for each stratum in the first pass over the file. Thanks. – Lin Ma Jun 09 '16 at 05:17
  • I have added to the original answer to show why I think you cannot do better than keeping k elements of each class if you read the records in sequential order, but if you are allowed to read the records from the file in a random order, you need only keep the total of k records that you need to provide as an answer. – mcdowella Jun 09 '16 at 18:29
  • Thanks mcdowella, voted up for your reply, and I appreciate your further thoughts and patience in explaining. Regarding your comment "methods based on cryptography for generating a random permutation of N items without holding N items in memory at any one time", do you mean reservoir sampling (https://en.wikipedia.org/wiki/Reservoir_sampling), or something else? Thanks. – Lin Ma Jun 09 '16 at 20:09
  • I mean questions like http://stackoverflow.com/questions/32182120/to-generate-random-permutation-of-a-array-in-on-time-and-o1-space and http://stackoverflow.com/questions/10054732/create-a-random-permutation-of-1-n-in-constant-space one of which points to http://blog.notdot.net/2007/9/Damn-Cool-Algorithms-Part-2-Secure-permutations-with-block-ciphers – mcdowella Jun 10 '16 at 04:26
  • Thanks mcdowella; I read through and thought further about your comments over the weekend, voted up for your reply. Regarding your edited paragraph "For the case when you wish to end up with k of N samples", I agree we cannot estimate how many more records of a certain class will come in future blocks of the file, but I think what we can do, as you said in the 2nd paragraph of the newly edited reply, is dynamically adjust how many records of a class to keep in memory. So I do not fully agree that we do not have a smarter – Lin Ma Jun 12 '16 at 23:29
  • (cont'd) method than keeping k records of each class, and I think there could be a solution which dynamically adjusts how many to keep in memory rather than blindly keeping k elements for each class. Do you have any further ideas on keeping fewer records in memory than k elements per class (other than the cryptography method)? Thanks. – Lin Ma Jun 12 '16 at 23:31
  • Find a flaw in my proof, or change your requirements to make it feasible to run with less memory. For instance, if you counted the distribution of classes in a small random subsample, you would get an approximate idea of the number of each class in the full set. Then use this to work out how many of each class to keep in a pass through the full data. If you add a small safety margin, then with high probability you will have kept enough of each class to subsample down to a stratified sample - and, if not, the increase in variance because your sample sizes were not quite right would be small. – mcdowella Jun 13 '16 at 04:16
  • Thanks mcdowella, voted up; in theory your advice works, but consider the case where the samples are not evenly distributed (e.g. the elements of a specific class always come later in the file): how do you decide the safety margin? Any more details are appreciated. – Lin Ma Jun 13 '16 at 22:10
  • Thanks for all the help mcdowella; I'll mark your reply as the answer. – Lin Ma Jul 27 '16 at 23:18
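
Putting together the approach discussed in the comments above (keep up to k records per stratum in the single pass, then downsample each stratum once the observed proportions are known), a rough sketch of the second phase; the function names and the crude rounding are assumptions, not something settled in the thread:

```python
import random

def downsample_to_proportions(reservoirs, counts, total_k):
    """Second phase: `reservoirs` holds up to total_k records per stratum
    (e.g. from the reservoir sketch in the answer above) and `counts`
    holds how many records of each stratum were seen in the single pass.
    Allocate the final total_k records in proportion to the observed
    stratum sizes and subsample each reservoir. Rounding is handled
    crudely; a real implementation would distribute the remainder
    more carefully."""
    n_total = sum(counts.values())
    final = {}
    for stratum, kept in reservoirs.items():
        share = counts[stratum] / n_total           # proportional allocation
        want = min(len(kept), max(1, round(total_k * share)))
        final[stratum] = random.sample(kept, want)
    return final
```

For optimum (Neyman) allocation as discussed in the comments, `share` would instead be proportional to the stratum count times the stratum's standard deviation (normalised over all strata), which would require tracking a running variance per stratum during the pass.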

Sampling in a single pass is simple, if you are able to keep the results in memory. It consists of two parts:

  1. Calculate the odds of the new item being part of the result set, and use a random number to determine if the item should be part of the result or not.
  2. If the item is to be kept, determine whether it should be added to the set or replace an existing member. If it should replace an existing member, use a random number to determine which existing member it should replace. Depending on how you calculate your random numbers, this can be the same one as the previous step or it can be a new one.

For stratified sampling, the only modification required to this algorithm is to determine which stratum the item belongs to. The result lists for each stratum should be kept separate.
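
A minimal sketch of those two decisions applied per stratum, in Python (the record source, `get_stratum` and `k` are placeholders; the answer does not name a language). It is the same reservoir scheme as in the other answer, except that the keep and replace decisions use two separate random numbers, which is one of the options mentioned above:

```python
import random
from collections import defaultdict

def single_pass_stratified_sample(records, get_stratum, k):
    """Keep up to k records per stratum in one pass:
    (1) decide whether the new item enters the result set,
    (2) if the set is full, pick which existing member it replaces."""
    results = defaultdict(list)
    seen = defaultdict(int)

    for record in records:
        s = get_stratum(record)
        seen[s] += 1
        if len(results[s]) < k:
            results[s].append(record)        # set not yet full: always keep
        elif random.random() < k / seen[s]:  # decision 1: keep this item?
            victim = random.randrange(k)     # decision 2: which one to replace?
            results[s][victim] = record

    return results
```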

Mark Ransom
  • Thanks Mark, voted up for your reply. I think your method works when we do random sampling uniformly across the whole file's records. For stratified sampling, if we do not know the size of each stratum, how do you decide (using the odds method you mentioned) whether we should keep a record or not for a specific stratum? If you read the comments from @samgak, you can find a method which always samples k per stratum (over-sampling), and after reading the file once and knowing each stratum's size, removes the over-sampled ones in each stratum. – Lin Ma Jun 08 '16 at 06:07
  • (cont'd) Actually I am asking whether we can improve on that by not keeping k elements from each stratum in the first pass (suppose k is the total number of samples we need across all strata). If I have misunderstood your described method, please feel free to correct me. :) – Lin Ma Jun 08 '16 at 06:08