
I have a peculiar scenario in which the samples obtained in two consecutive runs are not consistent, even though I've provided a seed value. I'm using the following code (which was the outcome of a discussion here):

var conversionSample = sortedConversionSubset.sample(true, (sampleSize + 0.05), 3*x).limit((conversionCount * sampleSize).toInt) 

var nonConversionSample = sortedNonConversionSubset.sample(true, (sampleSize + 0.05), 3*x).limit((nonConversionCount * sampleSize).toInt) 

Here:

  1. 'sampleSize' is a constant fraction less than 0.8
  2. 'x' is an int that stays fixed within an iteration; it is the index of the current (xth) iteration of a for loop
  3. 'conversionCount' and 'nonConversionCount' are int values representing the number of rows in each subset

The observation is that in two successive runs the generated sample is different in both cases, which is not the expected behavior.

sortedConversionSubset
+--------------------------------------+----------+
|clientid                              |Conversion|
+--------------------------------------+----------+
|02438b66-2de4-4765-bae3-de7453647ea7_1|1         |
|203865ed-f02a-4ed9-9098-82691de707a4_0|1         |
|203865ed-f02a-4ed9-9098-82691de707a4_1|1         |
|674e2337-aec5-434e-b56e-8c2efcc42894_1|1         |
|6d6036d3-c161-4f5d-8557-80b85dd87bd9_0|1         |
|6d6036d3-c161-4f5d-8557-80b85dd87bd9_1|1         |
|7797aba3-3eea-4556-856e-753812b4b551_0|1         |
|7797aba3-3eea-4556-856e-753812b4b551_1|1         |
|870ab2a5-0650-42b8-9e6f-bde3859f64fd_0|1         |
|870ab2a5-0650-42b8-9e6f-bde3859f64fd_1|1         |
|9b606693-4ffa-44a5-bd7c-cc6974ce3e83_0|1         |
|be218b72-c664-40cf-adf5-e3519095e941_0|1         |
|e7dc7fd9-32df-46a1-b3bd-793bbda09f6f_0|1         |
|eaf434da-6a8f-4ab0-a744-62bea663ed5e_0|1         |
|eaf434da-6a8f-4ab0-a744-62bea663ed5e_1|1         |
+--------------------------------------+----------+


sortedNonConversionSubset
+--------------------------------------+----------+
|clientid                              |Conversion|
+--------------------------------------+----------+
|03358d8f-9b9c-4258-9c99-234ab102c29b_1|0         |
|040d213c-e91a-42f4-9bf7-90671670dc17_0|0         |
|04fe5148-1c56-4c88-aed0-1f01220bffd6_0|0         |
|0ed2e621-9ba4-46f0-8793-a84d32538c39_0|0         |
|0f9bcf42-e7fa-49a0-9d75-6c9bbc38b4d5_0|0         |
|108c5478-abc0-44d9-968b-47f81c4f5a37_0|0         |
|129eb883-159d-49be-b8ae-9aa44a3e2919_0|0         |
|13e3d779-026b-4d12-8619-aa5fe6ca99ed_0|0         |
|14497295-eebd-44aa-9f26-fc5e4810fb54_0|0         |
|1855d96d-3647-4c4f-a20f-7e46f7635798_0|0         |
|1911caf0-a470-4898-9b62-57c604422727_0|0         |
|1b91b8dc-09b8-47e2-b892-f5c14b650019_0|0         |
|1dfa820c-77e0-4927-8a39-ecd8e842b09b_0|0         |
|1e48e346-4ada-4a8d-896b-7658cc2499cd_0|0         |
|252be902-4204-40a5-9d3c-dd3a7d0f0355_0|0         |
|2995b49d-525b-43e9-ab36-8b8910a4607c_0|0         |
|2bc06b59-4624-4ddd-87a3-ed04cba88233_0|0         |
|2d4538a5-20e6-4742-ae46-aad0a5ed3fff_0|0         |
|31563716-9380-4662-90e5-7f63a1ab9072_0|0         |
|34442a3e-0437-4c41-86fb-1ac55062993a_0|0         |
|35151629-2f86-4917-90d2-42daa5ae4f5c_0|0         |
|3c37e066-dff5-4bd9-84ab-b9e73f3f3fdd_0|0         |
|3e998096-3a4b-4b57-a1de-69d2dbd19abd_0|0         |
|3f8ace3c-d378-4423-97a0-3d9cf35ba256_0|0         |
|49a0cfb8-490f-4252-84fa-2b9e250e9333_0|0         |
|4c3f11fa-e3ba-4eb1-977a-06f034bf8a54_0|0         |
|4ee484f4-e877-44c3-9390-c4e4072c5dee_0|0         |
|4fa035b3-dcd5-40e1-9107-0a0c943ff597_1|0         |
|529704d2-5a60-4718-a03f-639e040f6634_0|0         |
|560f6978-028b-4a37-9f97-d97e93976bf7_0|0         |
|57b47c74-b071-4278-89c9-f7b4cb1225d1_0|0         |
|58305773-f944-4039-8452-f5eb8d62f0cf_0|0         |
|58dfa9dd-43cf-4eb7-ade6-7235004a9815_0|0         |
|5b146218-9bb6-46f0-8c83-df131d78f591_0|0         |
|5ca3b5bc-35a9-42a5-bd37-a8fc94366dc6_0|0         |
|5d5f2ea0-aed9-4c2d-8c22-68859ec35e8e_0|0         |
|5f9ebf92-3b1b-4628-b949-44a32e6d3659_0|0         |
|64822b8c-009e-48ab-b6ca-1a7ece1106fa_0|0         |
|6b352714-af74-4773-854b-073e644e8684_0|0         |
|6e528e49-472e-48c7-baa9-edc25303e427_0|0         |
|73203f58-8be2-4716-b8f0-79c64400c57b_0|0         |
|741630e0-1c99-497d-a127-5c4c562952c5_0|0         |
|778e3b8a-2ca5-469a-9697-f646962e8308_0|0         |
|8029c542-d933-43fb-b359-f2438dcd5660_0|0         |
|8b06ba24-2af3-4eec-811a-4d1779f37876_0|0         |
|8fb43dff-260d-4ece-85e2-3bc2cb636ac1_0|0         |
|90f8a4cb-1956-43c4-ac7d-8c6514cd023a_0|0         |
|916f2e2a-6135-4004-8d54-d80b822ce394_0|0         |
|968a7ca3-1649-4586-9e60-b7e8565e708a_0|0         |
|a32782cc-8c4c-403b-aa83-09f1cec45fdb_0|0         |
|a63f44d5-a4d5-45a0-8a4b-cebf05df810b_0|0         |
|a6f958bc-e050-4216-b981-d51f1c0ff60d_0|0         |
|a7dba1bb-d7ff-44e6-9c4c-997ae59a2337_1|0         |
|ac33d675-d9cc-43b5-94fb-7d412773db14_0|0         |
|b1227816-9bf2-474f-8e82-5739acf6c895_0|0         |
|b1c27a2e-6efc-4869-880b-9ce0a4962edc_0|0         |
|b4ff6d43-cf0a-4f1d-9431-1edcb8ee1fb6_0|0         |
|b9e477ab-2065-42bb-832b-5d0e98ee05c7_0|0         |
|ba8c4efe-e71c-468c-b1bf-37efff596907_0|0         |
|c21eefc8-43d0-4be0-a252-b9fc4dbb7ad0_0|0         |
|c3785311-87c8-43bc-99a8-01d64f5eaa87_0|0         |
|c543bde7-deb8-4484-b0be-353c44baf6eb_1|0         |
|ca31e550-9d28-4628-bfe8-53648a2007f7_0|0         |
|cbc33697-20cb-4f8b-accd-0a6396a4ea41_0|0         |
|cc7810aa-08fc-44e7-acdc-ac948a28f9b9_0|0         |
|d1efdc7c-afb0-4995-bbbd-a76f731d2492_0|0         |
|d6a4b928-e576-41d7-9628-18709765199d_0|0         |
|d7311ec7-6c50-448d-8a6e-f690c3070d57_1|0         |
|d86b09f9-70a0-4101-a13b-129fe3a37b86_0|0         |
|d911be5b-aceb-45c8-a79e-73ccfa1b96f0_0|0         |
|db0c7b10-80f7-4071-aa53-fe0e2dc5ebce_0|0         |
|dce14c51-fa57-4e98-987d-708e2a9aa293_0|0         |
|dd026fb8-f818-4d1e-aaa4-4c9b3fd24994_0|0         |
|dfa9c55c-1e75-4010-be86-a6b1eb723672_0|0         |
|ea29f600-9e85-40f4-9f88-dcef46beb0c1_0|0         |
|eb5e58fc-eaac-4059-8ebc-1fab1ccf3555_1|0         |
|eb7568ab-83ac-45a7-bf4b-3b048d6c7c53_0|0         |
|f5b1cfc4-e397-4699-adab-0af6ee0e1b76_0|0         |
|facbfc8c-d477-4b27-bf15-52a56c26cbf6_0|0         |
|ffd03bca-ef40-4fa4-913e-73c002f29796_0|0         |
+--------------------------------------+----------+

1st Run Sample
+--------------------------------------+----------+
|clientid                              |Conversion|
+--------------------------------------+----------+
|203865ed-f02a-4ed9-9098-82691de707a4_1|1         |
|6d6036d3-c161-4f5d-8557-80b85dd87bd9_0|1         |
|6d6036d3-c161-4f5d-8557-80b85dd87bd9_1|1         |
|02438b66-2de4-4765-bae3-de7453647ea7_1|1         |
|7797aba3-3eea-4556-856e-753812b4b551_0|1         |
|870ab2a5-0650-42b8-9e6f-bde3859f64fd_0|1         |
|1dfa820c-77e0-4927-8a39-ecd8e842b09b_0|0         |
|252be902-4204-40a5-9d3c-dd3a7d0f0355_0|0         |
|2995b49d-525b-43e9-ab36-8b8910a4607c_0|0         |
|2bc06b59-4624-4ddd-87a3-ed04cba88233_0|0         |
|31563716-9380-4662-90e5-7f63a1ab9072_0|0         |
|5ca3b5bc-35a9-42a5-bd37-a8fc94366dc6_0|0         |
|5d5f2ea0-aed9-4c2d-8c22-68859ec35e8e_0|0         |
|5f9ebf92-3b1b-4628-b949-44a32e6d3659_0|0         |
|5f9ebf92-3b1b-4628-b949-44a32e6d3659_0|0         |
|5f9ebf92-3b1b-4628-b949-44a32e6d3659_0|0         |
|6b352714-af74-4773-854b-073e644e8684_0|0         |
|6e528e49-472e-48c7-baa9-edc25303e427_0|0         |
|6e528e49-472e-48c7-baa9-edc25303e427_0|0         |
|741630e0-1c99-497d-a127-5c4c562952c5_0|0         |
|03358d8f-9b9c-4258-9c99-234ab102c29b_1|0         |
|040d213c-e91a-42f4-9bf7-90671670dc17_0|0         |
|040d213c-e91a-42f4-9bf7-90671670dc17_0|0         |
|04fe5148-1c56-4c88-aed0-1f01220bffd6_0|0         |
|129eb883-159d-49be-b8ae-9aa44a3e2919_0|0         |
|1855d96d-3647-4c4f-a20f-7e46f7635798_0|0         |
|3c37e066-dff5-4bd9-84ab-b9e73f3f3fdd_0|0         |
|3e998096-3a4b-4b57-a1de-69d2dbd19abd_0|0         |
|3f8ace3c-d378-4423-97a0-3d9cf35ba256_0|0         |
|49a0cfb8-490f-4252-84fa-2b9e250e9333_0|0         |
|4fa035b3-dcd5-40e1-9107-0a0c943ff597_1|0         |
|4fa035b3-dcd5-40e1-9107-0a0c943ff597_1|0         |
|529704d2-5a60-4718-a03f-639e040f6634_0|0         |
|560f6978-028b-4a37-9f97-d97e93976bf7_0|0         |
|560f6978-028b-4a37-9f97-d97e93976bf7_0|0         |
|560f6978-028b-4a37-9f97-d97e93976bf7_0|0         |
|778e3b8a-2ca5-469a-9697-f646962e8308_0|0         |
|8b06ba24-2af3-4eec-811a-4d1779f37876_0|0         |
+--------------------------------------+----------+

2nd Run Sample
+--------------------------------------+----------+
|clientid                              |Conversion|
+--------------------------------------+----------+
|02438b66-2de4-4765-bae3-de7453647ea7_1|1         |
|7797aba3-3eea-4556-856e-753812b4b551_0|1         |
|870ab2a5-0650-42b8-9e6f-bde3859f64fd_0|1         |
|870ab2a5-0650-42b8-9e6f-bde3859f64fd_1|1         |
|be218b72-c664-40cf-adf5-e3519095e941_0|1         |
|be218b72-c664-40cf-adf5-e3519095e941_0|1         |
|1dfa820c-77e0-4927-8a39-ecd8e842b09b_0|0         |
|252be902-4204-40a5-9d3c-dd3a7d0f0355_0|0         |
|2995b49d-525b-43e9-ab36-8b8910a4607c_0|0         |
|2bc06b59-4624-4ddd-87a3-ed04cba88233_0|0         |
|31563716-9380-4662-90e5-7f63a1ab9072_0|0         |
|5ca3b5bc-35a9-42a5-bd37-a8fc94366dc6_0|0         |
|5d5f2ea0-aed9-4c2d-8c22-68859ec35e8e_0|0         |
|5f9ebf92-3b1b-4628-b949-44a32e6d3659_0|0         |
|5f9ebf92-3b1b-4628-b949-44a32e6d3659_0|0         |
|5f9ebf92-3b1b-4628-b949-44a32e6d3659_0|0         |
|6b352714-af74-4773-854b-073e644e8684_0|0         |
|6e528e49-472e-48c7-baa9-edc25303e427_0|0         |
|6e528e49-472e-48c7-baa9-edc25303e427_0|0         |
|741630e0-1c99-497d-a127-5c4c562952c5_0|0         |
|03358d8f-9b9c-4258-9c99-234ab102c29b_1|0         |
|040d213c-e91a-42f4-9bf7-90671670dc17_0|0         |
|040d213c-e91a-42f4-9bf7-90671670dc17_0|0         |
|04fe5148-1c56-4c88-aed0-1f01220bffd6_0|0         |
|129eb883-159d-49be-b8ae-9aa44a3e2919_0|0         |
|1855d96d-3647-4c4f-a20f-7e46f7635798_0|0         |
|3c37e066-dff5-4bd9-84ab-b9e73f3f3fdd_0|0         |
|3e998096-3a4b-4b57-a1de-69d2dbd19abd_0|0         |
|3f8ace3c-d378-4423-97a0-3d9cf35ba256_0|0         |
|49a0cfb8-490f-4252-84fa-2b9e250e9333_0|0         |
|4fa035b3-dcd5-40e1-9107-0a0c943ff597_1|0         |
|4fa035b3-dcd5-40e1-9107-0a0c943ff597_1|0         |
|529704d2-5a60-4718-a03f-639e040f6634_0|0         |
|560f6978-028b-4a37-9f97-d97e93976bf7_0|0         |
|560f6978-028b-4a37-9f97-d97e93976bf7_0|0         |
|560f6978-028b-4a37-9f97-d97e93976bf7_0|0         |
|778e3b8a-2ca5-469a-9697-f646962e8308_0|0         |
|8b06ba24-2af3-4eec-811a-4d1779f37876_0|0         |
+--------------------------------------+----------+

The two samples being different could be a blocker for me, so I want to check how I can make them consistent.
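
One way to reason about this kind of mismatch: a seed fixes the sequence of random draws, not the rows those draws are applied to. The following is a minimal plain-Scala sketch of that idea, not Spark's implementation. `SeedOrderDemo` and `sampleByPosition` are hypothetical names, and the sampler is a simplified without-replacement version, whereas the code above samples with replacement.

```scala
import scala.util.Random

object SeedOrderDemo {
  // Simplified positional sampling: the i-th row is kept when the i-th
  // random draw falls below `fraction`. The seed fixes the draws only.
  def sampleByPosition[A](rows: Seq[A], fraction: Double, seed: Long): Seq[A] = {
    val rng = new Random(seed)
    rows.filter(_ => rng.nextDouble() < fraction)
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq("a", "b", "c", "d", "e", "f", "g", "h")

    // Same rows, same order, same seed: the two samples are identical.
    val first  = sampleByPosition(rows, 0.5, 42L)
    val second = sampleByPosition(rows, 0.5, 42L)
    println(s"identical input gives identical sample: ${first == second}")

    // Same rows and same seed, but a different input order: the same
    // positions are kept, so different rows can end up in the sample.
    println(s"original order: $first")
    println(s"reversed order: ${sampleByPosition(rows.reverse, 0.5, 42L)}")
  }
}
```

If the input to `sample` arrives in a different order on each run, a fixed seed alone may therefore not be enough to reproduce the same rows.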

Neil Lunn
hbabbar
  • In general Spark doesn't provide stable sorting, so if you sort by a column which is not unique, the values you sample may be different in each run. – zero323 Feb 27 '17 at 04:39
  • @zero323 As per http://stackoverflow.com/questions/32229941/how-do-simple-random-sampling-and-dataframe-sample-function-work-in-apache-spark, I feel there is a mismatch in how the sampling should behave with a seed. As per your comment here, is it safe to assume that we cannot ensure that the exact same sample is picked up even after providing a seed value? – hbabbar Feb 27 '17 at 09:13
  • If the upstream structure has a non-deterministic order then you simply don't sample the same structure. You can try to confirm that by using a `DataFrame` which has an explicit order (like a parallelized local collection). – zero323 Feb 27 '17 at 14:27
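
Following the comments above, one thing that could be tried is to pin down both the partitioning and the row order before sampling. The sketch below is a suggestion under stated assumptions, not a confirmed fix: it assumes Spark 2.x or later (`SparkSession`), a column with unique values to sort on (here `clientid`, which looks unique in the data shown), and data small enough to fit in a single partition. `DeterministicSample`, `stableSample` and the toy rows are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object DeterministicSample {
  // Hypothetical helper: force one partition with a total order on a
  // unique key, so the seeded sampler sees the rows in the same order
  // on every run. Only reasonable for small inputs.
  def stableSample(df: DataFrame, uniqueKey: String, fraction: Double,
                   seed: Long, maxRows: Int): DataFrame =
    df.repartition(1)
      .sortWithinPartitions(uniqueKey)
      .sample(true, fraction, seed) // withReplacement = true, as in the question
      .limit(maxRows)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("deterministic-sample")
      .getOrCreate()
    import spark.implicits._

    // Toy stand-in for sortedConversionSubset (made-up ids).
    val df = Seq(("id_3", 1), ("id_1", 1), ("id_2", 1), ("id_5", 1), ("id_4", 1))
      .toDF("clientid", "Conversion")

    // Compare two independent evaluations of the same sampling plan.
    val first  = stableSample(df, "clientid", 0.45, 3L, 2).collect().toSeq
    val second = stableSample(df, "clientid", 0.45, 3L, 2).collect().toSeq
    println(s"samples match: ${first == second}")

    spark.stop()
  }
}
```

Collapsing to one partition trades parallelism for a fixed row order, so for larger data a different approach (for example, persisting the sampled result once and reusing it) may be more appropriate.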

0 Answers