0

I am experiencing something weird.

I have a big ArrayList of Long numbers. It contains about 200k numbers in ascending order. These numbers are always distinct; they are not necessarily consecutive, but some groups of them usually are.

I want to extract a sorted sample of 5k from this list, so basically this is my approach:

  • I call java.util.Collections.shuffle(list);
  • I extract the first 5k elements from the now shuffled list
  • I sort the extracted elements in ascending order

My result is somewhat weird, though. Many of my extracted random Longs seem suspiciously close to each other, if not even consecutive. For instance, I got:

...
38414931,
38414932,
38414935,
38414937,
38414938,
38414939,
38414941,
...

This does not definitely look random :/

There is an even stranger thing. While debugging this, I tried to write into files both the initial list and the extracted sample in order to compare them. If I do this, my problem seems to disappear, and my extracted Longs look like proper random numbers.

I have repeated this many times, of course, and every time I did I experienced these two behaviours.

Am I missing something?

EDIT: Here is the code I am using:

List<Long> allNumbers = <getting my list>;

---> if here I write allNumbers into a file, it seems to work fine

Collections.shuffle(allNumbers);
HashSet<Long> randomNumbers = new HashSet<>();
for (int i = 0; i < 5000; i++) {
   randomNumbers.add(allNumbers.get(i));
}
Andrea Rossi
  • 981
  • 1
  • 10
  • 23
  • Point 3: *I sort the extracted elements in ascending order*. Don't do that. – Elliott Frisch Mar 07 '18 at 14:27
  • 1
    Why? I need my sample to be ordered in ascending order. This should not affect the fact that it contains numbers extracted randomly from the source list. – Andrea Rossi Mar 07 '18 at 14:27
  • 4
    please post a [mcve] – Juan Carlos Mendoza Mar 07 '18 at 14:28
  • 3
    Sounds like a real [Heisenbug](https://en.wikipedia.org/wiki/Heisenbug) (but more likely misinterpretation). Without your code, there is no way that people can answer your question. – Erwin Bolwidt Mar 07 '18 at 14:33
  • 1
    what happens if you use int distance = 10_000 or Random rnd as source of Randomness in the shuffle method? – hovanessyan Mar 07 '18 at 14:38
  • How do you produce the random numbers? Do you produce an even distribution? – user unknown Mar 07 '18 at 14:44
  • 2
    @AndreaRossi The code you have posted looks ok - so the problem is most likely in the code you have NOT posted. You need to create a [mcve]. – assylias Mar 07 '18 at 14:46
  • Hi @JuanCarlosMendoza. The whole problem is sampling a big list, so I don't think I can provide a minimal example (with a small list the problem would probably be invisible). (Also, my data are protected by NDA so even if there was a small list of numbers that showed the problem effectively I am not sure I could share it...). I'm really sorry about that, I know it makes everything much harder :( – Andrea Rossi Mar 07 '18 at 14:49
  • 3
    If you haven't performed any tests for randomness, statements like "This does not definitely look random" make no sense. For more details you can check https://softwareengineering.stackexchange.com/questions/147134/how-should-i-test-randomness – TheJavaGuy-Ivan Milosavljević Mar 07 '18 at 15:02
  • 1
    Thank you @FedericoPeraltaSchaffner! I am now using this approach and it works very well :) – Andrea Rossi Mar 07 '18 at 15:56
  • 1
    @FedericoPeraltaSchaffner That works, but because of the [Birthday Problem](https://stackoverflow.com/questions/29009067/treeset-not-adding-all-elements/29023849#29023849) it will almost certainly result in duplicates. Not sure if that's ok for the OP. If it isn't, then I'd suggest using the algorithm described in [this answer](https://stackoverflow.com/questions/28651908/perform-operation-on-n-random-distinct-elements-from-collection-using-streams-ap/28655112#28655112) to choose 5k values from the 200k inputs. – Stuart Marks Mar 07 '18 at 23:03
  • 1
    @StuartMarks will study that problem, thank you very much for your kind feedback – fps Mar 07 '18 at 23:21

1 Answers1

1

Here is a Minimal, Complete, and Verifiable example for you, that outputs some random, increasing numbers, as you expect. Note that my code is the same as yours, except for the input part. So either your problem is in the code you haven't shown yet or the output is fine even if there are sequences of consecutive numbers, which you would expect even with a random distribution.

public static void main(String[] args) {
  List<Long> allNumbers = new ArrayList<>();
  for (long i = 0; i < 2_000; i++) allNumbers.add(i);

  Collections.shuffle(allNumbers);
  Set<Long> randomNumbers = new HashSet<>();

  for (int i = 0; i < 50; i++) randomNumbers.add(allNumbers.get(i));

  randomNumbers.stream().sorted().forEach(n -> System.out.print(n + " "));
}

Example output:

30 149 233 255 301 357 361 391 412 413 423 480 481 ...

assylias
  • 321,522
  • 82
  • 660
  • 783