Code to choose 5,000 numbers out of 10,000 randomly

Question

I would like some help writing an awk script that randomly chooses 5,000 records out of 10,000.

Consider creating an array of 10,000 entries — 0..9999. Then use a Fisher-Yates shuffle to randomize the array. Then use the first 5,000 entries in the array. There are questions about Fisher-Yates shuffles on SO — and even if they're not using Awk, they shouldn't be hard to translate to Awk. (For example: [Shuffle array in C](https://stackoverflow.com/q/06127503) has pointers to useful references.) — Jonathan Leffler, Apr 30 '19 at 23:00
See also [Is this C implementation of the Fisher-Yates shuffle correct?](https://stackoverflow.com/questions/3343797/is-this-c-implementation-of-fisher-yates-shuffle-correct) — Jonathan Leffler, Apr 30 '19 at 23:29
Or using perl: `perl -MList::Util=shuffle -ne 'push @lines, $_; END { print((shuffle @lines)[0..4999]) }' input.txt` — Shawn, Apr 30 '19 at 23:53
And what if you have multiline records? `shuf` will not help you there. — kvantour, May 02 '19 at 15:05
Closely related: https://stackoverflow.com/questions/49978071 — kvantour, May 02 '19 at 15:07

Paul Hodges · Answer 1 · 2019-05-02T15:11:53.633

score 2

`sort` has a randomizer (the `-R` option).

Assuming an input filename of 10k,

sort -R 10k | head -n 5000 > 5k   # write selections to "5k"
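
If GNU coreutils is available, `shuf` (mentioned in the comments above) can draw the sample directly. This is a minimal sketch rather than part of the original answer: `sample_lines` is a hypothetical helper name, and the input is generated with `seq` instead of read from the `10k` file. One reason to consider it is that GNU `sort -R` orders lines by a random hash of their keys, so identical lines end up next to each other and the first 5,000 lines are not a uniform sample when the input contains duplicates.

```shell
# Pick N distinct input lines at random with GNU shuf (assumed to be installed).
# Usage: sample_lines N < input
sample_lines() {
    shuf -n "$1"
}

# Draw 5000 of the numbers 1..10000 and count how many lines were selected.
seq 1 10000 | sample_lines 5000 | wc -l
```

As the comments note, this only helps when each record is a single line.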

edited May 02 '19 at 15:11

answered May 01 '19 at 14:16

Paul Hodges

score 0 · Answer 2 · answered May 02 '19 at 15:17

The following method works for single-line records, multi-line records, and records delimited by a specific record separator.

Define a script subset.awk:

# Uniform(m) :: returns a random integer such that
#    1 <= Uniform(m) <= m
function Uniform(m) { return 1+int(m * rand()) }

# KnuthShuffle(m) :: stores a uniformly random permutation of 1..m
#    in the global array permutation[1..m]
function KnuthShuffle(m,   i,j,k) {
    for (i = 1; i <= m; i++) { permutation[i] = i }
    # walk from the end; swap slot i with a random slot j, 1 <= j <= i
    for (i = m; i >= 2; i--) {
        j = Uniform(i)
        k = permutation[i]
        permutation[i] = permutation[j]
        permutation[j] = k
    }
}

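BEGIN { srand() }       # seed the generator from the current time
{ a[NR] = $0 }          # store every record
END {
    if (count > NR) count = NR    # never print more records than were read
    KnuthShuffle(NR); for (r = 1; r <= count; r++) print a[permutation[r]]
}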
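
When only one pass over the input is wanted, or the records should not all be held in memory, reservoir sampling is a different technique from the shuffle above: it keeps just `count` records at any time. A minimal sketch, assuming line-oriented input on standard input; `reservoir_sample` and the array name `keep` are hypothetical names, not part of the script in this answer.

```shell
# Reservoir sampling: select N records in a single pass over standard input.
# Usage: reservoir_sample N < input
reservoir_sample() {
    awk -v count="$1" '
        BEGIN { srand() }                        # seed from the current time
        NR <= count { keep[NR] = $0; next }      # fill the reservoir first
        {
            j = 1 + int(NR * rand())             # uniform integer in 1..NR
            if (j <= count) keep[j] = $0         # replace a slot with probability count/NR
        }
        END {
            n = (NR < count) ? NR : count        # input may be shorter than N
            for (i = 1; i <= n; i++) print keep[i]
        }
    '
}

# Draw 5000 of the numbers 1..10000 and count how many lines were selected.
seq 1 10000 | reservoir_sample 5000 | wc -l
```

The selected records are printed in reservoir order, not in a shuffled order, so the shuffle-based script is the better fit when a random output order matters.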

Then you can run subset.awk as:

$ awk -v count=5000 -f subset.awk inputfile > outputfile

If the records in your file are separated by a character such as @, you can do:

$ awk -v count=5000 -v RS='@' -v ORS='@' -f subset.awk inputfile > outputfile

If you want to select random paragraphs, you can do:

$ awk -v count=5000 -v RS='' -v ORS='\n\n' -f subset.awk inputfile > outputfile
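
To see what `RS` changes, it can help to count how many records awk finds in a small input. This is a rough illustration only: `count_records` is a hypothetical helper and the sample inputs are made up.

```shell
# Count the records awk sees on standard input for a given record separator.
# Usage: count_records SEPARATOR < input
count_records() {
    awk -v RS="$1" 'END { print NR }'
}

# '@' as separator: x, y and z are three records.
printf 'x@y@z' | count_records '@'

# An empty RS selects paragraph mode: blank lines separate records,
# so "a b", "c" and "d e" are three records.
printf 'a\nb\n\nc\n\nd\ne\n' | count_records ''
```

Setting `ORS` to match, as in the commands above, writes the selected records back out with the same kind of separator they were read with.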
