I would like help writing an awk script that randomly selects 5,000 records out of a file of 10,000.

2 Answers

sort has a randomizing option, -R (available in GNU sort).

Assuming an input filename of 10k,

sort -R 10k | head -n 5000 > 5k # write selections to "5k"
Paul Hodges

The following method works for single-line records, multi-line records, and records delimited by a custom record separator.

Save the following script as subset.awk:

# Uniform(m) :: returns a random integer such that
#    1 <= Uniform(m) <= m
function Uniform(m) { return 1+int(m * rand()) }

# KnuthShuffle(m) :: creates a random permutation of the range [1,m]
function KnuthShuffle(m,   i,j,k) {
    for (i = 1; i <= m; i++) { permutation[i] = i }
    # Walk backwards; swap position i with a uniformly chosen j in [1,i]
    for (i = m; i >= 2; i--) {
        j = Uniform(i)
        k = permutation[i]
        permutation[i] = permutation[j]
        permutation[j] = k
    }
}

BEGIN{ srand() }
{a[NR]=$0}
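# Cap count at NR so that asking for too many records does not print empty lines
END{ if (count > NR) count = NR; KnuthShuffle(NR); for(r = 1; r <= count; r++) print a[permutation[r]] }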

Then you can run it as:

$ awk -v count=5000 -f subset.awk inputfile > outputfile

If the records in your file are separated by a character such as @, you can do:

$ awk -v count=5000 -v RS='@' -v ORS='@' -f subset.awk inputfile > outputfile

If you want to select random paragraphs, you can do:

$ awk -v count=5000 -v RS='' -v ORS='\n\n' -f subset.awk inputfile > outputfile
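
Since only the first count entries of the permutation are printed, the shuffle can also stop after count swaps. The following is a minimal sketch of that idea, not part of the script above; the function name pick_random is hypothetical and chosen only for this illustration.

```shell
# pick_random N: hypothetical helper that prints N random records from stdin.
# It runs a Fisher-Yates shuffle but stops after the first N positions.
pick_random() {
    awk -v count="$1" '
    BEGIN { srand() }
    { a[NR] = $0 }                              # buffer every record
    END {
        if (count > NR) count = NR              # cannot pick more than exist
        for (i = 1; i <= count; i++) {
            j = i + int((NR - i + 1) * rand())  # uniform index in [i, NR]
            t = a[i]; a[i] = a[j]; a[j] = t     # move the pick into slot i
            print a[i]
        }
    }'
}

# Example: choose 5 of the numbers 1..10
seq 1 10 | pick_random 5
```

Each record is printed at most once, because a record that has been swapped into slot i is never considered again. Like the full script, this keeps the whole input in memory.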
kvantour