I would like some help writing an awk script that randomly chooses 5,000 records out of 10,000.
- With or without repeats? What have you tried? – Jonathan Leffler Apr 30 '19 at 22:58
- Consider creating an array of 10,000 entries (0..9999). Then use a Fisher-Yates shuffle to randomize the array, and use the first 5,000 entries in the array. There are questions about Fisher-Yates shuffles on SO, and even if they're not using Awk, they shouldn't be hard to translate to Awk. (For example: [Shuffle array in C](https://stackoverflow.com/q/06127503) has pointers to useful references.) – Jonathan Leffler Apr 30 '19 at 23:00
- Without repeats – Danilo Renato De Assis Apr 30 '19 at 23:17
- See also [Is this C implementation of the Fisher-Yates shuffle correct?](https://stackoverflow.com/questions/3343797/is-this-c-implementation-of-fisher-yates-shuffle-correct) – Jonathan Leffler Apr 30 '19 at 23:29
- Why use awk? `shuf -n 5000 input.txt`. – Shawn Apr 30 '19 at 23:42
- Or using perl: `perl -MList::Util=shuffle -ne 'push @lines, $_; END { print((shuffle @lines)[0..4999]) }' input.txt` – Shawn Apr 30 '19 at 23:53
- Thanks. This solved my problem using shuffle. – Danilo Renato De Assis May 01 '19 at 01:54
- And what if you have multiline records? `shuf` will not help you there. – kvantour May 02 '19 at 15:05
- Closely related: https://stackoverflow.com/questions/49978071 – kvantour May 02 '19 at 15:07
2 Answers
`sort` has a randomizer.
Assuming an input file named `10k`:

    sort -R 10k | head -n 5000 > 5k    # write selections to "5k"
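
One caveat, as far as the GNU coreutils documentation describes it: `sort -R` sorts by a random hash of the keys, so identical lines end up next to each other instead of being shuffled independently. If the input may contain duplicate lines, a decorate-sort-undecorate pipeline is one way around that. The following is a minimal sketch and not part of the original answer; the function name `random_pick` is made up for illustration.

```shell
# random_pick N [FILE...]   (hypothetical helper name, not a standard tool)
# Prefix every line with a random integer key, sort numerically on that
# key, strip the key again, and keep the first N lines.
random_pick() {
    n=$1
    shift
    awk 'BEGIN { srand() } { printf "%d\t%s\n", int(rand() * 1000000000), $0 }' "$@" |
        sort -k1,1n |
        cut -f2- |
        head -n "$n"
}
```

It could be used as `random_pick 5000 10k > 5k`. Since `srand()` without an argument seeds from the clock, two runs within the same second will likely produce the same selection.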

Paul Hodges
The following method works for single-line as well as multi-line records, and for records with a specific record separator.
Define a script `subset.awk`:

    # Uniform(m) :: returns a random integer such that
    #   1 <= Uniform(m) <= m
    function Uniform(m) { return 1 + int(m * rand()) }

    # KnuthShuffle(m) :: fills the global array permutation[] with a
    #   uniformly random permutation of the range [1,m]
    function KnuthShuffle(m,   i, j, k) {
        for (i = 1; i <= m; i++) { permutation[i] = i }
        # Fisher-Yates: swap position i with a random position j in [1,i]
        for (i = m; i >= 2; i--) {
            j = Uniform(i)
            k = permutation[i]
            permutation[i] = permutation[j]
            permutation[j] = k
        }
    }

    BEGIN { srand() }
    { a[NR] = $0 }    # store every record
    END {
        KnuthShuffle(NR)
        if (count > NR) count = NR    # cannot select more records than exist
        for (r = 1; r <= count; r++) print a[permutation[r]]
    }

Then you can run it as:

    $ awk -v count=5000 -f subset.awk inputfile > outputfile

Or, if you have a file where the record separator is a character such as `@`, you can do:

    $ awk -v count=5000 -v RS='@' -v ORS='@' -f subset.awk inputfile > outputfile

If you want to select random paragraphs, you can do:

    $ awk -v count=5000 -v RS='' -v ORS='\n\n' -f subset.awk inputfile > outputfile
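
The script stores every record in memory before shuffling. If that is a concern for larger inputs, reservoir sampling (Algorithm R) is an alternative technique that reads the input once and holds only `count` records at a time. The following is a rough sketch of that idea, not the method used in this answer; the function name `reservoir_sample` is hypothetical. It selects a uniform random subset but does not shuffle the output order.

```shell
# reservoir_sample COUNT [FILE...]   (hypothetical helper name)
# One-pass selection of COUNT records without repeats (Algorithm R).
reservoir_sample() {
    count=$1
    shift
    awk -v count="$count" '
        BEGIN { srand() }
        # The first "count" records fill the reservoir.
        NR <= count { keep[NR] = $0; next }
        # Each later record replaces a random slot with probability count/NR.
        {
            j = 1 + int(NR * rand())
            if (j <= count) keep[j] = $0
        }
        END {
            n = (NR < count) ? NR : count
            for (i = 1; i <= n; i++) print keep[i]
        }
    ' "$@"
}
```

For example: `reservoir_sample 5000 inputfile > outputfile`. Custom record separators would have to be passed to the inner `awk` call in the same way as `count`.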

kvantour