Here's a wee bash function for you. It grabs, as you say, a "batch" of lines, with a random start point within a file.
randline() {
  local lines c r _
  # Cache the number of lines in this file in a symlink in the temp dir.
  lines="/tmp/${1//\//-}.lines"
  if [ -h "$lines" ] && [ "$lines" -nt "$1" ]; then
    # The cached count is stored as the symlink's target; recover it.
    c=$(ls -l "$lines" | sed 's/.* //')
  else
    read c _ < <(wc -l "$1")
    ln -sfn "$c" "$lines"
  fi
  # Pick a random number...
  r=$(( c * (RANDOM * 32768 + RANDOM) / (32768 * 32768) ))
  echo "start=$r" >&2
  # ...and display the last ${2:-1} of the first $r lines
  # (i.e. the lines ending at line $r).
  head -n "$r" "$1" | tail -n "${2:-1}"
}
Edit the `echo` lines as required.
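A quick usage sketch (the file name and count here are just placeholders):

# Hypothetical example: print 5 consecutive lines from a random spot
# in ./data.txt; the chosen end line is reported on stderr.
randline ./data.txt 5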
This solution has the advantage of fewer pipes, less resource-intensive pipes (i.e. no `| sort ... |`), and less platform dependence (i.e. no `sort -R`, which is GNU-sort-specific).
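For comparison, the GNU-only approaches this sidesteps would look something like the following (requires GNU coreutils):

# GNU-specific alternatives:
sort -R somefile.txt | head -n 5   # shuffle the whole file, keep 5 lines
shuf -n 5 somefile.txt             # same idea, purpose-built tool

Note that those return scattered random lines rather than a consecutive block, which matters for the update discussed further down.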
Note that this relies on Bash's `$RANDOM` variable, which may or may not actually be random. Also, it will miss lines if your source file contains more than 32768^2 lines, and there's a failure edge case if the number of lines you've specified (N) is >1 and the random start point is less than N lines from the beginning. Solving that is left as an exercise for the reader. :)
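(If you don't feel like doing the exercise: one possible fix, sketched and untested, is to clamp the start point right after `r` is computed, so that at least N lines precede it:)

# Untested sketch: ensure at least ${2:-1} lines precede line $r,
# so the head|tail window never runs off the top of the file.
if (( r < ${2:-1} )); then
  r=${2:-1}
fi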
UPDATE #1:
mklement0 asks an excellent question in comments about potential performance issues with the `head ... | tail ...` approach. I honestly don't know the answer, but I would hope that both `head` and `tail` are optimized sufficiently that they wouldn't buffer ALL input prior to displaying their output.
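One quick way to test that hope on your own system (a sketch; the numbers are arbitrary):

# If tail buffered everything, memory use would balloon here; in practice
# it should only ever hold the last 5 of the 9M lines that head lets through.
time (seq 1 10000000 | head -n 9000000 | tail -n 5 > /dev/null)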
On the off chance that my hope is unfulfilled, here's an alternative. It's an awk-based "sliding window" tail. I'll embed it in the earlier function I wrote so you can test it if you want.
randline() {
  local lines c r _
  # Line count cache, per the first version of this function...
  lines="/tmp/${1//\//-}.lines"
  if [ -h "$lines" ] && [ "$lines" -nt "$1" ]; then
    c=$(ls -l "$lines" | sed 's/.* //')
  else
    read c _ < <(wc -l "$1")
    ln -sfn "$c" "$lines"
  fi
  r=$(( c * (RANDOM * 32768 + RANDOM) / (32768 * 32768) ))
  echo "start=$r" >&2
  # This simply pipes the functionality of the `head | tail` combo above
  # through a single invocation of awk.
  # It should handle any size of input file with the same load/impact.
  awk -v lines="${2:-1}" -v count=0 -v start="$r" '
    NR < start { next; }                              # skip to the window
    { out[NR] = $0; count++; }                        # record current line
    count > lines { delete out[start++]; count--; }   # trim oldest record
    END {
      for (i = start; i < start + lines; i++) {
        print out[i];
      }
    }
  ' "$1"
}
The embedded awk script replaces the `head ... | tail ...` pipeline in the previous version of the function. It works as follows:
- It skips lines until the "start" determined by the earlier randomization.
- It records each subsequent line to an array.
- If the array holds more than the number of lines we want to keep, it eliminates the oldest record.
- At the end of the file, it prints the recorded data.
The result is that the awk process shouldn't grow its memory footprint because the output array gets trimmed as fast as it's built.
NOTE: I haven't actually tested this with your data.
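If you want to play with the sliding-window idea on its own, here's a minimal sketch of the same technique as a standalone constant-memory "last n lines" filter (my variation, using a ring buffer instead of `delete`; equally untested against your data):

# Print the last n lines of a file in O(n) memory: NR % n cycles through
# n slots, so each new line overwrites the one that just fell out of range.
awk -v n=5 '
  { buf[NR % n] = $0 }
  END {
    start = NR - n + 1
    if (start < 1) start = 1
    for (i = start; i <= NR; i++)
      print buf[i % n]
  }
' somefile.txt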
UPDATE #2:
Hrm, with the update to your question that you want N random lines rather than a block of lines starting at a random point, we need a different strategy. The system limitations you've imposed are pretty severe. The following might be an option, also using awk, with random numbers still from Bash:
randlines() {
  local lines c i _
  # Line count cache...
  lines="/tmp/${1//\//-}.lines"
  if [ -h "$lines" ] && [ "$lines" -nt "$1" ]; then
    c=$(ls -l "$lines" | sed 's/.* //')
  else
    read c _ < <(wc -l "$1")
    ln -sfn "$c" "$lines"
  fi
  # Create a LIST of random numbers, from 1 to the number of lines ($c)...
  for (( i=0; i<$2; i++ )); do
    echo $(( c * (RANDOM * 32768 + RANDOM) / (32768 * 32768) + 1 ))
  done | awk '
    # ...and here inside awk, build an array of those random numbers, and
    NR == FNR { lines[$1]; next; }
    # print lines from the input file whose numbers appear in the array.
    FNR in lines
  ' - "$1"
}
This works by feeding a list of random line numbers into awk as a "first" file, then having awk print lines from the "second" file whose line numbers were included in the "first" file. It uses `wc` to determine the upper limit of the random numbers to generate. That means you'll be reading this file twice. If you have another source for the number of lines in the file (a database, for example), do plug it in here. :)
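Usage has the same shape as before (a sketch; the file name is a placeholder):

# Hypothetical example: pull 30000 random lines out of ./giant-file.txt.
randlines ./giant-file.txt 30000 > sample.txt

One thing to be aware of: if the loop happens to generate the same number twice, the duplicates collapse into a single array entry, so you can end up with slightly fewer than N output lines.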
A limiting factor might be the size of that first file, which must be loaded into memory. I believe the 30000 random numbers should only take about 170KB of memory, but how the array gets represented in RAM depends on the awk implementation you're using. (Usually, though, awk implementations, including the Gawk shipped with Ubuntu, are pretty good at keeping memory wastage to a minimum.)
Does this work for you?