5

How can I get n random lines from very large files that can't fit in memory?

Also it would be great if I could add filters before or after the randomization.


Update 1

In my case the specs are:

  • > 100 million lines
  • > 10GB files
  • usual random batch size 10000-30000
  • 512 MB RAM, hosted Ubuntu Server 14.10

So losing a few lines from the file won't be a big problem, as each has only about a 1 in 10000 chance of being picked anyway, but performance and resource consumption would be a problem.

Stefan Rogin
  • Just to provide clarification: judging by your own answer it seems that when you say "batches of n lines", you mean `n` lines selected randomly _individually_, _not_ a _block of `n` contiguous lines_ from a random starting point. – mklement0 Mar 17 '15 at 20:54
  • yes, probably "batches" wasn't the best expression to use, sorry :) – Stefan Rogin Mar 18 '15 at 06:02
  • Hmm. Thanks for the clarification. Unfortunately, it means my answer won't work for you. I'll update my answer, but can you give me an idea of what values of `n` you'd expect? How many of the >100million lines would you want to display? ALSO, what kind of "filters" do you want to add before/after randomization? – ghoti Mar 18 '15 at 07:05
  • usually random batches of 10-30k – Stefan Rogin Mar 18 '15 at 07:35
  • Please see update #2 in my answer. – ghoti Mar 18 '15 at 07:55
  • Looks like "Convert to a database" is the best answer: the file format is simply more adapted to the task than newline separated human readable entries. – Ciro Santilli OurBigBook.com Nov 20 '15 at 09:57

5 Answers

8

Given these limiting factors, the following approach works better:

  • seek to a random position in the file (you will most likely land "inside" some line)
  • scan backward from this position to find the start of that line
  • scan forward and print the full line

For this you need a tool that can seek within files, for example Perl:

use strict;
use warnings;
use Symbol;
use Fcntl qw( :seek O_RDONLY ) ;
my $seekdiff = 256; #e.g. from "rand_position-256" up to rand_positon+256

my($want, $filename) = @ARGV;

my $fd = gensym ;
sysopen($fd, $filename, O_RDONLY ) || die("Can't open $filename: $!");
binmode $fd;
my $endpos = sysseek( $fd, 0, SEEK_END ) or die("Can't seek: $!");

my $buffer;
my $cnt = 0;        #lines printed so far
while($want > $cnt++) {
    my $randpos = int(rand($endpos));   #random file position
    my $seekpos = $randpos - $seekdiff; #start read here ($seekdiff chars before)
    $seekpos = 0 if( $seekpos < 0 );

    sysseek($fd, $seekpos, SEEK_SET);   #seek to position
    my $in_count = sysread($fd, $buffer, $seekdiff<<1); #read 2*seekdiff characters

    my $rand_in_buff = ($randpos - $seekpos)-1; #the random position in the buffer

    my $linestart = rindex($buffer, "\n", $rand_in_buff) + 1; #find the beginning of the line in the buffer
    my $lineend = index $buffer, "\n", $linestart;            #find the end of line in the buffer
    my $the_line = substr $buffer, $linestart, $lineend < 0 ? 0 : $lineend-$linestart;

    print "$the_line\n";
}

Save the above into a file such as "randlines.pl" and use it as:

perl randlines.pl wanted_count_of_lines file_name

e.g.

perl randlines.pl 10000 ./BIGFILE

The script does very low-level I/O operations, so it is VERY FAST (on my notebook, selecting 30k lines from a 10M-line file took about half a second).
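
If you also need the pre/post filters from the question, one option (as noted in the comments below) is to write the pre-filtered data to a temporary file first, because the script needs a seekable file rather than a pipe. This is only a sketch, with pre_filter and post_filter as placeholder commands:

tmp=$(mktemp) || exit 1                       # temporary, seekable copy of the pre-filtered data
pre_filter < ./BIGFILE > "$tmp"               # placeholder: whatever pre-filter you need
perl randlines.pl 10000 "$tmp" | post_filter  # placeholder: whatever post-filter you need
rm -f "$tmp"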

mklement0
clt60
  • Great solution that is indeed much faster than any approach that needs to count / split into lines first. Just to note the constraints (which are _not_ a problem for the OP): the max. line length is assumed to be 257 bytes (easy to change), and it's possible for lines to get selected more than once. Can only be used with files in which you can seek, which rules out stdin input and FIFOs, for instance. – mklement0 Mar 23 '15 at 15:29
  • @mklement0 - yes, here are constraints - price for the speed. btw, 2*256 using `$seekdiff<<1` - e.g `256<<1` = 512 :) – clt60 Mar 23 '15 at 23:42
  • Re `256<<1 = 512`: Your _buffer_ is 512 bytes long, but you don't know where that falls relative to the beginning of a line. So, in the worst-case scenario, the 1st (and possibly only) `\n` in your buffer is at byte position 256, which means that the longest line you can fully return is 256 bytes long (I mistakenly said 257, originally). A final thought on the OP's requirements: since he wanted pre- and post-filters, you'd have to send the results of the pre-filter to a temp. file and pass its name to your Perl script. – mklement0 Mar 24 '15 at 13:08
  • @jm666 thanks, this gave me the inspiration to learn Perl. In my case my lines were long, so I increased `seekdiff` – ARIF MAHMUD RANA Jan 19 '21 at 08:32
4

Here's a wee bash function for you. It grabs, as you say, a "batch" of lines, with a random start point within a file.

randline() {
  local lines c r _

  # cache the number of lines in this file in a symlink in the temp dir
  lines="/tmp/${1//\//-}.lines"
  if [ -h "$lines" ] && [ "$lines" -nt "${1}" ]; then
    c=$(ls -l "$lines" | sed 's/.* //')
  else
    read c _ < <(wc -l "$1")
    ln -sfn "$c" "$lines"
  fi

  # Pick a random number...
  r=$[ $c * ($RANDOM * 32768 + $RANDOM) / (32768 * 32768) ]
  echo "start=$r" >&2

  # And start displaying $2 lines before that number.
  head -n $r "$1" | tail -n ${2:-1}
}

Edit the echo lines as required.

This solution has the advantages of fewer pipes, less resource-intensive pipes (i.e. no | sort ... |), and less platform dependence (i.e. no sort -R, which is GNU-sort-specific).

Note that this relies on Bash's $RANDOM variable, which may or may not actually be random. Also, it will miss lines if your source file contains more than 32768^2 lines, and there's a failure edge case if the number of lines you've specified (N) is >1 and the random start point is less than N lines from the beginning. Solving that is left as an exercise for the reader. :)
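
Assuming the function has been sourced into your shell, an invocation might look like this (the file name and line count are placeholders):

randline ./BIGFILE 10000     # prints a block of 10000 lines ending at a random line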


UPDATE #1:

mklement0 asks an excellent question in comments about potential performance issues with the head ... | tail ... approach. I honestly don't know the answer, but I would hope that both head and tail are optimized sufficiently that they wouldn't buffer ALL input prior to displaying their output.

On the off chance that my hope is unfulfilled, here's an alternative. It's an awk-based "sliding window" tail. I'll embed it in the earlier function I wrote so you can test it if you want.

randline() {
  local lines c r _

  # Line count cache, per the first version of this function...
  lines="/tmp/${1//\//-}.lines"
  if [ -h "$lines" ] && [ "$lines" -nt "${1}" ]; then
    c=$(ls -l "$lines" | sed 's/.* //')
  else
    read c _ < <(wc -l "$1")
    ln -sfn "$c" "$lines"
  fi

  r=$[ $c * ($RANDOM * 32768 + $RANDOM) / (32768 * 32768) ]

  echo "start=$r" >&2

  # This simply pipes the functionality of the `head | tail` combo above
  # through a single invocation of awk.
  # It should handle any size of input file with the same load/impact.
  awk -v lines=${2:-1} -v count=0 -v start=$r '
    NR < start { next; }
    { out[NR]=$0; count++; }
    count > lines { delete out[start++]; count--; exit }  # window is full; stop reading
    END {
      for(i=start;i<start+lines;i++) {
        print out[i];
      }
    }
  ' "$1"
}

The embedded awk script replaces the head ... | tail ... pipeline in the previous version of the function. It works as follows:

  • It skips lines until the "start" as determined by earlier randomization.
  • It records the current line to an array.
  • Once the array holds more lines than we want to keep, it eliminates the first record and stops reading.
  • In the END block, it prints the recorded data.

The result is that the awk process shouldn't grow its memory footprint because the output array gets trimmed as fast as it's built.

NOTE: I haven't actually tested this with your data.


UPDATE #2:

Hrm, with the update to your question that you want N random lines rather than a block of lines starting at a random point, we need a different strategy. The system limitations you've imposed are pretty severe. The following might be an option, also using awk, with random numbers still from Bash:

randlines() {
  local lines c r _

  # Line count cache...
  lines="/tmp/${1//\//-}.lines"
  if [ -h "$lines" ] && [ "$lines" -nt "${1}" ]; then
    c=$(ls -l "$lines" | sed 's/.* //')
  else
    read c _ < <(wc -l "$1")
    ln -sfn "$c" "$lines"
  fi

  # Create a LIST of random numbers, from 1 to the size of the file ($c)
  for (( i=0; i<$2; i++ )); do
    echo $[ $c * ($RANDOM * 32768 + $RANDOM) / (32768 * 32768) + 1 ]
  done | awk '
    # And here inside awk, build an array of those random numbers, and
    NR==FNR { lines[$1]; next; }
    # display lines from the input file that match the numbers.
    FNR in lines
  ' - "$1"
}

This works by feeding a list of random line numbers into awk as a "first" file, then having awk print lines from the "second" file whose line numbers were included in the "first" file. It uses wc to determine the upper limit of the random numbers to generate. That means you'll be reading this file twice. If you have another source for the number of lines in the file (a database for example), do plug it in here. :)

A limiting factor might be the size of that first file, which must be loaded into memory. I believe that the 30000 random numbers should only take about 170KB of memory, but how the array gets represented in RAM depends on the implementation of awk you're using. (Though usually, awk implementations (including Gawk in Ubuntu) are pretty good at keeping memory wastage to a minimum.)

Does this work for you?
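
If generating the random numbers in a bash loop turns out to be a bottleneck, the comments below suggest building the random indices inside awk itself, in a BEGIN block with srand()/rand(), and exiting once everything requested has been printed. A sketch only (it reuses the cached line count $c and the same positional parameters; output is still in file order, and duplicate picks collapse into one):

awk -v n="${2:-1}" -v c="$c" '
  BEGIN {
    srand()                               # seed from the time of day
    # build a set of random line numbers in 1..c
    for (i = 0; i < n; i++) {
      k = int(rand() * c) + 1
      if (!(k in wanted)) { wanted[k]; distinct++ }
    }
  }
  FNR in wanted { print; found++ }
  found == distinct { exit }              # stop once every wanted line was printed
' "$1"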

ghoti
  • Interesting approach, but I'm unclear whether the OP is really looking for _blocks of contiguous lines_ at a random starting location; other answers assume that each of the `n` lines should be selected randomly _individually_ - I've asked the OP for clarification. Aside from that: With the `head ... | tail ...` approach, is performance a concern in case the starting line is close to the end of a very large file? – mklement0 Mar 17 '15 at 20:17
  • Excellent point, @mklement0. I have no idea if the head + tail combo would be performant in extreme situations. I've updated my answer with an alternative, for the fun of it. I'd love to know your thoughts. – ghoti Mar 18 '15 at 07:02
  • Re update #1: For my tests I had to make two fixes: `start=$c` should be `start=$r`, and `{ delete out[start++]; count--; }` was missing an `exit`: `{ delete out[start++]; count--; exit } `. Also, to make this and the original function behave identically, I changed `head -n $r "$1"` in the original function to `head -n $(( r + ${2:-1} )) "$1"`. – mklement0 Mar 18 '15 at 14:08
  • I appreciate the follow-up. Re making heads or tails (update #1): On my OSX 10.10.2 machine, your `awk`-based solution is significantly faster with `gawk` or `mawk` than the `head ... | tail ..`. solution, whereas, curiously, the preinstalled BSD (BWK) `awk ` is SLOWER - no idea why BSD awk performs so poorly here. Here are the timings for getting 50 lines starting at the 90-millionth line of a 100-million lines file, in ascending order (take them with a grain of salt, this wasn't a well-controlled experiment): mawk: 0m8.069s; gawk: 0m12.353s; head + tail: 1m1.910s; BSD awk: 1m28.559s – mklement0 Mar 18 '15 at 14:09
  • Re update #2: The problem is that you're still extracting the lines in _input order_; if your input is, say, a list of alphabetically listed words, your output will be alphabetically sorted, too. To fix that, you'd have to cache the output lines and iterate over them in "file 1" order in an `END` block. Also, it's worth exiting the main loop once the extract count is reached. Building up the random-indices array in a `BEGIN` block inside `awk` (using `rand()`) rather than in bash would speed up things as well. – mklement0 Mar 23 '15 at 14:50
  • P.S.: The caching mechanism for the line count is great, but a little obscure (using a symlink's _filename_ to store the count). – mklement0 Mar 23 '15 at 14:51
2

Simple (but slow) solution

n=15 #number of random lines
filter_before | sort -R | head -$n | filter_after

#or, if you could have duplicate lines
filter_before | nl | sort -R | cut -f2- | head -$n | filter_after
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Or, if you want, save the following into a randlines script:

#!/bin/bash
nl | sort -R | cut -f2- | head -"${1:-10}"

and use it as:

filter_before | randlines 55 | filter_after   #for 55 lines

How it works:

The sort -R sorts its input by a random hash calculated for each line, so you get a randomised order of lines; therefore the first N lines are random lines.

Because the hashing produces the same hash for identical lines, duplicate lines are not treated as different. You can eliminate this by adding the line number (with nl), so the sort never sees an exact duplicate; after the sort, the added line numbers are removed (with cut).

example:

seq -f 'some line %g' 500 | nl | sort -R | cut -f2- | head -3

prints in subsequent runs:

some line 65
some line 420
some line 290

some line 470
some line 226
some line 132

some line 433
some line 424
some line 196

demo with duplicate lines:

yes 'one
two' | head -10 | nl | sort -R | cut -f2- | head -3

prints in subsequent runs:

one
two
two

one
two
one

one
one
two

Finally, if you want, you could use sed instead of cut:

sed -r 's/^\s*[0-9][0-9]*\t//'
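
For example, substituting sed for cut in the earlier pipeline (this assumes GNU sed, to match the -r and \s used above):

seq -f 'some line %g' 500 | nl | sort -R | sed -r 's/^\s*[0-9][0-9]*\t//' | head -3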
clt60
  • Worth mentioning that `-R` is an _extension_ to the POSIX standard. Recent versions of both GNU and BSD sort do implement it, but, sadly, OSX is stuck on an ancient GNU sort version (v5.93 as of OSX 10.10) that does _not_ support it. – mklement0 Mar 17 '15 at 17:13
  • @mklement0 - yeah, you're right! I'm on yosemite too, and using GNU utils, installed by `macports`. – clt60 Mar 17 '15 at 17:16
  • It works fine; try `printf ' 1\t\ttwo\t\tthree' | cut -f2- | cat -t` – mklement0 Mar 17 '15 at 20:10
  • how does `sort -R` work when there is not enough RAM (to compare the hashes) ? does it go into swap ? also before my answer I tried `shuf` but it fails with an 'out of memory' notice – Stefan Rogin Mar 18 '15 at 06:15
  • Sort sorts in blocks and uses temporary files if needed - it does not need to load the whole dataset into memory. It will be _extremely slow_, though. In such conditions (512 RAM), maybe this isn't the best approach. (It depends on: 1.) how many times you need to get random lines, 2.) how many lines you need each time, 3.) how often the file content changes.) – clt60 Mar 18 '15 at 07:39
  • file doesn't change often, usually need 1-2 batches a day and have to have them as fast as I can. I tested with my solution and it finishes a 10k batch from 32 mil lines in ~30 sec (real 0m24.444s user 0m22.349s sys 0m1.650s) – Stefan Rogin Mar 18 '15 at 07:49
  • @clickstefan - GREAT to know that the file doesn't change often and that you refer to it multiple times per day. I've added line count caching to my answer, so if you're running this multiple times, you can avoid the extra read to the file if it hasn't been updated. – ghoti Mar 18 '15 at 13:59
  • @clickstefan added another answer what is very fast. – clt60 Mar 18 '15 at 17:03
0
#!/bin/bash
#contents of bashScript.sh

file="$1";
lineCnt=$2;
filter="$3";
nfilter="$4";
echo "getting $lineCnt lines from $file matching '$filter' and not matching '$nfilter'" 1>&2;

totalLineCnt=$(cat "$file" | grep "$filter" | grep -v "$nfilter" | wc -l | grep -o '^[0-9]\+');
echo "filtered count : $totalLineCnt" 1>&2;

chances=$( echo "$lineCnt/$totalLineCnt" | bc -l );
echo "chances : $chances" 1>&2;

cat "$file" | awk 'BEGIN { srand() } rand() <= $chances { print; }' | grep "$filter" | grep -v "$nfilter" | head -"$lineCnt";

usage:

get a 1000-line random sample

bashScript.sh /path/to/largefile.txt 1000  

only lines containing numbers

bashScript.sh /path/to/largefile.txt 1000 "[0-9]"

excluding mike and jane

bashScript.sh /path/to/largefile.txt 1000 "[0-9]" "mike|jane"
Stefan Rogin
  • In [rare] cases I believe this can return less than the requested number of lines. – user2864740 Mar 17 '15 at 15:06
  • yes, I was about to specify that :) also it might not match lines from beginning and end. Anyway good catch – Stefan Rogin Mar 17 '15 at 15:08
  • `| grep -o '^[0-9]\+'` is not needed and will break on BSD-like platforms, for instance, where `wc -l` outputs leading whitespace. – mklement0 Mar 23 '15 at 15:32
  • Running the file through the filters twice is likely to make this quite slow. This _typically_ returns fewer lines than requested (which, as you've stated, you're OK with). You must pass the value of shell variable `$chances` via a `-v` option to awk - you cannot reference shell variables directly inside a single-quoted awk script (and using a double-quoted string is not a good idea). – mklement0 Mar 23 '15 at 15:40
0

I've used rl for line randomisation and found it to perform quite well. Not sure how it scales to your case (you'd simply do e.g. rl FILE | head -n NUM). You can get it here: http://arthurdejong.org/rl/

micans
  • Interesting, but note that the `rl` page itself now recommends use of `shuf` instead, and the OP has stated in a comment that he ran out of memory using `shuf`. – mklement0 Mar 19 '15 at 13:15