4

I have a file with contents

abc
def
high
lmn
...
...

The file has more than 2 million lines. I want to randomly sample 50K lines from it and output them. Any thoughts on how to approach this problem? I was thinking along the lines of Perl and its rand function (or a handy shell command would be neat).

Related (Possibly Duplicate) Questions:

kal

5 Answers

13

Assuming you basically want to output about 2.5% of all lines, this would do:

# $input is an already-opened filehandle; each line is printed with probability 0.025
rand() < 0.025 and print while <$input>;
Sinan Ünür
  • If the file size varies, you could calculate the percentage by counting the lines (cf. perlfaq5) and dividing that into the number of lines desired. – Michael Carman Jun 23 '09 at 20:08
  • This is a really good solution because it avoids the naive approaches to solving this problem, which involve jumping to random points in the file or (even worse!) sorting the input. – James Thompson Jun 24 '09 at 07:36
  • @James Thompson: while it might look like a good solution, it actually is not a correct solution for the question. There is no way to guarantee that it will return 50k rows. –  Jun 24 '09 at 07:47
  • Mine is a perfectly fine solution if you want to sample roughly 2.5% of all lines. Mine is not the right solution if the requirement is to output exactly 50,000 lines. I explicitly stated the approximate nature of this. For the latter problem, I believe I once read a single-pass algorithm but I cannot remember it now. – Sinan Ünür Jun 24 '09 at 11:10
  • The algorithm that Sinan is thinking of is called the Reservoir Sampling Algorithm. It's covered well on this site and elsewhere on the Internet. – James Thompson Jul 26 '09 at 23:20
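Michael Carman's suggestion of deriving the probability from the line count could be sketched roughly as follows. This is only an illustration: the helper name sample_fraction is made up here, it assumes two passes over the file are acceptable, and it still returns only approximately the requested number of lines.

```perl
use strict;
use warnings;

# Hypothetical helper: keep each line with probability $want / (total lines).
# The result has roughly $want lines, not exactly $want.
sub sample_fraction {
    my ($file, $want) = @_;

    open(my $fh, '<', $file)
        or die "Can't read file '$file' [$!]\n";

    # First pass: count the lines.
    my $total = 0;
    while (defined(my $line = <$fh>)) {
        $total++;
    }
    return () unless $total;

    my $fraction = $want / $total;

    # Second pass: rewind and keep each line with the computed probability.
    seek($fh, 0, 0);
    my @kept;
    while (defined(my $line = <$fh>)) {
        push @kept, $line if rand() < $fraction;
    }
    close $fh;

    return @kept;
}
```

Calling sample_fraction($file, 50_000) on a 2-million-line file would use a probability of 0.025, the same figure as in the answer above.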
5

Shell way:

sort -R file | head -n 50000
3

From perlfaq5: "How do I select a random line from a file?"


Short of loading the file into a database or pre-indexing the lines in the file, there are a couple of things that you can do.

Here's a reservoir-sampling algorithm from the Camel Book:

srand;
rand($.) < 1 && ($line = $_) while <>;

This has a significant advantage in space over reading the whole file in. You can find a proof of this method in The Art of Computer Programming, Volume 2, Section 3.4.2, by Donald E. Knuth.

You can use the File::Random module which provides a function for that algorithm:

use File::Random qw/random_line/;
my $line = random_line($filename);

Another way is to use the Tie::File module, which treats the entire file as an array. Simply access a random array element.
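A minimal sketch of the Tie::File approach, assuming read-only access is wanted (by default Tie::File opens the file read-write and creates it if missing). The helper name random_line_tied is hypothetical.

```perl
use strict;
use warnings;
use Fcntl qw(O_RDONLY);
use Tie::File;

# Hypothetical helper: return one random line using Tie::File.
sub random_line_tied {
    my ($file) = @_;

    # mode => O_RDONLY keeps Tie::File from creating or modifying the file.
    tie(my @lines, 'Tie::File', $file, mode => O_RDONLY)
        or die "Can't tie file '$file' [$!]\n";

    # Empty file: nothing to pick.
    return unless @lines;

    # rand(@lines) lies in [0, number of lines); using it as an array
    # index truncates it to an integer.
    my $line = $lines[ rand @lines ];
    untie @lines;

    # Tie::File removes the trailing newline from each record by default.
    return $line;
}
```

Note that Tie::File still has to scan the file to find out how many lines it contains, so this trades convenience for an extra pass on large files.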

brian d foy
  • Following this answer led me to a great description of reservoir sampling, and an easy way to extend the Camel Book code from one line to `k` items: http://stackoverflow.com/a/12733515/2016618 – Sarkom May 02 '14 at 22:56
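Extending the Camel Book one-liner from one line to k lines might look like the sketch below. It is a plain reservoir-sampling loop written for illustration, not the code from the linked answer, and the helper name reservoir_sample is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: pick $k lines uniformly at random from the
# filehandle $fh in one pass, holding at most $k lines in memory.
sub reservoir_sample {
    my ($fh, $k) = @_;
    my @sample;
    my $seen = 0;

    while (defined(my $line = <$fh>)) {
        $seen++;
        if (@sample < $k) {
            # Fill the reservoir with the first $k lines.
            push @sample, $line;
        }
        else {
            # Keep the new line with probability $k / $seen by
            # overwriting a randomly chosen slot.
            my $j = int(rand($seen));
            $sample[$j] = $line if $j < $k;
        }
    }

    # Fewer than $k lines come back if the input is shorter than $k.
    return @sample;
}
```

With a filehandle opened on the input file, print reservoir_sample($fh, 50_000) would output exactly 50,000 lines whenever the file has at least that many. The lines are not returned in file order.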
2

If you need to extract an exact number of lines:

use strict;
use warnings;

# Number of lines to pick and file to pick from
# Error checking omitted!
my ($pick, $file) = @ARGV;

open(my $fh, '<', $file)
    or die "Can't read file '$file' [$!]\n";

# count lines in file
my ($lines, $buffer) = (0, '');    # start at 0 so an empty file does not leave $lines undefined
while (sysread $fh, $buffer, 4096) {
    $lines += ($buffer =~ tr/\n//);
}

# limit number of lines to pick to number of lines in file
$pick = $lines if $pick > $lines;

# build list of N lines to pick, use a hash to prevent picking the
# same line multiple times
my %picked;
for (1 .. $pick) {
    my $n = int(rand($lines)) + 1;
    redo if $picked{$n}++
}

# loop over file extracting selected lines
seek($fh, 0, 0);
while (<$fh>) {
    print if $picked{$.};
}
close $fh;
Michael Carman
  • Really nice approach. The only thing missing is a check that $pick <= $lines; otherwise it will hang in the for() loop. –  Jun 23 '09 at 20:47
  • Bug: int(rand($lines)) can return a 0 but $. starts at 1. – Jeremy Leipzig Nov 30 '10 at 22:24
  • @jermdemo: Argh, and `rand` returns a value less than the argument, so it wouldn't pick the last line. Silly 1-based variables... I added a `+1` to fix both edge cases. – Michael Carman Dec 01 '10 at 15:03
2

Perl way:

Use CPAN. There is a module, File::RandomLine, that does exactly what you need.

Sinan Ünür