53

I have a file with 10^7 lines, and I want to randomly choose 1/100 of its lines. This is the AWK code I have, but it slurps the entire file content beforehand, and my PC does not have enough memory for that. Is there another approach to do it?

awk 'BEGIN{srand()}
!/^$/{ a[c++]=$0 }              # slurp every non-empty line into memory
END {
  for ( i=1;i<=c;i++ ) {
    num=int(rand() * c)         # pick a random index
    if ( a[num] ) {
        print a[num]
        delete a[num]           # avoid printing the same line twice
        d++
    }
    if ( d == c/100 ) break     # stop once 1% of the lines have been printed
  }
}' file
Steven Huwig
neversaint
  • FYI 1% of 10,000,000 is too big -- you only need about 1000 to have a +/-3% margin of error and 10,000 for a +/-1% MOE. – Steven Huwig Mar 28 '09 at 08:17
  • @Steven: thanks so much for this. How did you derive the MOE figure above? Any reference? I truly want to learn more since my stat background is weak. BTW, your opinion seems to be different from cadrian's below (i.e. 1% is not enough) – neversaint Mar 28 '09 at 12:46
  • http://www.americanresearchgroup.com/moe.html – Steven Huwig Mar 28 '09 at 13:25
  • @StevenHuwig The link seems to be offline. Cache: https://web.archive.org/web/20131104235055/http://www.americanresearchgroup.com/moe.html Wikipedia: https://en.wikipedia.org/wiki/Margin_of_error – exic Feb 27 '14 at 15:57
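For reference, the 1,000 / 10,000 figures in the comments above follow from the standard worst-case margin-of-error formula at 95% confidence (a quick sketch, assuming a proportion p = 0.5; this derivation is not spelled out in the comments):

$$\mathrm{MOE} \approx 1.96\sqrt{\frac{p(1-p)}{n}} \le \frac{0.98}{\sqrt{n}} \quad\Longrightarrow\quad n \approx \left(\frac{0.98}{\mathrm{MOE}}\right)^{2}$$

which gives n ≈ 1,067 for a ±3% margin of error and n ≈ 9,604 for ±1%, essentially independent of the population size.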

11 Answers

91

If you have that many lines, are you sure you want exactly 1%, or would a statistical estimate be enough?

In the second case, just sample each line with a 1% probability:

awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01) print $0}'

If you'd like the header line plus a random sample of the lines after it, use:

awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01 || FNR==1) print $0}'
Tom Morris
cadrian
  • The original can technically try to choose an element beyond the end of the array, since it's using delete to avoid duplicates. This one doesn't have that problem. – paxdiablo Mar 28 '09 at 06:24
  • @Steven: true, but statistically with such a huge file I don't think it will. I tried with "man awk" which is 2300 lines long and I always had a ~20 lines (between 17 and 22) extract. – cadrian Mar 28 '09 at 08:05
  • @Steven: Theoretically yes, practically no. The probability of printing no lines, for the OP's 10 million line file (given by the Poisson distribution), is about 10^-43430. Equivalent to cracking a 144 *kilobyte* encryption key by guessing on the first try. – David Z Mar 28 '09 at 17:53
  • Also, for the OP's case (10^7 lines) there's a 99.8% chance that the chosen number of elements will be within 1% of 100,000. – David Z Mar 28 '09 at 18:09
  • Good point, but still, a correct _algorithm_ needs to be correct, not "extremely likely to be correct." IMHO anyway. – Steven Huwig Mar 28 '09 at 19:55
  • Note that this algorithm will have a bias towards longer lines. If you want to correct against this, you must weight your sampling (e.g. sample lines that are 2x longer 1/2 as often). Precisely how to do this is left as an exercise for the reader (hint: build an empirical distribution of line lengths). – medriscoll Sep 05 '09 at 23:20
  • @Steven: No. An algorithm needs to be correct *enough*. In this case it is correct enough. – Reid Jun 07 '13 at 23:15
  • @Steven: You should perhaps study up on the definition of algorithm. Also be sure never to use UUIDs because they could "theoretically" collide. – Reid Jun 10 '13 at 18:47
  • @Reid The problem as originally stated (four years ago, don't you have anything better to do?) asked for 1% of lines, not 1% of lines within some arbitrary tolerance. The method outlined in this answer is unsound for producing 1% of the lines of the input. UUIDs have nothing to do with the question. – Steven Huwig Jun 10 '13 at 19:53
  • Does the regex mentioned here match every line? If so, are you just doing it because it's a clever way to avoid doing a for loop over every line? – Sammaron Aug 31 '16 at 20:39
59

You used awk, but I don't know if it's required. If it's not, here's a trivial way to do it with Perl (and without loading the entire file into memory):

cat your_file.txt | perl -n -e 'print if (rand() < .01)'

(simpler form, from comments):

perl -ne 'print if (rand() < .01)' your_file.txt 
Gregg Lind
Bill
  • Theoretically that could print no values. – Steven Huwig Mar 28 '09 at 07:55
  • See my comments on cadrian's answer – David Z Mar 28 '09 at 18:05
  • @Steven, read the original post. His file had 10^7 lines. The odds of him not getting output are .99^10000000. Saying that isn't acceptable is ridiculous. If you're worried about that level of error, you shouldn't be using a computer. You're more likely to get incorrect output due to cosmic rays. – Bill Mar 31 '09 at 04:57
  • Okay, now *that's* legit. :-) – Bill Apr 04 '09 at 22:12
  • This version (second example) is crazy fast. I just ran through a 15GB file on an SSD in no time flat. Got a 150MB file back. Nice. – markwatson Mar 05 '15 at 04:12
  • And in contrast, the awk version above is crazy slow. To the next random person who finds this answer from the googles: try this answer first! – yshavit Jul 06 '16 at 23:00
21

I wrote this exact code in Gawk -- you're in luck. It's long partially because it preserves input order. There are probably performance enhancements that can be made.

This algorithm is correct without knowing the input size in advance. I posted a rosetta stone here about it. (I didn't post this version there because it does unnecessary comparisons.)

Original thread: Submitted for your review -- random sampling in awk.

# Waterman's Algorithm R for random sampling
# by way of Knuth's The Art of Computer Programming, volume 2

BEGIN {
    if (!n) {
        print "Usage: sample.awk -v n=[size]"
        exit
    }
    t = n
    srand()

}

NR <= n {
    pool[NR] = $0
    places[NR] = NR
    next

}

NR > n {
    t++
    M = int(rand()*t) + 1
    if (M <= n) {
        READ_NEXT_RECORD(M)
    }

}

END {
    if (NR < n) {
        print "sample.awk: Not enough records for sample" \
            > "/dev/stderr"
        exit
    }
    # gawk needs a numeric sort function
    # since it doesn't have one, zero-pad and sort alphabetically
    pad = length(NR)
    for (i in pool) {
        new_index = sprintf("%0" pad "d", i)
        newpool[new_index] = pool[i]
    }
    x = asorti(newpool, ordered)
    for (i = 1; i <= x; i++)
        print newpool[ordered[i]]

}

function READ_NEXT_RECORD(idx) {
    rec = places[idx]
    delete pool[rec]
    pool[NR] = $0
    places[idx] = NR  
} 
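For the question's 1-in-100 case, the sample size can be computed up front and passed in. A quick sketch, assuming the script above is saved as sample.awk (the filename is illustrative, not from the answer):

# two passes: count the lines, then sample 1% of them
n=$(( $(wc -l < file) / 100 ))
gawk -f sample.awk -v n="$n" file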
Steven Huwig
  • This is "randomly choose n lines" instead of "randomly choose 1/100 lines", so a little precalculation is needed. Still cool, though. – ephemient Mar 30 '09 at 17:49
  • Yep, though I commented on the question that "randomly choose n lines" is better -- sample size isn't in direct proportion to population size. – Steven Huwig Mar 30 '09 at 17:54
  • here's [Reservoir Sampling algorithm implementation in Python (without preserving order)](http://stackoverflow.com/a/32792504/4279). Note: [try/except here](http://stackoverflow.com/a/537064/4279) doesn't buy you anything: the probability that the *replacement* occurs for `i`-th item is `k / i` where `i` >> `k` (note: the probability that `i`-th item is *chosen* is `k / n` where `n` is the total number of items). – jfs Sep 26 '15 at 01:35
  • This is a great script, all I did was add `#!gawk -f` as the first line to make it work on my system (and gawk `brew install gawk` cause I have a mac) I just have one question: `Is this truly reservoir sampling?` – Jeremy Iglehart Feb 16 '18 at 15:40
  • @JeremyIglehart Yes, the reservoir is called `pool` here. It's not exactly the same as envisioned in the original materials, because of the dynamic associative arrays in awk vs more static arrays in other languages, but since the important characteristic of reservoir sampling is being "on-line" rather than it having a particular algorithmic complexity, I think it still counts. My added requirement of preserving order is just running two algorithms at once over the same collection, which doesn't change the reservoir sampling part into not being a reservoir sampling part. – Steven Huwig Feb 26 '18 at 20:12
17

This should work on most any GNU/Linux machine.

$ shuf -n $(( $(wc -l < $file) / 100)) $file

I'd be surprised if the GNU shuf command managed its memory poorly.

ashawley
5

I don't know awk, but there is a great technique for solving a more general version of the problem you've described, and in the general case it is quite a lot faster than the "for each line, return it if rand < 0.01" approach, so it might be useful if you intend to do tasks like this many (thousands, millions of) times. It is known as reservoir sampling, and this page has a pretty good explanation of a version of it that is applicable to your situation.

advait
  • here's [Python implementation of reservoir sampling algorithm (k==1)](http://askubuntu.com/a/527778/3712). – jfs Sep 24 '14 at 18:04
5

The problem of how to uniformly sample N elements out of a large population (of unknown size) is known as Reservoir Sampling. (If you like algorithms problems, do spend a few minutes trying to solve it without reading the algorithm on Wikipedia.)

A web search for "Reservoir Sampling" will find a lot of implementations. Here is Perl and Python code that implements what you want, and here is another Stack Overflow thread discussing it.

Tudor Bosman
5

In this case, reservoir sampling to get exactly k values is trivial enough with awk that I'm surprised no solution has suggested that yet. I had to solve the same problem and I wrote the following awk program for sampling:

#!/usr/bin/env awk -f
BEGIN{
    srand();
    if(k=="") k=10          # default sample size
}

# fill the reservoir with the first k lines
NR <= k {
    reservoir[NR-1] = $0;
    next;
}

# afterwards, pick a random index in [0, NR); if it lands inside the
# reservoir, the current line replaces that slot (probability k/NR)
{ i = int(NR * rand()) }

i < k { reservoir[i] = $0 }

END {
    for (i in reservoir) {
        print reservoir[i];
    }
}

If saved as sample_lines and made executable, it can be run like: ./sample_lines -v k=5 input_file. If k is not given, then 10 will be used by default.

Then figuring out what k should be has to be done separately, for example by setting -v "k=$(dc -e "$(cat input_file | wc -l) 100 / n")" (i.e. the line count divided by 100).

onlynone
kqr
3

You could do it in two passes:

  • Run through the file once, just to count how many lines there are
  • Randomly select the line numbers of the lines you want to print, storing them in a sorted list (or a set)
  • Run through the file once more and pick out the lines at the selected positions

Example in python:

fn = '/usr/share/dict/words'

from random import randint
from sys import stdout

# first pass: count the lines
count = 0
with open(fn) as f:
    for line in f:
        count += 1

# pick distinct random line numbers until we have 1% of the total
selected = set()
while len(selected) < count//100:
    selected.add(randint(0, count-1))

# second pass: print the lines at the selected positions
index = 0
with open(fn) as f:
    for line in f:
        if index in selected:
            stdout.write(line)
        index += 1
sth
1

If the aim is just to avoid memory exhaustion, and the file is a regular file, there is no need to implement reservoir sampling. The number of lines in the file can be learned by doing two passes over it: one to get the line count (as with wc -l), and one to select the sample, printing each line with probability p/n, where p is the number of lines still to print and n the number of lines left to read, which yields exactly p lines in total:

file=/some/file
awk -v percent=0.01 -v n="$(wc -l < "$file")" '
  BEGIN {srand(); p = int(n * percent)}
  rand() * n-- < p {p--; print}' < "$file"
Stephane Chazelas
1

Instead of waiting until the end to randomly pick your 1% of lines, do it every 100 lines inside your !/^$/ block. That way, you only ever hold 100 lines at a time.
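One way this could look (a minimal sketch of the idea above, not code from the original answer; it buffers at most 100 non-empty lines and prints one of them at random before starting over):

awk 'BEGIN { srand() }
!/^$/ {
    buf[n++] = $0
    if (n == 100) {                   # a full block of 100 lines is buffered
        print buf[int(rand() * 100)]  # print one of them at random
        split("", buf); n = 0         # clear the buffer for the next block
    }
}' file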

Travis Jensen
  • This leads to a different distribution of random lines. E.g., you'll never have two from the same 100-line set. – derobert Mar 28 '09 at 07:51
  • It would also affect the order. You'd never have a line from the third set before a line from the second set. Important considerations, for sure. – Travis Jensen Mar 28 '09 at 17:42
0

Here's my version. In the code below, 'c' is the number of lines to select from the input. Making c a parameter is left as an exercise for the reader, as is the reason why the line starting with c/NR works to reliably select exactly c lines (assuming the input has at least c lines).

#!/bin/sh

gawk '
BEGIN   { srand(); c = 5 }                  # c = number of lines to select
c/NR >= rand() { lines[x++ % c] = $0 }      # keep the current line with probability c/NR
END { for (i in lines)  print lines[i] }
' "$@"