5

I have bunch of files and very file has a header of 5 lines. In the rest of the file, pair of line form an entry. I need to randomly select entry from these files. How can i select random files and random entry(pair of line, excluding header) ?

AlgoMan
  • 2,785
  • 6
  • 34
  • 40

7 Answers7

7

You may find perlfaq5 useful.

Eugene Yarmash
  • 142,882
  • 41
  • 325
  • 378
  • 2
    Its worth noting that its a single-pass algorithm which only keeps two lines in memory at any time. – Schwern Jun 09 '10 at 22:48
  • I would have posted this as CW, since you're (presumably) not the author of this faq item. – Ether Jun 09 '10 at 22:55
  • 1
    +1 Clever algorithm. I learned something today. Nit: I guess nowadays, we should drop the preceding "srand". – tsee Jun 10 '10 at 08:17
7

If the file is small enough, read the pairs of lines into memory and select randomly from that data structure. If the file is too large, Eugene Y provides the right answer: use reservoir sampling.

Here's an intuitive explanation for the algorithm.

Process the file line by line.
pick = line, with probability 1/N, where N = line number

In other words, on line 1, we will pick line 1 with 1/1 probability. On line 2, we will change the pick to line 2, with 1/2 probability. On line 3, we will change the pick to line 3, with 1/3 probability. Etc.

For an intuitive proof, imagine a file with 3 lines:

        1            Pick line 1.
       / \
     .5  .5
     /     \
    2       1        Switch to line 2?
   / \     / \
 .67 .33 .33 .67
 /     \ /     \
2       3       1    Switch to line 3?

The probability for each outcome:

Line 1: .5 * .67     = 1/3
Line 2: .5 * .67     = 1/3
Line 3: .5 * .33 * 2 = 1/3

From there, the rest is induction. For example, suppose the file has 4 lines. We've already convinced ourselves that as of line 3, every line so far (1, 2, 3) will have an equal chance of being our current selection. When we advance to line 4, it will have a 1/4 chance of being picked -- exactly what it should be, thus reducing the probabilities on the previous 3 lines by exactly the right amount (1/3 * 3/4 = 1/4).

Here's the Perl FAQ answer, adapted to your problem.

use strict;
use warnings;

# Ignore 5 lines.
<> for 1 .. 5;

# Use reservoir sampling to select pairs from remaining lines.
my (@picks, $n);
until (eof){
    my @lines;
    $lines[$_] = <> for 0 .. 1;

    $n ++;
    @picks = @lines if rand($n) < 1;
}

print @picks;
FMc
  • 41,963
  • 13
  • 79
  • 132
  • 1
    Great explanation of the algorithm. Nit: I think the OP said that two lines make up an entry, so your program would need a minor modification to account for that (i.e. add another readline, divide no. of lines in the rand() call by two). Crazy idea: one might be able to use `File::Stream`, which let's you use a regexp in $/ to read two lines at once. Of course, that would be a bad idea in production because the module's extremely slow. – tsee Jun 10 '10 at 09:43
3
sed "1,5d" < FILENAME | sort -R | head -2
zed_0xff
  • 32,417
  • 7
  • 53
  • 72
  • 1
    Interesting. but where is the "-R" switch (presumably for "Randomize" supported? I just checked on Linux (RHEL5.4, coreutils 5.97...), Mac OS X (10.5.8), and FreeBSD (6.4). – Jim Dennis Jun 09 '10 at 21:22
3

Python solution - reads file only once and requires little memory

Invoke like so getRandomItems(file('myHuge.log'), 5, 2) - will return list of 2 lines

from random import randrange

def getRandomItems(f, skipFirst=0, numItems=1):
    for _ in xrange(skipFirst):
        f.next()
    n = 0; r = []
    while True:
        try:
            nxt = [f.next() for _ in range(numItems)]
        except StopIteration: break
        n += 1
        if not randrange(n):
            r = nxt
    return r

Returns empty list if it could not get the first passable items from f. The code's only requirement is that argument f is an iterator (supports next() method). Hence we can pass something different than file, say we want to see the distribution:

>>> s={}
>>> for i in xrange(5000):
...     r = getRandomItems(iter(xrange(50)))[0]
...     s[r] = 1 + s.get(r,0)
... 
>>> for i in s: 
...     print i, '*' * s[i]
... 
0 ***********************************************************************************************
1 **************************************************************************************************************
2 ******************************************************************************************************
3 ***************************************************************************
4 *************************************************************************************************************************
5 ********************************************************************************
6 **********************************************************************************************
7 ***************************************************************************************
8 ********************************************************************************************
9 ********************************************************************************************
10 ***********************************************************************************************
11 ************************************************************************************************
12 *******************************************************************************************************************
13 *************************************************************************************************************
14 ***************************************************************************************************************
15 *****************************************************************************************************
16 ********************************************************************************************************
17 ****************************************************************************************************
18 ************************************************************************************************
19 **********************************************************************************
20 ******************************************************************************************
21 ********************************************************************************************************
22 ******************************************************************************************************
23 **********************************************************************************************************
24 *******************************************************************************************************
25 ******************************************************************************************
26 ***************************************************************************************************************
27 ***********************************************************************************************************
28 *****************************************************************************************************
29 ****************************************************************************************************************
30 ********************************************************************************************************
31 ********************************************************************************************
32 ****************************************************************************************************
33 **********************************************************************************************
34 ****************************************************************************************************
35 **************************************************************************************************
36 *********************************************************************************************
37 ***************************************************************************************
38 *******************************************************************************************************
39 **********************************************************************************************************
40 ******************************************************************************************************
41 ********************************************************************************************************
42 ************************************************************************************
43 ****************************************************************************************************************************
44 ****************************************************************************************************************************
45 ***********************************************************************************************
46 *****************************************************************************************************
47 ***************************************************************************************
48 ***********************************************************************************************************
49 ****************************************************************************************************************
Nas Banov
  • 28,347
  • 6
  • 48
  • 67
1

Answer is in Python. Assuming you can read a whole file into memory.

#using python 2.6
import sys
import os
import itertools
import random

def main(directory, num_files=5, num_entries=5):
    file_paths = os.listdir(directory)

    # get a random sampling of the available paths
    chosen_paths = random.sample(file_paths, num_files)

    for path in chosen_paths:
        chosen_entries = get_random_entries(path, num_entries)
        for entry in chosen_entries:
            # do something with your chosen entries
            print entry

def get_random_entries(file_path, num_entries):
    with open(file_path, 'r') as file:
        # read the lines and slice off the headers
        lines = file.readlines()[5:]

        # group the lines into pairs (i.e. entries)
        entries = list(itertools.izip_longest(*[iter(lines)]*2))

        # return a random sampling of entries
        return random.sample(entries, num_entries)

if __name__ == '__main__':
    #use optparse here to do fancy things with the command line args
    main(sys.argv[1:])

Links: itertools, random, optparse

tgray
  • 8,826
  • 5
  • 36
  • 41
  • My python version on the server where the log is . 2.4.3 – AlgoMan Jun 10 '10 at 00:47
  • @EnTerr It's exactly as many lines as your answer if you take out the whitespace and comments; both of which are lacking in yours. – Jon-Eric Jun 10 '10 at 01:40
  • ugh, loading all lines in memory is silly - say there a million lines in file of 50MB? – Nas Banov Jun 10 '10 at 02:04
  • @EnTerr, I wouldn't worry about 50MB or a million lines. The problems might begin when we get up near 500MB - 1GB. because at any given point there could be up to two copies of the files content in memory (plus a little overhead from the other variables). – tgray Jun 10 '10 at 15:30
0

Two other means to do so: 1- by generators (may still require a lot of memory): http://www.usrsb.in/Picking-Random-Items--Take-Two--Hacking-Python-s-Generators-.html

2- by a clever seeking (best method actually): http://www.regexprn.com/2008/11/read-random-line-in-large-file-in.html

I here copy the code of the clever Jonathan Kupferman:

#!/usr/bin/python

import os,random

filename="averylargefile"
file = open(filename,'r')

#Get the total file size
file_size = os.stat(filename)[6]

while 1:
      #Seek to a place in the file which is a random distance away
      #Mod by file size so that it wraps around to the beginning
      file.seek((file.tell()+random.randint(0,file_size-1))%file_size)

      #dont use the first readline since it may fall in the middle of a line
      file.readline()
      #this will return the next (complete) line from the file
      line = file.readline()

      #here is your random line in the file
      print line
gaborous
  • 15,832
  • 10
  • 83
  • 102
0

Another Python option; reading the contents of all files into memory:

import random
import fileinput

def openhook(filename, mode):
    f = open(filename, mode)
    headers = [f.readline() for _ in range(5)]
    return f

num_entries = 3
lines = list(fileinput.input(openhook=openhook))
print random.sample(lines, num_entries)
Matt Anderson
  • 19,311
  • 11
  • 41
  • 57