Randomly selecting lines from files

Question

I have bunch of files and very file has a header of 5 lines. In the rest of the file, pair of line form an entry. I need to randomly select entry from these files. How can i select random files and random entry(pair of line, excluding header) ?

Does the header tell you the number of entries in the file? Are the entries fixed in size within a given file? — , Jun 09 '10 at 20:55
Please post an example of what you're dealing with. And what you've tried so far. — Zaid, Jun 09 '10 at 20:56
See this question for selecting a random line from a file: http://stackoverflow.com/questions/232237/whats-the-best-way-to-return-a-random-line-in-a-text-file-using-c — Mark Ransom, Jun 09 '10 at 21:00
Header just tells the time stamp and what each row contains and things like that — AlgoMan, Jun 09 '10 at 21:01
@Ziad I was writing a python script that takes filenaes as input randomly selects a file and reads the file and randomly selects couple of lines with the random number+ header length — AlgoMan, Jun 09 '10 at 21:10
@AlgoMan: You still haven't answered one question. Are the entries fixed in size? — , Jun 09 '10 at 21:42
@Moron Usually the entries are per day entry, But there might be missing data, so size is not fixed — AlgoMan, Jun 09 '10 at 21:46

score 7 · Answer 1 · answered Jun 09 '10 at 20:56

7

You may find perlfaq5 useful.

answered Jun 09 '10 at 20:56

Eugene Yarmash

142,882
41
325
378

2

Its worth noting that its a single-pass algorithm which only keeps two lines in memory at any time. – Schwern Jun 09 '10 at 22:48
I would have posted this as CW, since you're (presumably) not the author of this faq item. – Ether Jun 09 '10 at 22:55
1

+1 Clever algorithm. I learned something today. Nit: I guess nowadays, we should drop the preceding "srand". – tsee Jun 10 '10 at 08:17

FMc · Accepted Answer · 2010-06-10T13:34:12.680

If the file is small enough, read the pairs of lines into memory and select randomly from that data structure. If the file is too large, Eugene Y provides the right answer: use reservoir sampling.

Here's an intuitive explanation for the algorithm.

Process the file line by line.
pick = line, with probability 1/N, where N = line number

In other words, on line 1, we will pick line 1 with 1/1 probability. On line 2, we will change the pick to line 2, with 1/2 probability. On line 3, we will change the pick to line 3, with 1/3 probability. Etc.

For an intuitive proof, imagine a file with 3 lines:

        1            Pick line 1.
       / \
     .5  .5
     /     \
    2       1        Switch to line 2?
   / \     / \
 .67 .33 .33 .67
 /     \ /     \
2       3       1    Switch to line 3?

The probability for each outcome:

Line 1: .5 * .67     = 1/3
Line 2: .5 * .67     = 1/3
Line 3: .5 * .33 * 2 = 1/3

From there, the rest is induction. For example, suppose the file has 4 lines. We've already convinced ourselves that as of line 3, every line so far (1, 2, 3) will have an equal chance of being our current selection. When we advance to line 4, it will have a 1/4 chance of being picked -- exactly what it should be, thus reducing the probabilities on the previous 3 lines by exactly the right amount (1/3 * 3/4 = 1/4).

Here's the Perl FAQ answer, adapted to your problem.

use strict;
use warnings;

# Ignore 5 lines.
<> for 1 .. 5;

# Use reservoir sampling to select pairs from remaining lines.
my (@picks, $n);
until (eof){
    my @lines;
    $lines[$_] = <> for 0 .. 1;

    $n ++;
    @picks = @lines if rand($n) < 1;
}

print @picks;

Great explanation of the algorithm. Nit: I think the OP said that two lines make up an entry, so your program would need a minor modification to account for that (i.e. add another readline, divide no. of lines in the rand() call by two). Crazy idea: one might be able to use `File::Stream`, which let's you use a regexp in $/ to read two lines at once. Of course, that would be a bad idea in production because the module's extremely slow. — tsee, Jun 10 '10 at 09:43

score 3 · Answer 3 · answered Jun 09 '10 at 20:56

3

sed "1,5d" < FILENAME | sort -R | head -2

answered Jun 09 '10 at 20:56

zed_0xff

32,417
7
53
72

1

Interesting. but where is the "-R" switch (presumably for "Randomize" supported? I just checked on Linux (RHEL5.4, coreutils 5.97...), Mac OS X (10.5.8), and FreeBSD (6.4). – Jim Dennis Jun 09 '10 at 21:22

Nas Banov · Answer 4 · 2010-06-10T01:50:12.310

Python solution - reads file only once and requires little memory

Invoke like so getRandomItems(file('myHuge.log'), 5, 2) - will return list of 2 lines

from random import randrange

def getRandomItems(f, skipFirst=0, numItems=1):
    for _ in xrange(skipFirst):
        f.next()
    n = 0; r = []
    while True:
        try:
            nxt = [f.next() for _ in range(numItems)]
        except StopIteration: break
        n += 1
        if not randrange(n):
            r = nxt
    return r

Returns empty list if it could not get the first passable items from f. The code's only requirement is that argument f is an iterator (supports next() method). Hence we can pass something different than file, say we want to see the distribution:

>>> s={}
>>> for i in xrange(5000):
...     r = getRandomItems(iter(xrange(50)))[0]
...     s[r] = 1 + s.get(r,0)
... 
>>> for i in s: 
...     print i, '*' * s[i]
... 
0 ***********************************************************************************************
1 **************************************************************************************************************
2 ******************************************************************************************************
3 ***************************************************************************
4 *************************************************************************************************************************
5 ********************************************************************************
6 **********************************************************************************************
7 ***************************************************************************************
8 ********************************************************************************************
9 ********************************************************************************************
10 ***********************************************************************************************
11 ************************************************************************************************
12 *******************************************************************************************************************
13 *************************************************************************************************************
14 ***************************************************************************************************************
15 *****************************************************************************************************
16 ********************************************************************************************************
17 ****************************************************************************************************
18 ************************************************************************************************
19 **********************************************************************************
20 ******************************************************************************************
21 ********************************************************************************************************
22 ******************************************************************************************************
23 **********************************************************************************************************
24 *******************************************************************************************************
25 ******************************************************************************************
26 ***************************************************************************************************************
27 ***********************************************************************************************************
28 *****************************************************************************************************
29 ****************************************************************************************************************
30 ********************************************************************************************************
31 ********************************************************************************************
32 ****************************************************************************************************
33 **********************************************************************************************
34 ****************************************************************************************************
35 **************************************************************************************************
36 *********************************************************************************************
37 ***************************************************************************************
38 *******************************************************************************************************
39 **********************************************************************************************************
40 ******************************************************************************************************
41 ********************************************************************************************************
42 ************************************************************************************
43 ****************************************************************************************************************************
44 ****************************************************************************************************************************
45 ***********************************************************************************************
46 *****************************************************************************************************
47 ***************************************************************************************
48 ***********************************************************************************************************
49 ****************************************************************************************************************

tgray · Answer 5 · 2010-06-09T21:31:37.477

1

Answer is in Python. Assuming you can read a whole file into memory.

#using python 2.6
import sys
import os
import itertools
import random

def main(directory, num_files=5, num_entries=5):
    file_paths = os.listdir(directory)

    # get a random sampling of the available paths
    chosen_paths = random.sample(file_paths, num_files)

    for path in chosen_paths:
        chosen_entries = get_random_entries(path, num_entries)
        for entry in chosen_entries:
            # do something with your chosen entries
            print entry

def get_random_entries(file_path, num_entries):
    with open(file_path, 'r') as file:
        # read the lines and slice off the headers
        lines = file.readlines()[5:]

        # group the lines into pairs (i.e. entries)
        entries = list(itertools.izip_longest(*[iter(lines)]*2))

        # return a random sampling of entries
        return random.sample(entries, num_entries)

if __name__ == '__main__':
    #use optparse here to do fancy things with the command line args
    main(sys.argv[1:])

Links: itertools, random, optparse

edited Jun 09 '10 at 21:31

answered Jun 09 '10 at 21:24

tgray

8,826
5
36
41

My python version on the server where the log is . 2.4.3 – AlgoMan Jun 10 '10 at 00:47
@EnTerr It's exactly as many lines as your answer if you take out the whitespace and comments; both of which are lacking in yours. – Jon-Eric Jun 10 '10 at 01:40
ugh, loading all lines in memory is silly - say there a million lines in file of 50MB? – Nas Banov Jun 10 '10 at 02:04
@EnTerr, I wouldn't worry about 50MB or a million lines. The problems might begin when we get up near 500MB - 1GB. because at any given point there could be up to two copies of the files content in memory (plus a little overhead from the other variables). – tgray Jun 10 '10 at 15:30

score 0 · Answer 6 · answered Nov 04 '12 at 00:04

Two other means to do so: 1- by generators (may still require a lot of memory): http://www.usrsb.in/Picking-Random-Items--Take-Two--Hacking-Python-s-Generators-.html

2- by a clever seeking (best method actually): http://www.regexprn.com/2008/11/read-random-line-in-large-file-in.html

I here copy the code of the clever Jonathan Kupferman:

#!/usr/bin/python

import os,random

filename="averylargefile"
file = open(filename,'r')

#Get the total file size
file_size = os.stat(filename)[6]

while 1:
      #Seek to a place in the file which is a random distance away
      #Mod by file size so that it wraps around to the beginning
      file.seek((file.tell()+random.randint(0,file_size-1))%file_size)

      #dont use the first readline since it may fall in the middle of a line
      file.readline()
      #this will return the next (complete) line from the file
      line = file.readline()

      #here is your random line in the file
      print line

score 0 · Answer 7 · answered Jun 09 '10 at 21:36

0

Another Python option; reading the contents of all files into memory:

import random
import fileinput

def openhook(filename, mode):
    f = open(filename, mode)
    headers = [f.readline() for _ in range(5)]
    return f

num_entries = 3
lines = list(fileinput.input(openhook=openhook))
print random.sample(lines, num_entries)

answered Jun 09 '10 at 21:36

Matt Anderson

19,311
11
41
57

My python version on the server where the log is . 2.4.3 – AlgoMan Jun 10 '10 at 00:44

Randomly selecting lines from files

7 Answers7

Python solution - reads file only once and requires little memory

Linked