I have a bunch of files, and every file has a header of 5 lines. In the rest of the file, each pair of lines forms an entry. I need to randomly select entries from these files. How can I select random files and random entries (pairs of lines, excluding the header)?
-
How big are these files? – Felix Kling Jun 09 '10 at 20:51
-
Does the header tell you the number of entries in the file? Are the entries fixed in size within a given file? – Jun 09 '10 at 20:55
-
About 1000000 entries in each file. – AlgoMan Jun 09 '10 at 20:55
-
Please post an example of what you're dealing with, and what you've tried so far. – Zaid Jun 09 '10 at 20:56
-
See this question for selecting a random line from a file: http://stackoverflow.com/questions/232237/whats-the-best-way-to-return-a-random-line-in-a-text-file-using-c – Mark Ransom Jun 09 '10 at 21:00
-
The header just gives the time stamp, what each row contains, and things like that. – AlgoMan Jun 09 '10 at 21:01
-
@Zaid I was writing a Python script that takes filenames as input, randomly selects a file, reads it, and randomly selects a couple of lines at (random number + header length). – AlgoMan Jun 09 '10 at 21:10
-
@AlgoMan: You still haven't answered one question. Are the entries fixed in size? – Jun 09 '10 at 21:42
-
@Moron Usually there is one entry per day, but there might be missing data, so the size is not fixed. – AlgoMan Jun 09 '10 at 21:46
-
@Ether No, it's not homework. – AlgoMan Jun 10 '10 at 18:39
7 Answers
You may find perlfaq5 useful.

-
It's worth noting that it's a single-pass algorithm which only keeps two lines in memory at any time. – Schwern Jun 09 '10 at 22:48
-
I would have posted this as CW, since you're (presumably) not the author of this faq item. – Ether Jun 09 '10 at 22:55
-
+1 Clever algorithm. I learned something today. Nit: I guess nowadays, we should drop the preceding "srand". – tsee Jun 10 '10 at 08:17
If the file is small enough, read the pairs of lines into memory and select randomly from that data structure. If the file is too large, Eugene Y provides the right answer: use reservoir sampling.
Here's an intuitive explanation for the algorithm.
Process the file line by line.
pick = line, with probability 1/N, where N = line number
In other words, on line 1, we will pick line 1 with 1/1
probability. On line 2, we will change the pick to line 2, with 1/2
probability. On line 3, we will change the pick to line 3, with 1/3
probability. Etc.
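The rule above can be sketched in a few lines of Python 3. This is only an illustration, not the perlfaq5 code: the function name `pick_random_line` and the optional `rng` parameter are made up for this example, and the input can be any iterable of lines, such as an open file.

```python
import random

def pick_random_line(lines, rng=random):
    """Return one item from an iterable, each with equal probability.

    Hypothetical helper for illustration; `rng` is any object with a
    randrange() method (the random module itself by default).
    """
    pick = None
    for n, line in enumerate(lines, start=1):
        # randrange(n) is 0 with probability 1/n, so line n replaces
        # the current pick with probability 1/n.
        if rng.randrange(n) == 0:
            pick = line
    return pick

if __name__ == "__main__":
    print(pick_random_line(["first\n", "second\n", "third\n"]), end="")
```

Only the current pick is kept in memory, so the input can be arbitrarily long. An empty input yields None.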
For an intuitive proof, imagine a file with 3 lines:
        1            Pick line 1.
       / \
     .5  .5
     /     \
    2       1        Switch to line 2?
   / \     / \
 .67 .33 .33 .67
 /     \ /     \
2       3       1    Switch to line 3?
The probability for each outcome:
Line 1: .5 * .67 = 1/3
Line 2: .5 * .67 = 1/3
Line 3: .5 * .33 * 2 = 1/3
From there, the rest is induction. For example, suppose the file has 4 lines. We've already convinced ourselves that as of line 3, every line so far (1, 2, 3) will have an equal chance of being our current selection. When we advance to line 4, it will have a 1/4 chance of being picked -- exactly what it should be, thus reducing the probabilities on the previous 3 lines by exactly the right amount (1/3 * 3/4 = 1/4).
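The induction can also be checked with exact arithmetic. The following Python 3 sketch (the function name `selection_probabilities` is invented for this illustration) multiplies the chance that line i is picked when it is read, 1/i, by the chance that it survives every later line k, which is 1 - 1/k.

```python
from fractions import Fraction

def selection_probabilities(num_lines):
    """Exact probability that each line ends up as the final pick."""
    probs = []
    for i in range(1, num_lines + 1):
        p = Fraction(1, i)               # line i is picked when it is read
        for k in range(i + 1, num_lines + 1):
            p *= Fraction(k - 1, k)      # and is not replaced at line k
        probs.append(p)
    return probs

if __name__ == "__main__":
    for n in (3, 4, 10):
        print(n, selection_probabilities(n))
```

For every file length n, each entry of the returned list should equal 1/n, because the product telescopes: 1/i * i/(i+1) * ... * (n-1)/n = 1/n.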
Here's the Perl FAQ answer, adapted to your problem.
use strict;
use warnings;

# Ignore 5 lines.
<> for 1 .. 5;

# Use reservoir sampling to select pairs from remaining lines.
my (@picks, $n);
until (eof){
    my @lines;
    $lines[$_] = <> for 0 .. 1;
    $n ++;
    @picks = @lines if rand($n) < 1;
}
print @picks;

-
Great explanation of the algorithm. Nit: I think the OP said that two lines make up an entry, so your program would need a minor modification to account for that (i.e. add another readline, divide no. of lines in the rand() call by two). Crazy idea: one might be able to use `File::Stream`, which lets you use a regexp in $/ to read two lines at once. Of course, that would be a bad idea in production because the module's extremely slow. – tsee Jun 10 '10 at 09:43
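# paste joins each pair of lines with a tab so an entry stays together (assumes the data contains no tabs)
sed "1,5d" < FILENAME | paste - - | sort -R | head -1 | tr '\t' '\n'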

-
Interesting, but where is the "-R" switch (presumably for "Randomize") supported? I just checked on Linux (RHEL5.4, coreutils 5.97...), Mac OS X (10.5.8), and FreeBSD (6.4). – Jim Dennis Jun 09 '10 at 21:22
Python solution - reads the file only once and requires little memory
Invoke like so: getRandomItems(file('myHuge.log'), 5, 2)
- will return a list of 2 lines
from random import randrange

def getRandomItems(f, skipFirst=0, numItems=1):
    for _ in xrange(skipFirst):
        f.next()
    n = 0; r = []
    while True:
        try:
            nxt = [f.next() for _ in range(numItems)]
        except StopIteration: break
        n += 1
        if not randrange(n):
            r = nxt
    return r
It returns an empty list if f runs out before the first group of numItems items can be read. The code's only requirement is that the argument f is an iterator (supports the next() method). Hence we can pass something other than a file; say we want to see the distribution:
>>> s={}
>>> for i in xrange(5000):
...     r = getRandomItems(iter(xrange(50)))[0]
...     s[r] = 1 + s.get(r,0)
...
>>> for i in s:
...     print i, '*' * s[i]
...
0 ***********************************************************************************************
1 **************************************************************************************************************
2 ******************************************************************************************************
3 ***************************************************************************
4 *************************************************************************************************************************
5 ********************************************************************************
6 **********************************************************************************************
7 ***************************************************************************************
8 ********************************************************************************************
9 ********************************************************************************************
10 ***********************************************************************************************
11 ************************************************************************************************
12 *******************************************************************************************************************
13 *************************************************************************************************************
14 ***************************************************************************************************************
15 *****************************************************************************************************
16 ********************************************************************************************************
17 ****************************************************************************************************
18 ************************************************************************************************
19 **********************************************************************************
20 ******************************************************************************************
21 ********************************************************************************************************
22 ******************************************************************************************************
23 **********************************************************************************************************
24 *******************************************************************************************************
25 ******************************************************************************************
26 ***************************************************************************************************************
27 ***********************************************************************************************************
28 *****************************************************************************************************
29 ****************************************************************************************************************
30 ********************************************************************************************************
31 ********************************************************************************************
32 ****************************************************************************************************
33 **********************************************************************************************
34 ****************************************************************************************************
35 **************************************************************************************************
36 *********************************************************************************************
37 ***************************************************************************************
38 *******************************************************************************************************
39 **********************************************************************************************************
40 ******************************************************************************************************
41 ********************************************************************************************************
42 ************************************************************************************
43 ****************************************************************************************************************************
44 ****************************************************************************************************************************
45 ***********************************************************************************************
46 *****************************************************************************************************
47 ***************************************************************************************
48 ***********************************************************************************************************
49 ****************************************************************************************************************
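The same single-pass idea can be stretched across several files, which is what the question asks for. Below is a minimal Python 3 sketch, not the code above: the names `iter_entries`, `pick_random_entry`, `HEADER_LINES` and `LINES_PER_ENTRY` are invented for this example, and it assumes the 5-line header and 2-line entries described in the question.

```python
import random

HEADER_LINES = 5      # from the question: every file starts with a 5-line header
LINES_PER_ENTRY = 2   # from the question: an entry is a pair of lines

def iter_entries(path):
    """Yield each complete entry of one file as a tuple of lines, skipping the header."""
    with open(path) as f:
        for _ in range(HEADER_LINES):
            if not f.readline():
                return                      # file is shorter than its header
        while True:
            entry = tuple(f.readline() for _ in range(LINES_PER_ENTRY))
            if not all(entry):              # end of file, or a trailing incomplete entry
                return
            yield entry

def pick_random_entry(paths, rng=random):
    """Return (path, entry), chosen uniformly over all entries in all files."""
    pick = None
    n = 0
    for path in paths:
        for entry in iter_entries(path):
            n += 1
            if rng.randrange(n) == 0:       # true with probability 1/n
                pick = (path, entry)
    return pick
```

Counting entries across all files keeps every entry equally likely even when the files differ in length. Picking a file first and then an entry inside it would be faster, but it favours entries in shorter files. The function returns None when no file contains a complete entry.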

This answer is in Python and assumes you can read a whole file into memory.
#using python 2.6
import sys
import os
import itertools
import random

def main(directory, num_files=5, num_entries=5):
    # os.listdir returns bare names, so join them with the directory
    file_paths = [os.path.join(directory, name) for name in os.listdir(directory)]
    # get a random sampling of the available paths
    chosen_paths = random.sample(file_paths, num_files)
    for path in chosen_paths:
        chosen_entries = get_random_entries(path, num_entries)
        for entry in chosen_entries:
            # do something with your chosen entries
            print entry

def get_random_entries(file_path, num_entries):
    with open(file_path, 'r') as file:
        # read the lines and slice off the headers
        lines = file.readlines()[5:]
        # group the lines into pairs (i.e. entries)
        entries = list(itertools.izip_longest(*[iter(lines)]*2))
        # return a random sampling of entries
        return random.sample(entries, num_entries)

if __name__ == '__main__':
    #use optparse here to do fancy things with the command line args
    # the first command line argument is the directory to sample from
    main(sys.argv[1])

-
-
@EnTerr It's exactly as many lines as your answer if you take out the whitespace and comments; both of which are lacking in yours. – Jon-Eric Jun 10 '10 at 01:40
-
ugh, loading all lines in memory is silly - say there are a million lines in a file of 50MB? – Nas Banov Jun 10 '10 at 02:04
-
@EnTerr, I wouldn't worry about 50MB or a million lines. The problems might begin when we get up near 500MB - 1GB, because at any given point there could be up to two copies of the file's content in memory (plus a little overhead from the other variables). – tgray Jun 10 '10 at 15:30
Two other ways to do this:
1- with generators (may still require a lot of memory): http://www.usrsb.in/Picking-Random-Items--Take-Two--Hacking-Python-s-Generators-.html
2- with clever seeking (the best method, actually): http://www.regexprn.com/2008/11/read-random-line-in-large-file-in.html
Here I copy the code by the clever Jonathan Kupferman:
#!/usr/bin/python
import os,random

filename="averylargefile"
file = open(filename,'r')

#Get the total file size
file_size = os.stat(filename)[6]

while 1:
    #Seek to a place in the file which is a random distance away
    #Mod by file size so that it wraps around to the beginning
    file.seek((file.tell()+random.randint(0,file_size-1))%file_size)

    #dont use the first readline since it may fall in the middle of a line
    file.readline()
    #this will return the next (complete) line from the file
    line = file.readline()

    #here is your random line in the file
    print line
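For comparison, here is a hedged Python 3 sketch of the same seeking idea wrapped in a function that returns a single line. It is not Kupferman's code: the name `random_line_by_seek` and the `rng` parameter are invented for this example, the file is opened in binary mode so that byte offsets are valid seek targets, and the wrap-around to the first line is an assumption about how to handle landing in the last line.

```python
import os
import random

def random_line_by_seek(path, rng=random):
    """Return one line (as bytes) found by seeking to a random byte offset."""
    size = os.path.getsize(path)
    if size == 0:
        return b""
    with open(path, "rb") as f:
        f.seek(rng.randrange(size))
        f.readline()          # discard the line we landed in (probably partial)
        line = f.readline()   # the next complete line
        if not line:          # we landed in the last line: wrap to the first
            f.seek(0)
            line = f.readline()
        return line
```

This avoids reading the whole file, but it is not uniform: a line is chosen whenever the offset falls inside the line before it, so lines that follow long lines are picked more often. It also does not skip the header or keep the two lines of an entry together, so for the question as asked it would need extra handling.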

Another Python option; reading the contents of all files into memory:
import random
import fileinput

def openhook(filename, mode):
    f = open(filename, mode)
    # skip the 5 header lines of each file
    for _ in range(5):
        f.readline()
    return f

num_entries = 3
lines = list(fileinput.input(openhook=openhook))
# group consecutive lines into two-line entries
entries = zip(lines[::2], lines[1::2])
print random.sample(entries, num_entries)
