Monte Carlo
As an alternative to reading the file line by line (use the method by David Robinson to open the gzip file as a standard file object):
If all the lines are roughly the same size, you can jump to a random position in the file, backtrack character by character until you reach a newline, and read the full line from that point. If the lines are exactly the same size, this method is exact.
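For the equal-length case the idea can be sketched as follows (Python 3; the file name lines.txt and its contents are made up for illustration):

```python
import os, random

# Build a toy file of equal-length lines (hypothetical data for illustration)
with open("lines.txt", "w") as f:
    for word in ["alpha", "bravo", "charl", "delta"]:
        f.write(word + "\n")

def random_line_equal(filename, file_size):
    """Jump to a random byte, backtrack to the previous newline,
    and read the full line from that point."""
    with open(filename, "rb") as f:
        pos = random.randrange(file_size)
        # Backtrack character by character until the previous newline
        # (or the start of the file)
        while pos > 0:
            f.seek(pos - 1)
            if f.read(1) == b"\n":
                break
            pos -= 1
        f.seek(pos)
        return f.readline().decode().rstrip("\n")

print(random_line_equal("lines.txt", os.path.getsize("lines.txt")))
```

Since every line here occupies exactly 6 bytes (5 characters plus a newline), every byte position maps to exactly one line and each line is equally likely.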
If, however, the lines are not the same size but you know the distribution of line lengths, you can use the method above but reject the overabundant lengths x with some probability P(x), chosen so that the probability of returning any given line in the file is constant.
Example:
To make this simple, let's say you have a 5-line file with lengths X={2,3,5,5,5}, for a total of 20 characters (ignoring newlines). Picking a random point in the file, you have a 10% (2/(2+3+5+5+5)) chance of landing on the length-2 line, a 15% chance for the length-3 line, and a 75% chance of landing on one of the three length-5 lines. What you want is a 20%/20%/60% probability respectively. The correction weights are the ratios of desired to actual probability, W=(2, 4/3, 4/5): these are the numbers such that 10%*w1 = 20%, 15%*w2 = 20%, 75%*w3 = 60%. Rescaling by the largest weight gives the acceptance probabilities:
A(2) = w1/w1 = 1
A(3) = w2/w1 = 2/3
A(5) = w3/w1 = 2/5
Note that the chance of landing on a given line and accepting it is then the same for every line:
10%*1 = 15%*(2/3) = 25%*(2/5) = 10%
so each accepted sample is uniform over the five lines, as it should be.
For your algorithm: choose a random point in the file and find the line containing it. If that line has length 2, accept it. Otherwise pick a random number q in [0,1]: accept a length-3 line if q < 2/3 and a length-5 line if q < 2/5; otherwise reject that spot and try again.
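The acceptance scheme can be sanity-checked by simulation. A sketch (Python 3): land on a line with probability proportional to its length (ignoring newlines, as above), then accept with the desired/actual probability ratio rescaled so the largest acceptance probability is 1:

```python
import random
from collections import Counter

lengths = [2, 3, 5, 5, 5]      # line lengths from the example
total = sum(lengths)           # 20 characters, ignoring newlines

# Desired/actual probability ratio per line, rescaled so the largest
# acceptance probability is exactly 1
raw = [(total / len(lengths)) / x for x in lengths]  # 2, 4/3, 4/5, 4/5, 4/5
accept = [r / max(raw) for r in raw]                 # 1, 2/3, 2/5, 2/5, 2/5

random.seed(0)
counts = Counter()
for _ in range(100_000):
    # Land on a line with probability proportional to its length,
    # then accept or reject it
    i = random.choices(range(len(lengths)), weights=lengths)[0]
    if random.random() < accept[i]:
        counts[i] += 1

kept = sum(counts.values())
print([round(counts[i] / kept, 3) for i in range(5)])
```

Each line should end up with roughly 20% of the accepted samples.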
When do you know the length distribution?
I'll admit that knowing the exact distribution of line lengths may seem restrictive; however, there are many procedurally generated files (log files, hardware data readouts, etc.) where the distribution is known exactly. In addition, if the distribution is known only approximately, we can use the method above to determine the rejection criteria as a best guess and go from there.
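For the best-guess case, the acceptance probabilities can be computed directly from an estimated length histogram; a sketch with a hypothetical sample of line lengths:

```python
from collections import Counter

# Hypothetical sample of line lengths, e.g. measured from the first
# few thousand lines of a log file
sample = [2, 3, 5, 5, 5, 2, 5, 3, 5, 5]
counts = Counter(sample)
n_lines = len(sample)
n_chars = sum(sample)

# Estimated chance that a random character falls on a line of length x
p_land = {x: x * c / n_chars for x, c in counts.items()}
# Desired chance of returning a line of length x (uniform over lines)
p_want = {x: c / n_lines for x, c in counts.items()}

# Acceptance probability per length: desired/actual, rescaled so the
# largest value is 1
raw = {x: p_want[x] / p_land[x] for x in counts}
scale = max(raw.values())
accept = {x: w / scale for x, w in raw.items()}
print(accept)
```

The better the sample reflects the true distribution, the closer the accepted lines come to uniform.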
Monte Carlo?
This may not be the best method (who can compete with Knuth?), but it may offer some insight into solving the problem in a completely different way. For those unfamiliar, the method above is a form of rejection sampling, a Monte Carlo method.
How to seek in a gzip file?
Per the OP's request, here is a primer on seeking through a Python file object.
import gzip, random

# Helper function to create some test data
def line(char, n):
    return char * n + "\n"

# Create the test data as in the example (gzip wants bytes in Python 3)
filename = "test.zip"
FOUT = gzip.open(filename, 'wb')
FOUT.write(line('a', 2).encode())
FOUT.write(line('b', 3).encode())
FOUT.write(line('c', 5).encode())
FOUT.write(line('d', 5).encode())
FOUT.write(line('e', 5).encode())
FOUT.close()

# Since we know the distribution, we know the uncompressed length
length = (2 + 3 + 5 + 5 + 5) + 5  # character counts plus 5 newlines

# Print 7 random points in the file
FIN = gzip.open(filename, 'rb')
for n in range(7):
    FIN.seek(random.randrange(length), 0)
    print("Position %3i, char: %s" % (FIN.tell(), [FIN.read(1).decode()]))
The output for a sample run looks like:
Position 8, char: ['c']
Position 23, char: ['e']
Position 15, char: ['d']
Position 10, char: ['c']
Position 4, char: ['b']
Position 16, char: ['d']
Position 2, char: ['\n']
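Putting the pieces together, here is a sketch (Python 3) of the full sampler over the same test file. One subtlety: when sampling over all 25 bytes, each line is hit with probability proportional to its length plus one (its trailing newline), so for lines of length 2, 3 and 5 the desired/actual ratios, rescaled so the largest is 1, give acceptance probabilities 1, 3/4 and 1/2.

```python
import gzip, random

# Recreate the example file: lines of lengths 2, 3, 5, 5, 5
filename = "test.zip"
with gzip.open(filename, "wb") as f:
    for char, n in zip("abcde", [2, 3, 5, 5, 5]):
        f.write((char * n + "\n").encode())

length = (2 + 3 + 5 + 5 + 5) + 5       # uncompressed size, newlines included
accept = {2: 1.0, 3: 3 / 4, 5: 1 / 2}  # acceptance probability per line length

def random_line(filename, length):
    """Seek to a random byte, backtrack to the start of that line, read
    the line, then accept or reject it based on its length."""
    with gzip.open(filename, "rb") as f:
        while True:
            pos = random.randrange(length)
            # Backtrack to the previous newline (or the start of the file)
            while pos > 0:
                f.seek(pos - 1)
                if f.read(1) == b"\n":
                    break
                pos -= 1
            f.seek(pos)
            text = f.readline().decode().rstrip("\n")
            if random.random() < accept[len(text)]:
                return text

print(random_line(filename, length))
```

Note that backward seeks force gzip to rewind and re-decompress from the start of the stream, so each sample costs O(file size); fine for a sketch, but expensive on a large archive.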