
Let's say I have a 531 GB gzipped text file with exactly 512 548 457 601 475 lines separated by '\n' and I want to get a random line out of it without splitting the file. (Don't worry, it's not really THAT large; I just wanted to make the point that it's a huge file and that I know how many lines it has.)

How I would normally do it with a smaller compressed file:

import fileinput
import gzip
import random

lines = []

# Read every (decompressed) line into memory
for line in fileinput.input(filename, openhook=gzip.open):
    lines.append(line)

listLength = len(lines)
randomListLineOne = lines[random.randrange(listLength)]
randomListLineTwo = lines[random.randrange(listLength)]
...

What I've found on the topic:

How do I read a random line from one file in python?

import random

def random_line(afile):
    line = next(afile)
    for num, aline in enumerate(afile):
        # Replace the current pick with this line with probability
        # 1/(num + 2) (reservoir sampling with k = 1)
        if random.randrange(num + 2): continue
        line = aline
    return line

Waterman's "Reservoir Algorithm" translated by Alex Martelli from Knuth's "The Art of Computer Programming"

Could you adapt this for compressed files? I tried passing my compressed file in as afile, but that didn't work. Or is there another (easier) way to achieve this?

James Rather
  • Do you realize the Waterman algorithm _is_ reading the lines in the file? – cheeken Feb 16 '12 at 18:58
  • Do the lines have a set size? Like always 512 bytes per line? – user978122 Feb 16 '12 at 19:00
  • @cheeken- he means not reading the entire file into memory at once. – David Robinson Feb 16 '12 at 19:06
  • @user978122 They vary very slightly. Would reading in chunks (which I'm assuming you're thinking about) still work if the file's compressed? – James Rather Feb 16 '12 at 19:07
  • I'm thinking a calculated jump. If what you want is on line 732, and each line is 512 bytes, then you only need to move the iostream iterator to 732 * 512. Python isn't my primary language, but I am attempting similar (as it just so happens) in C# and a Zip file. Hell, if you aren't hurting for space, just pad out the rest of the lines with spaces to get the right size [512] (the zip component should compress that part for you, so it's more of a non-issue). You will just need to adapt the idea to fit your needs. ^_^ – user978122 Feb 16 '12 at 19:18
  • @user978122 Thank you, Hooked proposed something similar below. I would not even know where to begin implementing something like this. But I'm working on a zip based filestructure that might solve it. We'll see. – James Rather Feb 17 '12 at 12:33
  • gzip is a stream compression format so you can't just jump around in the file. You'd have to read the block information from the front of the file, estimate which block the line you want fell into, read the block header to see which compression method was used on that block, uncompress that block, and then extract the line of uncompressed text out of it like you would a small file. – Drewfer Feb 17 '12 at 15:44
  • Hmm, perhaps. http://dotnetzip.codeplex.com/ -> this is for C#, but some of their samples include an MemoryStream / IOStream. I liked the idea of being able to extract information from a zipped XML file without having to unzip it first. I think they mention GZip as being supported, but I have not had time to read more. Unfortunately, he's programming in Python, so this may not be of use to him; however, if he can find a similar implementation in Python... – user978122 Feb 17 '12 at 17:19

3 Answers


Monte Carlo

As an alternative to reading the file line by line*

(*use the method by David Robinson to read the gzip file as a standard file):

If all the lines are roughly the same size, you could jump to a random position in the file, backtrack character by character until you reach a newline, and read the full line from that point. If the lines are exactly the same size, this method is exact.

If, however, the lines are not all the same size but you know the distribution of line lengths x, you can use the same method but reject hits on the over-represented (longer) lines with a length-dependent probability, chosen so that the chance of ending up with any particular line is the same.

Example:

To make this simple, let's say you have a 5-line file with line lengths X={2,3,5,5,5}, i.e. 20 characters in total. Picking a random point in the file, you have a 10% (2/20) chance of landing on the length-2 line, a 15% (3/20) chance of landing on the length-3 line, and a 75% (15/20) chance of landing on one of the three length-5 lines. What you want is a 20%/20%/60% split respectively, so that every individual line comes up 20% of the time. The correction weights are the target probabilities divided by the raw ones, W=(2, 4/3, 4/5); these are the numbers such that 10%*w1 = 20%, 15%*w2 = 20%, 75%*w3 = 60%. Dividing by the largest weight turns them into acceptance probabilities for each length:

 A(length 2) = w1/w1 = 1
 A(length 3) = w2/w1 = 2/3
 A(length 5) = w3/w1 = 2/5

Note that (raw chance of landing on a line) * (acceptance) is the same for every line: 10%*1 = 15%*(2/3) = 25%*(2/5) = 10% per line, as it should be for a uniform choice.

For your algorithm, choose a random point in the file. If the associated line has length 3, pick a random number q in [0,1]. If q > 2/3, reject that spot and try again; if it is less, stop and return that line.
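
In code, the accept/reject loop might look something like the sketch below (my own illustration, not code from this answer). Here random_line_at_offset() is a hypothetical helper that seeks to a random byte and returns the full line containing it (see the seek primer below), and min_length is the length of the shortest line, counting its trailing newline just as len(line) does:

import random

def pick_uniform_line(random_line_at_offset, min_length):
    # A random byte lands on a line with probability proportional to that
    # line's length, so accept a hit on a line of length x with probability
    # min_length/x: the shortest lines are always accepted and every line
    # ends up equally likely overall.
    while True:
        line = random_line_at_offset()
        if random.random() < float(min_length) / len(line):
            return line

The expected number of attempts is (mean line length)/(shortest line length), so the loop stays cheap as long as the lengths don't vary wildly.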

When do you know the distribution of line lengths?

I'll admit that knowing the exact distribution of line lengths may seem restrictive; however, there are many procedurally generated files (log files, hardware data readouts, etc.) where the distribution is known exactly. In addition, if the distribution is known only approximately, you can use the method above to set the rejection criteria as a best guess and go from there.

Monte Carlo?

This may not be the best method (who can compete with Knuth?), but it may offer some insight into solving the problem in a completely different way. For those unfamiliar, the method above is a form of rejection sampling, a Monte Carlo method.

How to seek in a gzip file?

Per OP's request, here is a primer on seeking through a Python file object.

import gzip, random

# Helper function to create some test data
def line(char,n): 
    return ''.join([("%s"%char)*n,"\n"])

# Create the test data as in the example
filename = "test.zip"
FOUT = gzip.open(filename,'wb')
FOUT.write(line('a',2))
FOUT.write(line('b',3))
FOUT.write(line('c',5))
FOUT.write(line('d',5))
FOUT.write(line('e',5))
FOUT.close()

# Since we know the distribution, we know the length
length = 2+3+3*5+5 # 5 newlines

# Print 7 random points in the file
FIN = gzip.open(filename,'rb')
for n in xrange(7):
    FIN.seek(random.randrange(length),0)
    print "Position %3i, char: %s" %(FIN.tell(), [FIN.read(1)])

The output from a sample run looks like this:

Position   8, char: ['c']
Position  23, char: ['e']
Position  15, char: ['d']
Position  10, char: ['c']
Position   4, char: ['b']
Position  16, char: ['d']
Position   2, char: ['\n']
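
Putting the pieces together, here is a sketch of my own (not part of the original answer) that seeks to a random byte, backs up to the previous newline, and reads the complete line containing that byte. It reuses the filename and length variables from the snippet above and assumes no line is longer than max_line_length bytes:

def random_full_line(fname, unc_length, max_line_length=80):
    f = gzip.open(fname, 'rb')
    pos = random.randrange(unc_length)
    # Read a window of the uncompressed stream that ends at the random
    # position; the last '\n' in that window (if any) marks where the line
    # containing pos begins.
    start = max(0, pos - max_line_length)
    f.seek(start)
    window = f.read(pos - start)
    line_start = start + window.rfind("\n") + 1   # rfind gives -1 -> file start
    f.seek(line_start)
    result = f.readline()
    f.close()
    return result

for n in xrange(5):
    print "Random line: %r" % random_full_line(filename, length)

Without the rejection step described earlier this picks lines with probability proportional to their length, which is close enough to uniform when the lines barely differ in size.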
Hooked
  • Thank You, I've only used Monte Carlo in rendering before and would never have thought of applying it here. My lines also don't vary that much, mostly they are just one character longer or shorter; so I don't think MC is necessary. But this would be awesome for files with a greater variance. How would you go to a random line between 0 and the known line length in the file and backtrack to '\n'? Because that alone would solve my question. I haven't found anything on that topic. – James Rather Feb 17 '12 at 12:22
  • @JamesRather I've added the example you've asked for. In the case where the variance is large the MC method would be slower (more reasons for rejection!). If the lines are a known fixed length and are constrained to a small set, then the MC method would excel. For your case, if you don't mind the slight inaccuracy you can simply seek to a random point in the file. – Hooked Feb 17 '12 at 14:51
  • @JamesRather I put the MC method up because I've never seen anyone else do it that way either, I thought it would be a fun way to attack the problem. – Hooked Feb 17 '12 at 14:55
  • seek() works like a charm. Thank you very much. I read somewhere that it had to read the whole file into memory first and so didn't bother with it anymore. That was obviously wrong information. I'll use your MC approach for sure when I get a dataset that calls for it. Thanks again. – James Rather Feb 17 '12 at 17:49

You can simply use the "read a random line from one file in Python" approach, but open the file as a gzip file rather than a regular file using the gzip package.

import gzip
import random

def random_line(afile):
    line = next(afile)
    for num, aline in enumerate(afile):
        if random.randrange(num + 2): continue
        line = aline
    return line

afile = gzip.open("myfile.zip")
print random_line(afile)
afile.close()
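
If you need several random lines, as in the question's randomListLineOne/randomListLineTwo, calling random_line() repeatedly costs one full pass per line. A k-element reservoir (a sketch of my own generalizing the same idea, not part of this answer) returns a uniform sample of k distinct lines in a single pass:

import gzip
import random

def random_lines(afile, k):
    # Algorithm R: keep a reservoir of k lines; line i (0-based) replaces a
    # randomly chosen slot with probability k/(i+1), which leaves every line
    # equally likely to appear in the final sample.
    reservoir = []
    for i, line in enumerate(afile):
        if i < k:
            reservoir.append(line)
        else:
            j = random.randrange(i + 1)
            if j < k:
                reservoir[j] = line
    return reservoir

afile = gzip.open("myfile.zip")
print random_lines(afile, 2)
afile.close()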
David Robinson

Forgive the (very) late answer, but you can use the seek() method to position yourself within the uncompressed data, provided you know the uncompressed size, which gunzip -l reports.
Then throw away the next read, as it will probably be a partial line, and use the subsequent read as your random data.

Print 10 random lines from a gzipped text file.

import random
import gzip, os

f = gzip.open("some.txt.gz", "r")
# The second field reported by `gunzip -lq` is the uncompressed size in bytes
unc_size = int(os.popen('gunzip -lq some.txt.gz').read().split()[1])
for x in range(10):
    f.seek(random.randint(0, unc_size - 1))
    # Throw away the rest of the current (probably partial) line
    dump = f.readline()
    pos = f.tell()
    line = f.readline()
    if not line:
        # The random position fell inside the last line; wrap to the start
        f.seek(0)
        pos = 0
        line = f.readline()
    print "Random line from byte pos ", pos, line
f.close()
Rolf of Saxony