
I need to use Python to take N lines from a large txt file. These files are basically tab-delimited tables. My task has the following constraints:

  • These files may contain headers (some have multi-line headers).
  • Headers need to appear in the output in the same order.
  • Each line can be taken only once.
  • The largest file currently is about 150GB (about 60 000 000 lines).
  • Lines are roughly the same length in a file, but may vary between different files.
  • I will usually be taking 5000 random lines (but I may need up to 1 000 000 lines).

Currently I have written the following code:

import os, random

# 'options' and 'args' come from the script's command-line parsing (not shown)
inputSize = os.path.getsize(options.input)
usedPositions = []  # Start positions of the lines already in output

with open(options.input) as input:
    with open(options.output, 'w') as output:

        #Handling of header lines
        for i in range(int(options.header)):
            output.write(input.readline())
            usedPositions.append(input.tell())

        # Find and write all random lines, except last
        for j in range(int(args[0])):
            input.seek(random.randrange(inputSize)) # Seek to random position in file (probably middle of line)
            input.readline() # Read the line (probably incomplete). Next input.readline() results in a complete line.
            while input.tell() in usedPositions: # Take a new line if current one is taken
                input.seek(random.randrange(inputSize))
                input.readline() 
            usedPositions.append(input.tell()) # Add line start position to usedPositions
            randomLine=input.readline() # Complete line
            if len(randomLine) == 0: # Take first line if end of the file is reached
                input.seek(0)
                for i in range(int(options.header)): # Exclude headers
                    input.readline()
                randomLine=input.readline()
            output.write(randomLine)            

This code seems to be working correctly.

I am aware that this code prefers lines that follow the longest lines in the input, because seek() is most likely to return a position inside a long line, and it is the next line that gets written to the output. This is irrelevant, as lines in the input file are roughly the same length. I am also aware that this code results in an infinite loop if N is larger than the number of lines in the input file. I will not implement a check for this, as getting the line count takes a lot of time.

RAM and HDD limitations are irrelevant. I am only concerned about the speed of the program. Is there a way to further optimize this code? Or perhaps there is a better approach?

EDIT: To clarify, the lines in one file have roughly the same length. However, I have multiple files that this script needs to run on, and the average line length will differ between these files. For example, file A may have ~100 characters per line and file B ~50000 characters per line. I do not know the average line length of any file beforehand.

FableBlaze
  • This is more on topic in http://codereview.stackexchange.com/ –  Sep 05 '12 at 10:08
  • At the very least related: [Python random lines from subfolders](http://stackoverflow.com/questions/12128948/12129336#12129336) – Martijn Pieters Sep 05 '12 at 10:20
  • Do you need x random lines *from each file*, or *across all lines in all files*? In other words, do you take 10 lines from file 1, 10 from the file 2, etc, or do you take 5000 random lines from all files together? What do you mean by *"Headers need to appear in the output in the same order."*; you seem to be skipping them. – Martijn Pieters Sep 05 '12 at 10:24
  • The script needs to work with only a single input file, meaning N lines from file F (both given as parameters when calling the script). The input file may contain header lines (their number is also given as a parameter). Let's say I have 3 header lines. Then at the start of the program I just write these three lines to the output. Afterwards I skip them, because I do not want them to appear in the middle of the random lines. – FableBlaze Sep 05 '12 at 11:37

5 Answers


There is only one way to avoid a sequential read of the whole file up to the last line you are sampling - I am surprised that none of the answers so far has mentioned it:

You have to seek to an arbitrary location inside the file and read some bytes; since you have a typical line length, as you said, 3 or 4 times that value should do it. Then split the chunk you read on the newline characters ("\n") and pick the second element - that is a complete line at a random position.

Also, in order to seek into the file consistently, it should be opened in "binary read" mode; the conversion of end-of-line markers then has to be taken care of manually.

This technique can't give you the line number that was read, so you keep the selected line's offset in the file to avoid repetition:

#! /usr/bin/python
# coding: utf-8

import random, os


CHUNK_SIZE = 1000
PATH = "/var/log/cron"

def pick_next_random_line(file, offset):
    file.seek(offset)
    chunk = file.read(CHUNK_SIZE)
    lines = chunk.split(os.linesep)
    # Make some provision in case you had not read at least one full line here
    line_offset = offset + len(os.linesep) + chunk.find(os.linesep) 
    return line_offset, lines[1]

def get_n_random_lines(path, n=5):
    length = os.stat(path).st_size
    results = []
    result_offsets = set()
    with open(path, "rb") as input:  # binary mode, as described above, so seek offsets stay consistent
        for x in range(n):
            while True:
                offset, line = pick_next_random_line(input, random.randint(0, length - CHUNK_SIZE))
                if offset not in result_offsets:
                    result_offsets.add(offset)
                    results.append(line)
                    break
    return results

if __name__ == "__main__":
    print get_n_random_lines(PATH)
jsbueno
  • I love this! I have a similar problem and I was convinced I'd have to read the file twice. – David M Sep 13 '18 at 15:18
  • I could be wrong, but as you've described it I think it would be impossible to randomly select the first line, since you're always grabbing a chunk from your randomly selected position forward, and using the next line. – David M Sep 13 '18 at 15:48
  • it can be balanced and fine-tuned so that the chances of the first and last line being picked are the same - that would not be hard. It'd still be better to funnel the file into an indexed sqlite table, and use sql to know the file size and retrieve random lines. – jsbueno Jul 10 '20 at 13:13

If you need a uniform sample of N lines from your file, you need to know the exact number of lines to pick from; seeking at random doesn't give you this, and longer lines skew the results in favour of the lines directly following them.

Luckily, you only need to read your file once to pick those N lines (this is reservoir sampling). You basically keep the first N lines (in random order), then randomly replace already-picked lines with new ones at a diminishing probability based on the number of lines read.

For N == 1, the nth line read replaces the previous random pick when randint(0, n - 1) < 1, so the second line has a 50% chance of being picked, the third a 33.33% chance, etc. For larger N, replace one of the already-picked lines in your set at random as more lines are read, with the same distribution.

In Python random lines from subfolders, Blkknght wrote a very helpful function for picking a random sample of size N from an iterable:

import random

def random_sample(n, items):
    results = []

    for i, v in enumerate(items):
        r = random.randint(0, i)
        if r < n:
            if i < n:
                results.insert(r, v) # add first n items in random order
            else:
                results[r] = v # at a decreasing rate, replace random items

    if len(results) < n:
        raise ValueError("Sample larger than population.")

    return results

This is trivial to combine with your requirements to preserve a set of headers:

from itertools import islice

with open(options.input) as input:
    with open(options.output, 'w') as output:

        # Handling of header lines
        # Use islice to avoid buffer issues with .readline()
        for line in islice(input, int(options.header)):
            output.write(line)

        # Pick a random sample
        for line in random_sample(int(args[0]), input):
            output.write(line)

This will read your whole file in one go, pick a uniform random sample, and write it out to the output file. Thus, this has Θ(L) complexity, with L being the number of lines in the file.

Martijn Pieters
  • Luckily I can live with somewhat skewed results (at least in the context of my current task). But this seems like the way to go if I needed a uniform sample. +1 from me – FableBlaze Sep 10 '12 at 09:39

I believe it would be faster to randomly choose N line numbers and then go over the file once, line by line, taking the lines whose numbers are in your list. Currently you have to seek to a random place for each random number, so it's O(N*M) where M is the size of the file. What I suggest is O(M).
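
A minimal sketch of this single-pass idea, estimating the line count from the length of the first data line (as suggested in the comments below) instead of counting lines exactly. INPUT, N_HEADER and N_SAMPLE are placeholder parameters:

import os, random

# Placeholder parameters standing in for the script's real options/arguments
INPUT = "data.txt"
N_HEADER = 1      # number of header lines to copy through
N_SAMPLE = 5000   # number of random lines wanted

file_size = os.path.getsize(INPUT)

with open(INPUT) as fin:
    headers = [fin.readline() for _ in range(N_HEADER)]
    first_data_line = fin.readline()

    # Estimate the number of data lines from the length of the first one
    avg_len = max(1, len(first_data_line))
    est_line_count = max(1, (file_size - sum(map(len, headers))) // avg_len)

    # Choose the target line numbers up front (0 = the first data line).
    # On Python 2, xrange(est_line_count) avoids building a large list here.
    wanted = set(random.sample(range(est_line_count), min(N_SAMPLE, est_line_count)))

    sample = [first_data_line] if 0 in wanted else []
    for lineno, line in enumerate(fin, start=1):   # continues after the first data line
        if lineno in wanted:
            sample.append(line)

# If the estimate is too high, fewer than N_SAMPLE lines come back; if it is
# too low, the last few lines can never be chosen (as discussed in the comments).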

zenpoy
  • I do not know the number of lines beforehand, so this approach requires me to loop through the input file twice: first to get the line count, then to print the selected lines. Essentially it reads all the lines twice and on the second pass decides whether a line goes to the output (is in the list of random numbers). I will test it as it does not take long to write, but I doubt that reading all the lines twice will be faster than (ideally) N seeks. – FableBlaze Sep 05 '12 at 12:17
  • Maybe you can estimate the number of lines by dividing the size of the file by the average size of a line (which you said you know); in any case, going over the file twice is still O(M). – zenpoy Sep 05 '12 at 12:24
  • I think I expressed myself a bit poorly about the line length. I know that lines in one file will be roughly the same length. However, the average line length between files A and B may be different (for example 100 and 50000 characters). Edited the question accordingly. – FableBlaze Sep 05 '12 at 13:10
  • Estimating the number of lines may have three outcomes. First, I estimate correctly and the program works as intended. Second, the estimate is too high, which results in selecting a line that does not exist (then I can select a new random line). Third, the estimate is too low and the last line(s) have no chance of being in the output (can do nothing about it?). In any case I will try this approach and see how it goes. – FableBlaze Sep 05 '12 at 13:14
  • So you can read only the first line and use its length. It doesn't really matter, since 1 or 2 extra iterations are negligible compared to the ~5000 iterations you have in your original algorithm. You should just try it and see if it runs faster. – zenpoy Sep 05 '12 at 13:14
  • Tested it with selecting 5000 random rows out of ~60000000. My original algorithm is a lot faster. This actually makes sense if you think about it: my algorithm needs ~5000 iterations, while looping through the file takes ~60 million iterations (one for each line). If I count the lines as well, this adds up to ~120 million iterations. – FableBlaze Sep 05 '12 at 14:32
  • @anti666: You only have to read through the file *once*, see my answer. If you can live with the weighted results (lines directly following longer lines are more likely to be picked), then use your own function. – Martijn Pieters Sep 06 '12 at 08:24

Two quick improvements:

  • Use a set() for your usedPositions variable - lookup will be faster, and since you need to handle up to 10^6 used positions, lookup time is not irrelevant (a sketch applying both suggestions follows this list).
  • Use xrange instead of range in the for loops. Allocating a full list of integers doesn't seem necessary.
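
A minimal sketch of the question's loop with both suggestions applied (Python 2; INPUT, OUTPUT, N_HEADER and N_LINES are placeholders for the question's options and args):

import os, random

# Placeholder parameters standing in for the question's options/args
INPUT, OUTPUT, N_HEADER, N_LINES = "data.txt", "out.txt", 1, 5000

inputSize = os.path.getsize(INPUT)
usedPositions = set()   # set: O(1) membership tests instead of scanning a list

with open(INPUT) as fin, open(OUTPUT, 'w') as fout:
    for _ in xrange(N_HEADER):          # xrange: no full list of integers is built
        fout.write(fin.readline())
        usedPositions.add(fin.tell())

    for _ in xrange(N_LINES):
        fin.seek(random.randrange(inputSize))
        fin.readline()                              # discard the (probably partial) line
        while fin.tell() in usedPositions:          # fast set lookup
            fin.seek(random.randrange(inputSize))
            fin.readline()
        usedPositions.add(fin.tell())
        randomLine = fin.readline()
        if not randomLine:                          # end of file: fall back to the first data line
            fin.seek(0)
            for _ in xrange(N_HEADER):
                fin.readline()
            randomLine = fin.readline()
        fout.write(randomLine)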
Abgan

Untested (and requires reading the file twice):

import random

N = 5000
with open('file.in') as fin:
    line_count = sum(1 for i in fin)
    fin.seek(0)
    to_take = set(random.sample(xrange(line_count), N))
    for lineno, line in enumerate(fin):
        if lineno in to_take:
            pass # use it

However, since you mention that lines are "roughly" the same size, you could use os.path.getsize and divide it by the average line length (whether already known, or sniffed from the first few lines of the file), then use that as line_count - it'd be close enough for a random sample.

You could also mmap the file and use a combination of the file size, the average line length, a best guess of the number of lines, and a random line number to 'seek', and then just search backwards or forwards for the nearest start of a line. (Since mmap lets you treat the file like a string, you can use .find with an offset, or use re if you really wanted to.)
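
A minimal sketch of the mmap idea (FILENAME and N_SAMPLE are placeholders, and header lines are assumed absent): pick a random byte offset, then search backwards and forwards for the newlines that delimit the surrounding line.

import mmap, random

FILENAME = "data.txt"   # placeholder path
N_SAMPLE = 5000         # placeholder sample size

with open(FILENAME, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    size = mm.size()

    seen_starts = set()
    sample = []
    while len(sample) < N_SAMPLE:
        pos = random.randrange(size)
        start = mm.rfind(b"\n", 0, pos) + 1   # rfind gives -1 before the first line, so start becomes 0
        if start in seen_starts:
            continue                          # this line was already taken
        end = mm.find(b"\n", pos)
        if end == -1:
            end = size                        # the last line may lack a trailing newline
        seen_starts.add(start)
        sample.append(mm[start:end])
    mm.close()

Like the question's own approach, this favours lines that directly follow longer lines, which the question says is acceptable.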

Jon Clements