Python random lines from subfolders

Question

I have many tasks in .txt files in multiple sub folders. I am trying to pick up a total 10 tasks randomly from these folders, their contained files and finally a text line within a file. The selected line should be deleted or marked so it will be not picked in the next execution. This may be too broad a question but I'd appreciate any input or direction.

Here's the code I have so far:

#!/usr/bin/python  
import random   
with open('C:\\Tasks\\file.txt') as f:  
    lines = random.sample(f.readlines(),10)    
print(lines)

Do you want 10 random lines from every file or 10 lines _in total_? — Rob Cowie, Aug 26 '12 at 09:22
Are lines in these files unique? Do you expect lines/files to be added between runs? Do these files contain tens or millions of lines? — Henk Langeveld, Aug 26 '12 at 09:31
possible duplicate of [how do i create a LIST of unique random numbers?](http://stackoverflow.com/questions/9755538/how-do-i-create-a-list-of-unique-random-numbers) — , Aug 26 '12 at 09:37
yes, lines in these files is unique. no, i don't expect lines/files to be added between runs. no, files not contain tens or millions of lines. but may be around 1000 ~ 2000 lines. Thank you.! — user1582596, Aug 26 '12 at 09:41
have you considered moving the data into a real database such as SQLite instead using a file system as an adhoc db? — jfs, Aug 27 '12 at 06:35
@J.F. Sebastian, Thank you for the input, i've considered MySQL, but it makes for me too hard to express what i am trying to solve. — user1582596, Aug 27 '12 at 08:12
@user1582596: [Simple Random Samples from a (My)Sql database](http://stackoverflow.com/questions/249301/simple-random-samples-from-a-mysql-database) — jfs, Aug 28 '12 at 02:30

score 15 · Answer 1 · edited May 23 '17 at 12:34

15

Here's a simple solution that makes just one pass through the files per sample. If you know exactly how many items you will be sampling from the files, it is probably optimal.

First off is the sample function. This uses the same algorithm that @NedBatchelder linked to in a comment on an earlier answer (though the Perl code shown there only selected a single line, rather than several). It selects values from of an iterable of lines, and only requires the currently selected lines to be kept in memory at any given time (plus the next candidate line). It raises a ValueError if the iterable has fewer values than the requested sample size.

import random

def random_sample(n, items):
    results = []

    for i, v in enumerate(items):
        r = random.randint(0, i)
        if r < n:
            if i < n:
                results.insert(r, v) # add first n items in random order
            else:
                results[r] = v # at a decreasing rate, replace random items

    if len(results) < n:
        raise ValueError("Sample larger than population.")

    return results

edit: In another question, user @DzinX noticed that the use of insert in this code makes the performance bad (O(N^2)) if you're sampling a very large number of values. His improved version which avoids that issue is here. /edit

Now we just need to make a suitable iterable of items for our function to sample from. Here's how I'd do it using a generator. This code will only keep one file open at a time, and it does not need more than one line in memory at a time. The optional exclude parameter, if present, should be a set containing lines that have been selected on a previous run (and so should not be yielded again).

import os

def lines_generator(base_folder, exclude = None):
    for dirpath, dirs, files in os.walk(base_folder):
        for filename in files:
            if filename.endswith(".txt"):
                fullPath = os.path.join(dirpath, filename)
                with open(fullPath) as f:
                     for line in f:
                         cleanLine = line.strip()
                         if exclude is None or cleanLine not in exclude:
                             yield cleanLine

Now, we just need a wrapper function to tie those two pieces together (and manage a set of seen lines). It can return a single sample of size n or a list of count samples, taking advantage of the fact that a slice from a random sample is also a random sample.

_seen = set()

def get_sample(n, count = None):
    base_folder = r"C:\Tasks"
    if count is None:
        sample = random_sample(n, lines_generator(base_folder, _seen))
        _seen.update(sample)
        return sample
    else:
        sample = random_sample(count * n, lines_generator(base_folder, _seen))
        _seen.update(sample)
        return [sample[i * n:(i + 1) * n] for i in range(count)]

Here's how it can be used:

def main():
    s1 = get_sample(10)
    print("Sample1:", *s1, sep="\n")

    s2, s3 = get_sample(10,2) # get two samples with only one read of the files
    print("\nSample2:", *s2, sep="\n")
    print("\nSample3:", *s3, sep="\n")

    s4 = get_sample(5000) # this will probably raise a ValueError!

edited May 23 '17 at 12:34

Community

1
1

answered Aug 26 '12 at 23:28

Blckknght

100,903
11
120
169

you could write: `(letter for word in sentence for letter in word if good(letter))` instead of `chain.from_iterable((for letter in word if good(letter)) for word in sentence)` – jfs Aug 27 '12 at 05:54
Hmm, you're right. I think I started using `chain.from_iter` when I was trying something different and it's unnecessary in the version I ended up with and posted. A straight generator expression is clearer, so I'll try that instead (I think it will also save me a line, since I won't need to strip the lines separately). – Blckknght Aug 27 '12 at 06:09
1

You could also write explicit for-loops and `yield line` in the `task_pipeline()`. It should produce the most readable version. In addition it is natural to use `with open(filename) as file:` in this case (you want this if the tree contains large number of txt files to avoid "Too many open files" error) – jfs Aug 27 '12 at 06:28
@MartijnPieters: You're missing the `if r < n` check on an earlier line. That represents the decreasing probability of a replacement happening after you've got the first n values. You're right that it's possible that the algorithm will return less than n values, but that will only happen if there are fewer than n values in the items iterable (it will return them all, in random order). – Blckknght Aug 27 '12 at 07:17
Right, indeed, the `r < n` will prevent IndexErrors, I missed that. :-P Retracted both comments. – Martijn Pieters Aug 27 '12 at 07:19
If you were to store the line number with the line (`(i, v)`) you can retrieve these and skip these lines in a future run. – Martijn Pieters Aug 27 '12 at 07:21
@J.F.Sebastian: Yeah, the explicitly looping version might be easier, but I had my heart set on generator expressions. They may not be ideal. Regarding avoiding duplicates, one solution is simply to simply take a larger sample at the start and use slices of it. Because the sampled values are in random order, all subsets are also random samples. Of course, you need to know exactly how many items you'll need in total for this to work. – Blckknght Aug 27 '12 at 07:43
@Blckknght, i am getting error for the colon in code, sample = random_sample(10, task_pipeline(base_folder)): – user1582596 Aug 27 '12 at 12:05
@DavidFraser: Yup, if you input less than n lines.. Otherwise it'll produce exactly n items. – Martijn Pieters Aug 27 '12 at 15:43
Thanks for the feedback everyone. I've updated the answer with a bunch of changes, including making the "pipeline" function into an explicit generator rather than using a bunch of generator expressions (I also renamed it). I've also added some code to avoid duplication if several samples are requested in a row, but this part is probably much slower than the basic sample algorithm. If you know in advance how many samples are going to be needed, get them all at once for the best performance! I also made the sample function raise a ValueError if it doesn't get enough values. – Blckknght Aug 27 '12 at 19:37
@user1582596. Yep, that was a typo. Sorry about that. It should be gone from the new code. – Blckknght Aug 27 '12 at 19:38
On Python 3.3: `with open(os.path.join(dirpath, filename)) as file: yield from file` Note: It removes `line.strip()`, the `exclude` check. – jfs Aug 27 '12 at 23:02
@Blckknght, 10x, i was trying to make this work but somehow i am in a stat where there is no errors nor output. trying to debug ... – user1582596 Aug 28 '12 at 09:35

score 4 · Accepted Answer · edited May 23 '17 at 10:24

To get a proper random distribution across all these files, you'd need to view them as one big set of lines and pick 10 at random. In other words, you'll have to read all these files at least once to at least figure out how many lines you have.

You do not need to hold all the lines in memory however. You'd have to do this in two phases: index your files to count the number of lines in each, then pick 10 random lines to be read from these files.

First indexing:

import os

root_path = r'C:\Tasks\\'
total_lines = 0
file_indices = dict()

# Based on https://stackoverflow.com/q/845058, bufcount function
def linecount(filename, buf_size=1024*1024):
    with open(filename) as f:
        return sum(buf.count('\n') for buf in iter(lambda: f.read(buf_size), ''))

for dirpath, dirnames, filenames in os.walk(root_path):
    for filename in filenames:
         if not filename.endswith('.txt'):
             continue
         path = os.path.join(dirpath, filename)
         file_indices[total_lines] = path
         total_lines += linecount(path)

offsets = list(file_indices.keys())
offsets.sort()

Now we have a mapping of offsets, pointing to filenames, and a total line count. Now we pick ten random indices, and read these from your files:

import random
import bisect

tasks = list(range(total_lines))
task_indices = random.sample(tasks, 10)

for index in task_indices:
     # find the closest file index
     file_index = offsets[bisect.bisect(offsets, index) - 1]
     path = file_indices[file_index]
     curr_line = file_index
     with open(path) as f:
         while curr_line <= index:
             task = f.readline()
             curr_line += 1
     print(task)
     tasks.remove(index)

Note that you only need the indexing once; you can store the result somewhere and only update it when your files update.

Also note that your tasks are now 'stored' in the tasks list; these are indices to lines in your files, and I remove the index from that variable when printing the task selected. Next time you run the random.sample() choices, the tasks previously picked will no longer be available for picking the next time. This structure will need updating if your files ever do change, as the indexes have to be re-calculated. The file_indices will help you with that task, but that is outside the scope of this answer. :-)

If you need only one 10-item sample, use Blckknght's solution instead, as it only will go through the files once, while mine require 10 extra file openings. If you need multiple samples, this solution only requires 10 extra file openings every time you need your sample, it won't scan through all the files again. If you have fewer than 10 files, still use Blckknght's answer. :-)

Thank you, while indexing, got following error. Traceback (most recent call last): File "", line 1, in AttributeError: 'dict_keys' object has no attribute 'sort'. btw, i am trying this with Python 3.2.3 — user1582596, Aug 26 '12 at 13:15
@user1582596: Ah, important distinction, I've updated the code for you now. — Martijn Pieters, Aug 26 '12 at 13:20
You don't actually need to know how many total lines there are to choose 10 randomly. You can choose one line at random by reducing the probability for each line that it is the one you keep: http://www.perlmonks.org/?node_id=1910 . For N lines, you keep a list of N, and for each new line, reduce the probability that you keep it: http://www.perlmonks.org/?node_id=1910 (sorry for all the Perl). — Ned Batchelder, Aug 26 '12 at 13:29
@NedBatchelder: Glad to see that that method still requires you to read through all the files though. :-P Reading between the lines though I am pretty certain the OP wants to pick a random 10 tasks more than once. In my setup you only need to scan the files once, then pick out samples as needed. — Martijn Pieters, Aug 26 '12 at 13:42
@MartijnPieters: yeah, the OP was a bit vague about "removing". The random line-from-file method is a uniform distribution, that's the interesting thing about the technique. — Ned Batchelder, Aug 26 '12 at 19:13

Rob Cowie · Answer 3 · 2012-08-27T11:16:49.363

0

EDIT: On closer scrutiny this answer does not fit the bill. Reworking it led me to the reservoir sampling algorithm, which @Blckknght used in his answer. So ignore this answer.

Few ways of doing it. Here's one...

Get a list of all task files
Select one at random
Select a single line from that file at random
Repeat until we have the desired number of lines

The code...

import os
import random

def file_iterator(top_dir):
    """Gather all task files"""
    files = []
    for dirpath, dirnames, filenames in os.walk(top_dir):
        for filename in filenames:
            if not filename.endswith('.txt'):
                continue
            path = os.path.join(dirpath, filename)
            files.append(path)
    return files


def random_lines(files, number=10):
    """Select a random file, select a random line until we have enough
    """
    selected_tasks = []

    while len(selected_tasks) < number:
        f = random.choice(files)
        with open(f) as tasks:
            lines = tasks.readlines()
            l = random.choice(lines)
            selected_tasks.append(l)
    return selected_tasks


## Usage
files = file_iterator(r'C:\\Tasks')
random_tasks = random_lines(files)

edited Aug 27 '12 at 11:16

answered Aug 26 '12 at 19:25

Rob Cowie

22,259
6
62
56

This can lead to duplicate choices, and I doubt the distribution of the sample will be uniform. How do you remember or remove selected tasks on a future run? From the OP: *The selected line should be deleted or marked so it will be not picked in the next execution.* – Martijn Pieters Aug 26 '12 at 20:49
Doh, I should have read more carefully. Bit late to modify my answer now. I'll get to it tomorrow. I suspect a simple solution is to turn the lines list into a set – Rob Cowie Aug 26 '12 at 23:21
10x, @Martijn Pieters, got the following errors,Traceback (most recent call last): File "C:\Dropbox\Python\testr1.py", line 31, in files = file_iterator(r'C:\\Dropbox\\ans7i\\') File "C:\Dropbox\Python\testr1.py", line 11, in file_iterator path = os.path.join(dirpath, filename) UnboundLocalError: local variable 'filename' referenced before assignment – user1582596 Aug 27 '12 at 08:01
Those two lines need to be indented one more level; I'll fix them. – Martijn Pieters Aug 27 '12 at 08:06
@Martijn Pieters, so wonderful to see this in action, wondering is there any quick way to add folder, filename hierarchy as prefix so easy to find the task where it's coming from. Example [Task][Do1][Do2][DL.txt] [task]; [Task][Do3][Do5][DL20.txt] [task] Also added statement “print (random_tasks) “ but output is appearing as one paragraph and makes a bit unreadable. – user1582596 Aug 27 '12 at 08:30
You have the path in `dirpath` and you can loop over `random_tasks` to print individual items. – Martijn Pieters Aug 27 '12 at 08:31
@Martijn Pieters, couldn't able to figureout yet, wondering if you inline the code. TIA. i also need some the help so each task will start with new line. – user1582596 Aug 27 '12 at 10:03
Meh, this answer isn't really useful. Blckknght and Martijn both provided better answers. Look to those answers, not this one. – Rob Cowie Aug 27 '12 at 11:15
10x @RobCowie, while i am considering all other option as possible answer. i am equally curious to know how dirpath can be implement in this case as i spent time to solve by myself and i couldn't, also it would be good learning experience. TIA. – user1582596 Aug 27 '12 at 15:21
Sorry, I'm not sure what you mean by 'dirpath'. Please explain further. – Rob Cowie Aug 27 '12 at 21:52
@RobCowie, 10x for the time, currently, this code helps me to pullout text lines but wondering how could i include directory path (including filename) as preifx. so it will helpful to determine the where the task is coming from. As Example [Task][Do1][Do2][DL.txt] [task1]; [Task][Do3][Do5][DL20.txt] [task2] Also added statement “print (random_tasks) “ but output is appearing as in a paragraph and makes a bit unreadable. I also need some the help so each task will start with new line. – user1582596 Aug 28 '12 at 09:33

Python random lines from subfolders

3 Answers3

Linked