19

I have a file with ~2 billion lines of text (~200 GB). I want to produce a new file containing the same lines, but shuffled randomly by line. I can't hold all the data in memory. Is there a good way to do this in Python or on the command line that takes a reasonable amount of time (a couple of days)?

I was thinking I could touch 50 empty files, stream through the 2-billion-line file and randomly distribute each line to one of the 50 files, then cat the 50 files together. Would there be any major systematic bias to this method?
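For concreteness, a rough sketch of the distribution pass I have in mind (the 50-file count and the file names are arbitrary):

import random

NUM_BUCKETS = 50

# Create the 50 empty bucket files up front.
buckets = [open("bucket_%d.txt" % i, "w") for i in range(NUM_BUCKETS)]

# Stream through the big file once, sending each line to a random bucket.
with open("big_file.txt") as source:
    for line in source:
        random.choice(buckets).write(line)

for bucket in buckets:
    bucket.close()

# Then: cat bucket_*.txt > shuffled.txt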

daemonk
  • 427
  • 6
  • 9
  • Please edit your answer to add this comment, it will be easier to follow. – merours Jun 30 '14 at 14:31
  • 2
    The problem with your 50 files method is that the lines you encounter first in the overall file will end up at the beginning of each of the 50. This means that the shuffle will not be perfect - it will not evenly distribute the elements irrespective of their starting position. – Matthew Franglen Jun 30 '14 at 14:33
  • The best thing I can think of is to actually split the entire thing down to single lines, and then do an arbitrary merge sort to recombine to a file, but that would break the disk inode limit and would involve plenty of disk IO (probably breaking the time limit). – Matthew Franglen Jun 30 '14 at 14:35
  • I'd take it one step further and randomize the order you read in the intermediary files, but that sounds like a fairly decent approach to me. – g.d.d.c Jun 30 '14 at 14:36
  • Can you hold range(2 000 000 000) in memory? If so, I would shuffle it (via the random module, for instance), then loop over this new list to print line after line. The thing is for this step a direct access to each line would be suitable. Maybe cut the big file into smaller files before? – user189 Jun 30 '14 at 14:36
  • Does it need to be python? I would think there would be some options with standard unix commands (assuming you are on a unix system). – brechmos Jun 30 '14 at 14:37
  • Also ck: http://stackoverflow.com/questions/12279017/python-random-n-lines-from-large-file-no-duplicate-lines – brechmos Jun 30 '14 at 14:38
  • Holding a range of 2 billion requires at least a 32bit int per number. That means 4 bytes per number, which leads to about 8 gig of memory. More feasible. – Matthew Franglen Jun 30 '14 at 14:38
  • Hmm, maybe get the memory to hold the shuffled range(2billion). Then write a function `getline` which enumerates the file like this http://stackoverflow.com/questions/2081836/reading-specific-lines-only-python and returns the next line from the shuffle, then write it to the output file and flush the internal buffer once in a while? – timgeb Jun 30 '14 at 14:49
  • @user189 **Anyone** can hold `range(2000000000)` in memory, because `range` takes O(1) space. It's `list(range(2000000000))` that may cause troubles. – Bakuriu Aug 07 '16 at 10:42
  • @Bakuriu O(1) might not be the right terminology here, but if i were to guess your point, i'd say it depends on your Python version. – user189 Aug 08 '16 at 20:15

8 Answers

8

If you can reserve 16 GB of memory for this program, I wrote a program called sample that shuffles the lines of a file by reading in their byte offsets, shuffling the offsets, and then printing output by seeking through the file to the shuffled offsets. It uses 8 bytes for each 64-bit offset, thus 16 GB for a two-billion-line input.

It won't be fast, but on a system with enough memory, sample will shuffle files that are large enough to cause GNU shuf to fail. Further, it uses mmap routines to try to minimize the I/O expense of a second pass through your file. It also has a few other options; see --help for more details.

By default, this program will sample without replacement and shuffle by single lines. If you want to shuffle with replacement, or if your input is in FASTA, FASTQ or another multi-line format, you can add some options to adjust how sampling is done. (Or you can apply an alternative approach, which I link to in a Perl gist below, but sample addresses these cases.)

If your FASTA sequences take two lines each, that is, they alternate between a sequence header on one line and sequence data on the next, you can still shuffle with sample, and with half the memory, since you are only shuffling half the number of offsets. See the --lines-per-offset option; you'd specify 2, for instance, to shuffle pairs of lines.

In the case of FASTQ files, each record spans four lines. You can specify --lines-per-offset=4 to shuffle a FASTQ file with a fourth of the memory required to shuffle a single-line file.
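If you wanted to roll this grouping yourself in Python rather than use sample, the idea is simply to record one byte offset per record instead of one per line. A rough sketch (this is not sample's code; the function names are illustrative):

import random

def record_offsets(path, lines_per_offset=4):
    # Collect the byte offset of the start of every record, where a record
    # spans lines_per_offset lines (4 for FASTQ, 2 for two-line FASTA).
    offsets = []
    with open(path, "rb") as f:
        pos = 0
        for line_no, line in enumerate(f):
            if line_no % lines_per_offset == 0:
                offsets.append(pos)
            pos += len(line)
    return offsets

def write_shuffled(path, out_path, offsets, lines_per_offset=4):
    # Shuffle the record offsets, then seek to each record and copy it out.
    random.shuffle(offsets)
    with open(path, "rb") as f, open(out_path, "wb") as out:
        for off in offsets:
            f.seek(off)
            for _ in range(lines_per_offset):
                out.write(f.readline())

For two billion lines of FASTQ that is 500 million offsets, roughly 4 GB as a packed 64-bit array (a plain Python list of ints will take several times that).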

Alternatively, I have a gist here written in Perl, which will sample sequences without replacement from a FASTA file without regard for the number of lines in a sequence. Note that this isn't exactly the same as shuffling a whole file, but you could use it as a starting point, since it collects the offsets. Instead of sampling some of the offsets, you'd remove line 47, which sorts the shuffled indices, then use file seek operations to read through the file, using the shuffled-index list directly.

Again, it won't be fast, because you are jumping through a very large file out of order, but storing offsets is much less expensive than storing whole lines, and adding mmap routines could help a little with what is essentially a series of random access operations. And if you are working with FASTA, you'll have still fewer offsets to store, so your memory usage (excepting any relatively insignificant container and program overhead) should be at most 8 GB — and likely less, depending on its structure.

Alex Reynolds
  • 95,983
  • 54
  • 240
  • 345
  • Hey Alex. I think I've seen your posts on BioStar. I am looking to shuffle ~2 billion reads (fasta, not fastq). My server has 64gig of ram, so I can reserve 16gb for this. Thanks for the sampling software. – daemonk Jun 30 '14 at 14:57
  • Please see the edit to my answer, which touches on shuffling FASTA. – Alex Reynolds Jun 30 '14 at 15:05
  • I have a question about your reservoir sampling implementation. When using Algorithm R, if the sample size were the same size as the total population, everything is sampled, but the order of the samples is not shuffled. For your implementation of reservoir sampling to be capable of shuffling the entire file (thus randomly sampling the entire file), it must not be using Algorithm R's initial fill procedure, right? – CMCDragonkai Apr 13 '18 at 01:50
7

How about:

import mmap
from random import shuffle

def find_lines(data):
    # Yield the offset of each newline; mmap.find is much faster than
    # scanning the map byte by byte in Python.
    pos = data.find(b'\n')
    while pos != -1:
        yield pos
        pos = data.find(b'\n', pos + 1)

def shuffle_file(in_file, out_file):
    with open(in_file, 'rb') as f:
        data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        start = 0
        lines = []
        for end in find_lines(data):
            lines.append((start, end))
            start = end + 1
        shuffle(lines)

        with open(out_file, 'wb') as out:
            for start, end in lines:
                out.write(data[start:end + 1])

if __name__ == "__main__":
    shuffle_file('data', 'result')

This solution only ever stores the file offsets of the lines, that is, two words per line, plus container overhead.

Jamie Cockburn
  • 7,379
  • 1
  • 24
  • 37
  • I'm wondering if it is possible to use Fisher–Yates on a memory-mapped data structure (like a numpy array) to shuffle an out-of-memory file? Just do the exchange directly on the file. The OS should take care of paging memory in and out. Suppose you want to not destructively edit the input file. One could just use the variant here: https://en.wikipedia.org/wiki/Reservoir_sampling#Relation_to_Fisher-Yates_shuffle – CMCDragonkai Apr 13 '18 at 01:56
5

You may check my HugeFileProcessor tool. It's similar to @Alex-Reynolds's sample, but should be significantly faster, as there would be no seeks.

Here are the details of the shuffling implementation. It requires specifying batchSize - the number of lines to keep in RAM when writing to the output. The more the better (unless you are out of RAM), because the total shuffling time is roughly (number of lines in sourceFile) / batchSize * (time to fully read sourceFile). For example, with 2 billion lines and a batchSize of 10 million, that is about 200 full passes over the source file. Please note that the program shuffles the whole file, not on a per-batch basis.

The algorithm is as follows.

  1. Count lines in sourceFile. This is done simply by reading the whole file line-by-line. (See some comparisons here.) This also gives a measurement of how much time it would take to read the whole file once, so we can estimate how long a complete shuffle would take, because it requires Ceil(linesCount / batchSize) complete file reads.

  2. As we now know the total linesCount, we can create an index array of linesCount size and shuffle it using Fisher–Yates (called orderArray in the code). This gives us the order in which we want the lines to appear in the shuffled file. Note that this is a global order over the whole file, not per batch or chunk.

  3. Now the actual code. We need to get all lines from sourceFile in the order we just computed, but we can't read the whole file into memory. So we just split the task.

    • We go through sourceFile, reading all lines but storing in memory only those lines whose positions fall within the first batchSize entries of orderArray. When we have all these lines, we write them into outFile in the required order, and that's batchSize/linesCount of the work done.
    • Next we repeat the whole process again and again, taking the next part of orderArray and reading sourceFile from start to end for each part. Eventually the whole orderArray is processed and we are done (a Python sketch of this loop follows below).
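If you would rather see the idea in Python than read the C# source, here is a rough sketch of the same multi-pass loop (the function and variable names are illustrative, not HugeFileProcessor's actual API):

import random

def multipass_shuffle(source_path, out_path, batch_size):
    # Pass 0: count lines so we can build the global order array.
    with open(source_path) as f:
        lines_count = sum(1 for _ in f)

    # Global shuffled order over the whole file (random.shuffle is Fisher-Yates).
    order = list(range(lines_count))
    random.shuffle(order)

    with open(out_path, "w") as out:
        # One full sequential read of the source per batch of the order array.
        for batch_start in range(0, lines_count, batch_size):
            batch = order[batch_start:batch_start + batch_size]
            wanted = {line_no: slot for slot, line_no in enumerate(batch)}
            buffer = [None] * len(batch)
            with open(source_path) as f:
                for line_no, line in enumerate(f):
                    slot = wanted.get(line_no)
                    if slot is not None:
                        buffer[slot] = line
            out.writelines(buffer)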

Why does it work?

Because all we do is read the source file from start to end. No seeking forward or backward, and that's what HDDs like. The file gets read in chunks according to internal HDD buffers, FS blocks, CPU cache, etc., and everything is read sequentially.

Some numbers

On my machine (Core i5, 16 GB RAM, Win8.1, HDD Toshiba DT01ACA200 2TB, NTFS) I was able to shuffle a file of 132 GB (84,000,000 lines) in around 5 hours using a batchSize of 3,500,000. With a batchSize of 2,000,000 it took around 8 hours. Reading speed was around 118,000 lines per second.

Mikhail
  • 1,223
  • 14
  • 25
  • This tool would be interesting if it wasn't dependent on Windows. – Bob May 02 '17 at 16:01
  • It is not dependent on Windows, it's dependent on .Net. So just use Mono on Linux. I've tested it under a number of Mono versions, and it's working fine. Actually I'm using it under Mono. – Mikhail May 03 '17 at 03:20
2

I think the simplest approach in your case is a recursive split-shuffle-merge. You define two numbers: the number of files you split each file into, N (typically between 32 and 256), and the size M at which you can shuffle directly in memory (typically about 128 MB). Then, in pseudocode:

def big_shuffle(file):
    if size_of(file) < M:
        memory_shuffle(file)
    else:
        create N files
        for line in file:
            write_randomly_to_one_of_the_N_files
        for sub_file in N_files:
            big_shuffle(sub_file)
        merge_the_N_files_one_line_each

As each of the sub-files is shuffled, you should have no bias.

It will be far slower than Alex Reynolds' solution (because of a lot of disk I/O), but your only limit is disk space.

Serge Ballesta
  • 143,923
  • 11
  • 122
  • 252
0

You could make an object that gives a permutation of line indices. You offset your read into the file by the index it gives. Because it is a permutation, you will never read the same data twice.

All the permutations of a set of N elements can be generated by transpositions that swap the 0th and the ith element (assuming indexing from 0) and leave all other elements in place, so you can build a random permutation by composing some randomly chosen transpositions of this kind. Here's an example written in Python:

import random

class Transposer:
    def __init__(self,i):
        """
        (Indexes start at 0)
        Swap 0th index and ith index, otherwise identity mapping.
        """
        self.i = i
    def map(self,x):
        if x == 0:
            return self.i
        if x == self.i:
            return 0
        return x

class RandomPermuter:
    def __init__(self,n_gens,n):
        """
        Picks n_gens integers in [0,n) to make transposers that, when composed,
        form a permutation of a set of n elements. Note that two consecutive
        equal draws cancel each other out. We could keep drawing numbers until
        we have n_gens unique numbers... but we don't for this demo.
        """
        gen_is = [random.randint(0,n-1) for _ in range(n_gens)]
        self.trans = [Transposer(g) for g in gen_is]
    def map(self,x):
        for t in self.trans:
            x = t.map(x)
        return x

rp = RandomPermuter(10,10)

# Use these numbers to seek into a file
print(*[rp.map(x) for x in range(10)])
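As a hypothetical way to wire this up (not part of the demo above): collect the byte offset of each line in one pass, then use the permuter to decide which line to seek to next.

# Hypothetical glue code: use RandomPermuter over a list of line offsets.
def shuffled_copy(in_path, out_path, n_transpositions):
    offsets = []
    with open(in_path, "rb") as f:
        pos = 0
        for line in f:          # one pass to record where each line starts
            offsets.append(pos)
            pos += len(line)

    rp = RandomPermuter(n_transpositions, len(offsets))
    with open(in_path, "rb") as f, open(out_path, "wb") as out:
        for i in range(len(offsets)):
            f.seek(offsets[rp.map(i)])   # jump to the permuted line
            out.write(f.readline())

Keep in mind that composing only a few transpositions leaves most indices where they started, so you would need a very large number of them (or a different construction) before this resembles a uniform shuffle.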
mondaugen
  • 425
  • 5
  • 12
0

I had to solve the above problem to shuffle a massive text file. The script below distributes the lines across a large number of smaller files, then shuffles each of those in memory. No large object is built up between opening the original file and writing to the new files, so the script doesn't use much RAM, and the big file is only iterated over once instead of multiple times. Once these smaller randomized files are made, recombining them is simple: just read each file into the new file. Python code:

import random
import io
from tqdm import tqdm

file_in = "file\\to\\randomize"
file_out = "base\\path\\to\\place\\randomized\\files\\"
files_out = []

NUM_OF_FILES = 1_000

# Open all of the smaller output files up front.
for i in range(NUM_OF_FILES):
    f_ = file_out + str(i)
    files_out.append(io.open(f_, 'w', encoding='utf-8'))

# Single pass over the big file: send each line to a random output file.
with io.open(file_in, 'r', encoding='utf-8') as source:
    for line in tqdm(source):
        files_out[random.randint(0, NUM_OF_FILES - 1)].write(line)
    for i in range(NUM_OF_FILES):
        files_out[i].close()

# Shuffle each smaller file in memory and write it back in place.
for i in range(NUM_OF_FILES):
    f_ = file_out + str(i)
    with io.open(f_, 'r', encoding='utf-8') as file:
        data = [line for line in tqdm(file)]
    random.shuffle(data)
    with io.open(f_, 'w', encoding='utf-8') as file:
        for line in tqdm(data):
            file.write(line)
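The final recombination step isn't shown above; continuing the same script, it might look like this (the combined file name is a placeholder):

# Recombine the shuffled pieces into one output file.
with io.open(file_out + "randomized_combined.txt", 'w', encoding='utf-8') as out:
    for i in range(NUM_OF_FILES):
        with io.open(file_out + str(i), 'r', encoding='utf-8') as piece:
            for line in piece:
                out.write(line)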
Adam Boyle
  • 99
  • 1
  • 5
0

This seems to be the same question as How can I shuffle a very large list stored in a file in Python?

If you can use Java or are willing to translate some code, I suggest the solution using ImmutableList from https://tracinsy.ewi.tudelft.nl/pubtrac/Utilities/wiki/utilities. If your original file allows random access (so that you can get item N), then you don't even need to create that second shuffled file.

0

I wrote a version similar to @serge-ballesta's suggestion:

import tempfile
import os
import random

def shuffle(filename_in, filename_out, memory_limit, file_split_count, 
            depth=0, debug=False):
    if os.path.getsize(filename_in) < memory_limit:
        if debug: print(" " * depth, f"Level {depth + 1}",
            "Shuffle in memory...")
        shuffle_in_memory(filename_in, filename_out)
    else:
        if debug: print(
            " " * depth, f"Level {depth + 1}",
            f"{os.path.getsize(filename_in)} is too big;",
            f"Split into {file_split_count} files..."
        )
        # Split the big file into smaller files
        temp_files = [tempfile.NamedTemporaryFile('w+', delete=False)
                      for i in range(file_split_count)]
        for line in open(filename_in):
            random_index = random.randint(0, len(temp_files) - 1)
            temp_files[random_index].write(line)

        # Now we shuffle each smaller file
        for temp_file in temp_files:
            temp_file.close()
            shuffle(temp_file.name, temp_file.name, memory_limit, 
                    file_split_count, depth+1, debug)

        # And merge back in place of the original
        if debug: print(" " * depth, f"Level {depth + 1}", 
            "Merge files...")
        merge_files(temp_files, filename_out)

This works very similarly to the merge-sort algorithm. You'll also need shuffle_in_memory and merge_files:

def shuffle_in_memory(filename_in, filename_out):
    # Shuffle a file, line-by-line
    with open(filename_in) as fp:
        lines = fp.readlines()
    # Randomize them in place:
    random.shuffle(lines)
    # Write the new order out:
    with open(filename_out, "w") as fp:
        fp.writelines(lines)

def merge_files(temp_files, filename_out):
    with open(filename_out, "w") as fp_out:
        for temp_file in temp_files:
            with open(temp_file.name) as fp:
                for line in fp:
                    fp_out.write(line)
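For example, a call might look like this (the memory limit and split count here are arbitrary illustrations, not recommendations):

# Shuffle big.txt into shuffled.txt, shuffling in memory once a piece is
# under ~128 MB and splitting larger files into 32 temporary pieces.
shuffle("big.txt", "shuffled.txt",
        memory_limit=128_000_000,
        file_split_count=32,
        debug=True)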

I wrote this up in an article where I also explored other methods.

Doug Blank
  • 2,031
  • 18
  • 36