
What's a good algorithm to shuffle the lines in a file without reading the whole file in advance?

I guess it would look something like this: Start reading the file line by line from the beginning, storing each line as you go, and at each point decide whether to print one of the lines stored so far (and then remove it from storage) or to do nothing and move on to the next line.
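
In rough, untested Python, I imagine something like this (the buffer size and the 50% emission probability are arbitrary placeholders; I don't know what they would have to be for the result to be uniformly shuffled):

import random
import sys

BUFFER_SIZE = 1000  # arbitrary cap on how many lines are held back

buffered = []
for line in sys.stdin:
    buffered.append(line)
    # Either emit one of the buffered lines now (and drop it from storage)
    # or do nothing and move on to the next input line.
    if len(buffered) > BUFFER_SIZE or random.random() < 0.5:
        sys.stdout.write(buffered.pop(random.randrange(len(buffered))))

# Whatever is still buffered at the end goes out in random order.
random.shuffle(buffered)
sys.stdout.write("".join(buffered))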

Can someone verify / prove this and/or maybe post working (perl, python, etc.) code?

Related questions, but not looking at memory-efficient algorithms:

Frank
  • Are all of the lines the same size? – torak Jul 29 '10 at 23:39
  • That's not going to produce very random output, though - it'll guarantee that the first few lines of the "shuffled" file will all come from the first few lines of the original file. Is the requirement *really* to begin shuffling before you've read the whole file? Or do you just want to avoid reading the entire file into memory at once (so scanning the file is OK as long as you don't keep the whole thing in memory at once)? – Dean Harding Jul 29 '10 at 23:39
  • @torak: No, the lines have different lengths. – Frank Jul 29 '10 at 23:41
  • @Dean: Well, that depends on the criterion used to decide to print one of the previously stored ones or wait until later. That probability has to be such that the bug you mentioned will not occur. And yeah, the point is to avoid having to keep the whole file in memory at once. (I'd say it is acceptable if the algorithm indeed has to read the whole file into memory in some pathological cases that occur with very low probability.) – Frank Jul 29 '10 at 23:44
  • In that case aren't you going to need to read lines 1 to (n - 1) before you read line n? – torak Jul 29 '10 at 23:44
  • @dehmann: Do you know the number of lines in the file? – Jacob Jul 29 '10 at 23:45
  • @Jacob: No, the number is not known in advance. – Frank Jul 29 '10 at 23:46
  • @torak: Yes, lines 1 to (n-1) have to be read before line n can be read. – Frank Jul 29 '10 at 23:47
  • 1
    Without knowing the number of lines I don't think its realistically doable. Without that knowledge how can you determine the probability that a line occupies a given position. – torak Jul 29 '10 at 23:56
  • You could use memory-mapped files. – SigTerm Jul 30 '10 at 00:05
  • @dehmann: I've got a solution which I think will work. – Jacob Jul 30 '10 at 00:26
  • Do you need to output the entire file, or just a random, shuffled subset of it? – Nick Johnson Jul 30 '10 at 10:17
  • Not a subset, I need to output the whole shuffled file. – Frank Jul 30 '10 at 14:20

3 Answers


I cannot think of a way to shuffle the entire file randomly without somehow maintaining a list of what has already been written. I think if I had to do a memory-efficient shuffle, I would scan the file, building a list of the offsets at which each line starts. Once I have this list of offsets, I would randomly pick one of them, write the corresponding line to stdout, and then remove it from the list of offsets.

I am not familiar with Perl or Python, but I can demonstrate with PHP.

<?php
// Scan the file once, recording the byte offset of the start of each line.
$offsets = array();

$f = fopen("file.txt", "r");
$offsets[] = ftell($f);
while (! feof($f))
{
  if (fgetc($f) == "\n") $offsets[] = ftell($f);
}

// Shuffle the offsets, then seek to each one and print the line that starts there.
shuffle($offsets);
foreach ($offsets as $offset)
{
  fseek($f, $offset);
  $line = fgets($f);
  if ($line !== false) echo $line; // skip the offset recorded at EOF when the file ends with a newline
}
fclose($f);
?>

The only other option I can think of, if scanning the file for newlines is absolutely unacceptable, would be the following (I am not going to code this one out):

  1. Determine the filesize
  2. Create a list of offsets and lengths already written to stdout
  3. Loop until bytes_written == filesize
  4. Seek to a random offset that is not already in your list of already written values
  5. Back up from that seek to the previous newline or start of file
  6. Display that line, and add it to the list of offsets and lengths written
  7. Go to 3.
Brandon Horsley

The following algorithm is linear in the number of lines in your input file.

Preprocessing:

  1. Find n (the total number of lines) by scanning for newlines (or whatever), and store the character positions marking the beginning and end of each line. So you'd have two vectors, say s and e, of size n, where the characters from s[i] to e[i] in your input file make up the i-th line. In C++ I'd use vector.

  2. Randomly permute a vector of integers from 1 to n (in C++ it would be random_shuffle) and store it in a vector, say, p (e.g. 1 2 3 4 becomes p = [3 1 4 2]). This implies that line i of the new file is now line p[i] in the original file (i.e. in the above example, the 1st line of the new file is the 3rd line of the original file).

Main

  1. Create a new file

  2. Write the first line in the new file by reading the text in the original file between s[p[0]] and e[p[0]] and appending it to the new file.

  3. Continue as in step 2 for all the other lines.

So the overall complexity is linear in the number of lines (since random_shuffle is linear) if you assume read/write & seeking in a file (incrementing the file pointer) are all constant time operations.
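
A rough Python sketch of the same two-pass idea (the filenames are just placeholders, and it assumes the two offset vectors fit in memory):

import random

# Pass 1: record the byte offsets of the start and end of every line.
starts, ends = [], []
with open("input.txt", "rb") as f:
    offset = 0
    for line in f:
        starts.append(offset)
        offset += len(line)
        ends.append(offset)

# Randomly permute the line indices 0 .. n-1 (the analogue of random_shuffle).
order = list(range(len(starts)))
random.shuffle(order)

# Pass 2 (main step): seek to each line in permuted order and append it to the new file.
with open("input.txt", "rb") as src, open("shuffled.txt", "wb") as dst:
    for i in order:
        src.seek(starts[i])
        dst.write(src.read(ends[i] - starts[i]))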

Jacob

You can create an array of N strings and read the first N lines of the file into this array. For the rest of the file, you read one line, select one of the stored lines at random, write that stored line to the output, and replace it in the array with the newly read line. This has the advantage that you don't need to iterate over the file twice. The disadvantage is that it will not create a very random output, especially when N is low (for example, this algorithm can't move the last line more than N lines up in the output).

Edit

Just an example in python:

import sys
import random

CACHE_SIZE = 23  # N: how many lines are buffered at any one time

lines = {}

# Read line by line; when a buffer slot is reused, print the line it held
# before storing the newly read one.
for l in sys.stdin:  # you can replace sys.stdin with xrange(200) to get a test output
    i = random.randint(0, CACHE_SIZE - 1)
    old = lines.get(i)
    if old is not None:
        print old,
    lines[i] = l

# Flush whatever is still buffered at the end.
for ignored, p in lines.iteritems():
    print p,
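
Assuming the script is saved as shuffle.py (the name is just an example), it reads from stdin and writes to stdout:

python shuffle.py < input.txt > shuffled.txt

A larger CACHE_SIZE makes the output more thoroughly mixed, at the cost of holding more lines in memory at once.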
Rudi