All but one of these ideas use O(N) memory, but if you use an array.array or numpy.ndarray for the index list, that's only around N*4 bytes, which is significantly smaller than the whole file. (I'll use a plain list for simplicity; if you need help converting to a more compact type, I can show that too.)
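For the compact index list mentioned above, here is a minimal sketch. It assumes the indexes fit in the platform's unsigned int (typecode 'I', usually 4 bytes per item); the function name make_shuffled_indexes is made up for illustration.

```python
import array
import random

def make_shuffled_indexes(linecount):
    """Return the indexes 0..linecount-1 in random order as a compact array."""
    # Typecode 'I' is an unsigned int: typically 4 bytes per item, instead of
    # a pointer plus a separate int object per item in a plain list.
    indexes = array.array('I', range(linecount))
    # random.shuffle works in place on any mutable sequence, arrays included.
    random.shuffle(indexes)
    return indexes

indexes = make_shuffled_indexes(1000)
print(indexes.itemsize * len(indexes), "bytes of index data")
```

The result can be iterated exactly like the plain list in the examples, so nothing else needs to change.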
Using a temporary database and an index list:
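    import contextlib
    import dbm
    import os
    import random
    import tempfile

    # Some dbm backends add their own file suffixes, so work in a temp directory that is removed at the end.
    with tempfile.TemporaryDirectory() as tmpdir:
        with contextlib.closing(dbm.open(os.path.join(tmpdir, 'temp.db'), 'n')) as db:
            linecount = 0
            with open(path, 'rb') as f:  # binary mode, because dbm hands values back as bytes
                for i, line in enumerate(f):
                    db[str(i)] = line
                    linecount = i + 1
            shuffled = list(range(linecount))
            random.shuffle(shuffled)  # shuffles in place and returns None
            with open(path + '.shuffled', 'wb') as f:
                for i in shuffled:
                    f.write(db[str(i)])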
This is 2N single-line disk operations plus 2N single-key dbm operations. Each dbm operation should cost about log N disk operations, so that's roughly 2N log N disk-operation equivalents, and the total complexity is O(N log N).
If you use a relational database like sqlite3 instead of a dbm, you don't even need the index list, because you can just do this:

    SELECT * FROM Lines ORDER BY RANDOM()
This has the same time complexity as the above, and the space complexity is O(1) instead of O(N), at least in theory. In practice, you need an RDBMS that can feed you one row at a time from a 100M-row set without buffering all 100M rows on either side.
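Here is a rough sketch of the relational version using the standard sqlite3 module. The function name shuffle_lines_sqlite, the single-column schema, and the database file name lines.db are my assumptions. SQLite still has to sort the rows internally, so treat the O(1) memory claim as something to measure rather than rely on.

```python
import os
import sqlite3
import tempfile

def shuffle_lines_sqlite(path, out_path):
    """Write the lines of path to out_path in random order via a temp SQLite db."""
    with tempfile.TemporaryDirectory() as tmpdir:
        conn = sqlite3.connect(os.path.join(tmpdir, 'lines.db'))
        try:
            conn.execute('CREATE TABLE Lines (line TEXT)')
            with open(path) as f:
                # executemany pulls from the generator one line at a time.
                conn.executemany('INSERT INTO Lines (line) VALUES (?)',
                                 ((line,) for line in f))
            conn.commit()
            with open(out_path, 'w') as out:
                # Iterating the cursor fetches rows incrementally on the Python side.
                query = 'SELECT line FROM Lines ORDER BY RANDOM()'
                for (line,) in conn.execute(query):
                    out.write(line)
        finally:
            # Close before the temp directory is removed.
            conn.close()
```

Calling shuffle_lines_sqlite(path, path + '.shuffled') would produce the same kind of output file as the dbm version.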
A different option, without using a temporary database. In theory this is O(N**2), but in practice it may be faster if you happen to have enough memory for the line cache to be helpful:

    import linecache
    import random

    with open(path) as f:
        linecount = sum(1 for _ in f)
    # linecache numbers lines from 1, not 0.
    shuffled = list(range(1, linecount + 1))
    random.shuffle(shuffled)  # shuffles in place and returns None
    with open(path + '.shuffled', 'w') as f:
        for i in shuffled:
            f.write(linecache.getline(path, i))
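One detail worth checking with linecache: getline numbers lines starting from 1, and it returns an empty string for an out-of-range line number instead of raising. A small sketch with a throwaway file (the name demo.txt and its contents are made up for illustration):

```python
import linecache
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    demo_path = os.path.join(tmpdir, 'demo.txt')
    with open(demo_path, 'w') as f:
        f.write('first\nsecond\nthird\n')

    first = linecache.getline(demo_path, 1)    # line numbers start at 1
    zeroth = linecache.getline(demo_path, 0)   # out of range: empty string
    beyond = linecache.getline(demo_path, 4)   # past the end: empty string
    linecache.clearcache()                     # drop the cached file contents

print(repr(first), repr(zeroth), repr(beyond))
```

So an index list built with a zero-based range would silently drop the last line and write nothing for index 0.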
Finally, by doubling the size of the index list, we can eliminate the temporary disk storage. But in practice, this might be a lot slower, because you're doing a lot more random-access reads, which drives aren't nearly as good at.
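    import random

    with open(path, 'rb') as f:
        # Binary mode, so offsets are plain byte counts that seek() accepts.
        linestarts = [0]
        for line in f:
            linestarts.append(linestarts[-1] + len(line))
        # Pair each line's start with the next line's start (or the end of the file).
        lineranges = list(zip(linestarts, linestarts[1:]))
        random.shuffle(lineranges)  # shuffles in place and returns None
        with open(path + '.shuffled', 'wb') as f1:
            for start, stop in lineranges:
                f.seek(start)
                f1.write(f.read(stop - start))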
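If memory for the index matters, the same idea can be sketched with compact arrays instead of a list of pairs: one array of N+1 line-start offsets plus one array holding the shuffled order. This is only a sketch under my own assumptions: the function name shuffle_lines_by_offset is made up, and typecode 'Q' (unsigned 8-byte integers) is chosen so that offsets past 4 GiB still fit.

```python
import array
import random

def shuffle_lines_by_offset(path, out_path):
    """Copy the lines of path to out_path in random order using byte offsets."""
    # offsets[i] is where line i starts; the final entry is the end of the file.
    offsets = array.array('Q', [0])
    with open(path, 'rb') as src:
        for line in src:
            offsets.append(offsets[-1] + len(line))
        # One index per line, shuffled in place.
        order = array.array('Q', range(len(offsets) - 1))
        random.shuffle(order)
        with open(out_path, 'wb') as dst:
            for i in order:
                # Random-access read of exactly one line.
                src.seek(offsets[i])
                dst.write(src.read(offsets[i + 1] - offsets[i]))
```

The random-access reads are still the slow part, so this only trims memory use; it does nothing for the seek cost.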