2

I have been looking for an idea of shuffling files on disk without being loaded into the memory. At the beginning, I doubt such an approach exists but recently I have come across this answer. Since this answer is not supported or voted, I would love to know if this code really does shuffling the file without loading into the memory. If so, HOW does that happen? I don't see how a file can be shuffled without loading it first into the memory!

Brian Tompsett - 汤莱恩
  • 5,753
  • 72
  • 57
  • 129
Dave
  • 554
  • 1
  • 6
  • 13

1 Answers1

4

I assume you're talking about shuffling lines in a text file.

I don't know if the linked answer by Jamie Cockburn works, but it looks totally reasonable to me. The idea is the following:

  • mmap doesn't load the whole file into memory but allows you to access its random parts by indexing via "from" and "to" bytes, as if it were a list loaded into memory
  • You do go twice through the file, but you don't load the file's content into memory
  • First time you pass through the file, you watch out for line breaks \n and store not the line but the byte numbers (or indices) corresponding to the addresses of each line's start and end. You effectively store two numbers per line
  • You now shuffle the list of indices called lines (remember, it contains only pairs (int, int))
  • Now you open a new file for writing, and iterate over the shuffled indices; for each index pair, you read a single line data[start:end+1] from the original file into memory and write it into the new file. You don't keep the line in memory for longer than this single operation.

This approach requires a memory amount linear in the number of lines in the input file. It can be much smaller than reading the whole file if the average line length is bigger than the amount of memory you need to store two integers.

Pavel
  • 7,436
  • 2
  • 29
  • 42