Given a 10GB file full of names, the task is to run a sort on the file, and the system has only 2GB of RAM. How will you do it?
Is it possible to load or process the 10GB file with 2GB RAM?
The solution is an external merge sort: divide the file into parts that fit in memory, sort each part individually and write it to a temporary file, then merge the temporary files.
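The idea can be sketched with standard coreutils (file names and the chunk size are placeholders; `split` and `sort -m` are part of GNU coreutils):

```shell
# 1. Split the big file into chunks small enough to sort in RAM.
split -l 10000000 names.txt chunk_

# 2. Sort each chunk individually into its own temporary file.
for f in chunk_*; do
    sort "$f" -o "$f.sorted"
done

# 3. Merge the pre-sorted chunks in a single streaming pass.
sort -m chunk_*.sorted -o names.sorted

# 4. Clean up the temporary chunks.
rm chunk_*
```

`sort -m` only merges files that are already sorted, so the final pass reads each chunk sequentially and never holds more than one line per chunk in memory.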
For a more detailed description, see the Wikipedia article on Merge Sort.
For the record, relying on virtual memory will not scale. Unless you design your sort algorithm very carefully, you will push the system into catastrophic virtual-memory thrashing.
Sorting a compressed file won't work either.
Is it possible to load a 10GB file with 2GB RAM?
Interpreting that as a general question, the answer is No. If you have a problem that requires (significantly) more RAM than you have, then you need to employ an algorithm that divides the problem into smaller ones. If you can't find an algorithm that works that way, you are in for a hard time.
Is it possible to process a 10GB file with 2GB RAM?
Yes ... provided that your processing doesn't require you to hold the entire 10GB file in memory at the same time.
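For example, any line-at-a-time pass runs in constant memory regardless of file size (the file name and the pattern are placeholders):

```shell
# Each command reads the file one line/buffer at a time, so memory
# use stays flat no matter how large names.txt is.
grep -c 'Smith' names.txt    # count lines containing "Smith"
wc -l < names.txt            # count all names
```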
As per the comments on the original question, it is a 10GB file with one name per line. This is easily handled with coreutils' sort, which should be present on virtually every Linux system:
sort yourFile
Now, given that the file contains names in the form FirstName LastName, this would sort the file by first name, which is probably not what you want. In that case, you have to tell sort to use the second field as the key (with -k 2, the key runs from the second field to the end of the line):
sort -k 2 yourFile
Next, you want to make sure that case is ignored (to cope with typos or unusual surnames) and that the sort order matches the one used in dictionaries. The former is achieved with the -f flag, the latter with the -d flag:
sort -k 2 -f -d yourFile
Last but not least, you might want to write the result to a file, which can be done with the -o flag (for the curious: I found this to be faster than output redirection with >):
sort -k 2 -o sortedFile -f -d yourFile
As for RAM usage: GNU sort is itself an external sort, so it will not try to hold the whole file in memory; it sorts chunks into temporary files and merges them, and its buffer size can be capped with the -S flag. It may, however, take quite a while to produce results.
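If you want an explicit bound on memory, GNU sort's -S (--buffer-size) option caps the in-memory buffer; a sketch combining it with the flags above (file names are placeholders, and -S is a GNU extension):

```shell
# Cap sort's in-memory buffer at 1 GB; input beyond that is
# spilled to temporary files and merged in a final pass.
sort -k 2 -f -d -S 1G -o sortedFile yourFile
```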
I don't know if it will help, but with the Haskell language you can handle files much bigger than your system's RAM using lazy evaluation, meaning that
Data is only read from the Handle as the elements (characters) of the list are processed. As elements of the String are no longer used, Haskell's garbage collector automatically frees that memory.
Bryan O'Sullivan, Don Stewart and John Goerzen explain how it is done in their book Real World Haskell.