
I need to sort a large text file whose entries are separated by newline characters.

I need to assume that the input data is too big to fit into main memory, meaning I can only read and store one line of the file in memory at a time. Therefore, I can't build a list or array to use with a classic sorting algorithm (mergesort, quicksort, etc.), and because of that I'm stuck.

How could one approach this kind of problem?

meraki_1
  • Does this answer your question? [How can I sort a very large log file, too large to load into main memory?](https://stackoverflow.com/questions/41702262/how-can-i-sort-a-very-large-log-file-too-large-to-load-into-main-memory) – mkrieger1 Oct 12 '22 at 14:16
  • You can simulate a list by concatenating lines together with a null byte (or some other delimiter) and therefore have access to multiple lines while storing them as a single string. – CJK Oct 12 '22 at 16:48
  • With one line in memory, and one line in the reader you only have two lines to work with at one time. Simple sorting methods, such as insertion sort or bubble sort can work that way, though you will need to do a lot of reading and writing during the sort. Maybe just buy more memory. – rossum Oct 12 '22 at 17:57
  • Resources limiting operations to just *one line of the file*, concentrate on sorting algorithms *not* comparing keys. – greybeard Oct 13 '22 at 00:13
  • @rossum but bubble sort or insertion sort still create an array or list in my memory – meraki_1 Oct 13 '22 at 09:03
  • @greybeard could you please explain your logic further? – meraki_1 Oct 13 '22 at 09:03
  • @CJK and how would one do that? Can you give me an example maybe (link or code in the comments)? – meraki_1 Oct 13 '22 at 09:06
  • I cannot *explain* how comparing two keys is not possible when you can only have a single one in memory. – greybeard Oct 13 '22 at 09:21
  • It's quite a jump from `input data is too big to fit into main memory` to *I can only read and store **one** line of the file in the memory*. – greybeard Oct 13 '22 at 09:23
  • Please state/list the operations that *are* possible. – greybeard Oct 13 '22 at 09:24
  • @greybeard I need to assume that input data is too big to fit into main memory. I have two files that contain multiple entries separated by a newline character. I need to sort and then merge them into one new file on the same ID. That's why I think that by 'input data is too big to fit into main memory' means what I said, or am I wrong? If I am, why? Thank you. – meraki_1 Oct 13 '22 at 11:37

1 Answer


In practice, under normal circumstances, just use the Unix sort command.

LC_ALL=C sort file_in.txt > file_out.txt
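If the input is bigger than RAM, that still works: GNU sort does an external merge sort on its own, spilling sorted runs to temporary files. A couple of standard coreutils flags are worth knowing here (the buffer size and temp directory below are just placeholders):

LC_ALL=C sort -S 1G -T /path/with/space file_in.txt > file_out.txt
LC_ALL=C sort -m sorted_part1.txt sorted_part2.txt > merged.txt

-S caps the in-memory buffer, -T puts the temporary runs on a disk with room, and -m merges files that are already sorted without re-sorting them.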

For a very large file, I'd go distributed and use the sort built into MapReduce. See How does the MapReduce sort algorithm work? for how that one works. One tip: if you go distributed, stay distributed. That is, the input file, output file, and operation should all be on distributed filesystems, so that no single machine becomes a bottleneck.

Exactly once have I faced a situation where neither of those would work. I needed to sort and organize a dataset coming from a database, but the data was too big to sort in the database, and the machine I was on did not have space for the raw dataset.

I solved that one with a mergesort where all chunks above a certain size were kept in compressed files. The key logic was something like this:

chunks = []
for row in big query:
    last_chunk = new chunk from row
    while 0 < len(chunks):
        chunk = chunks.pop()
        if chunk.size < 1.2 * last_chunk.size:
            last_chunk = merge(chunk, last_chunk)
        else:
            chunks.append(chunk)  # big enough to leave alone, so put it back
            break
    chunks.append(last_chunk)
while 1 < len(chunks):
    chunks.append(merge(chunks.pop(), chunks.pop()))

And then in the merge logic I had the reasoning about whether a chunk should wind up in memory or in a compressed file, how to get the rows, and how to write them.
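To make that concrete, here is a minimal sketch of what such a chunk abstraction could look like (the names, the gzip compression, and the 64 KB spill threshold are illustrative choices for this sketch, not the code I actually ran). A chunk keeps its sorted rows in memory while small, spills them to a compressed temporary file once large, and merging streams two chunks with heapq.merge so only one row per side is in memory at a time:

import gzip
import heapq
from tempfile import NamedTemporaryFile

SPILL_THRESHOLD = 64 * 1024  # illustrative: beyond this many bytes a chunk lives in a compressed file

class Chunk:
    # One sorted run of rows; each row is assumed to end with "\n".
    def __init__(self, rows, size):
        self.size = size
        if size < SPILL_THRESHOLD:
            self.rows = list(rows)   # small chunk: keep the rows in memory
            self.path = None
        else:
            tmp = NamedTemporaryFile(suffix=".gz", delete=False)
            tmp.close()
            with gzip.open(tmp.name, "wt") as out:   # large chunk: stream the rows to a compressed file
                out.writelines(rows)
            self.rows = None
            self.path = tmp.name

    def __iter__(self):
        return iter(self.rows) if self.path is None else iter(gzip.open(self.path, "rt"))

def merge(a, b):
    # Both inputs are already sorted, so heapq.merge can stream them:
    # at most one row from each chunk is in memory at any moment.
    return Chunk(heapq.merge(a, b), a.size + b.size)

"new chunk from row" in the pseudocode above would then be Chunk([row], len(row)).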

(My problem was not so simple as this, because I was grouping, summing, merging, etc. Basically duplicating a report I couldn't run in the database because it didn't have enough working memory to run it.)

Those three cover every situation I've personally encountered with large files.


Here is a not-great but working implementation of an external, file-based mergesort.

First you need demo.txt that we want to sort:

hello
world
greetings
earthlings

And now Python code to print it in sorted order:

from tempfile import TemporaryFile

# A FileBuffer is one sorted run of lines, written once to a temporary file
# and then read back one line at a time.
class FileBuffer():
    def __init__ (self):
        self.current_line = None
        self.fh = TemporaryFile()
        self.state = 'writing'
        self.size = 0

    def add (self, line):
        if self.state != 'writing':
            raise Exception(f"Cannot write to FileBuffer in state {self.state}")
        self.size += len(line)
        self.fh.write(bytes(line, encoding='utf-8'))

    def finish_writing (self):
        self.fh.seek(0, 0)
        self.state = 'reading'
        self.fetch()
        return self.current_line

    def fetch (self):
        if self.state != 'reading':
            raise Exception(f"Cannot read from FileBuffer in state {self.state}")
        self.current_line = bytes.decode(self.fh.readline())
        if self.current_line == '':
            self.current_line = None
            self.state = 'done'
        return self.current_line

    def __iter__(self):
        return self

    def __next__(self):
        line = self.current_line
        if line is None:
            raise StopIteration
        else:
            self.fetch()
            return line

# BufferedSort collects single-line FileBuffers and repeatedly merges
# neighbouring buffers of similar size, mergesort-style.
class BufferedSort():
    def __init__ (self):
        self.buffers = []

    def merge (self):
        buffer1 = self.buffers.pop()
        buffer2 = self.buffers.pop()
        new_buffer = FileBuffer()

        while buffer1.current_line is not None and buffer2.current_line is not None:
            if buffer1.current_line < buffer2.current_line:
                new_buffer.add(buffer1.current_line)
                buffer1.fetch()
            else:
                new_buffer.add(buffer2.current_line)
                buffer2.fetch()

        while buffer1.current_line is not None:
            new_buffer.add(buffer1.current_line)
            buffer1.fetch()

        while buffer2.current_line is not None:
            new_buffer.add(buffer2.current_line)
            buffer2.fetch()

        new_buffer.finish_writing()
        self.buffers.append(new_buffer)

    def add (self, line):
        # Ensure every stored line ends with a newline so merged buffers can be
        # read back line by line even if the input file lacks a final newline.
        if not line.endswith("\n"):
            line += "\n"
        buffer = FileBuffer()
        buffer.add(line)
        buffer.finish_writing()
        self.buffers.append(buffer)
        # Keep run sizes roughly geometric: merge while the next-to-last
        # buffer is not much bigger than the one just added.
        while 1 < len(self.buffers) and self.buffers[-2].size < 1.2 * self.buffers[-1].size:
            self.merge()

    def finish_writing(self):
        # Merge until a single sorted buffer remains.
        while 1 < len(self.buffers):
            self.merge()

    def sorted_buffer(self):
        self.finish_writing()
        if len(self.buffers):
            return self.buffers[0]
        else:
            buffer = FileBuffer()
            buffer.state = 'done'
            return buffer

    def __iter__(self):
        return self.sorted_buffer()

sorter = BufferedSort()
with open("demo.txt") as fh:
    for line in fh:
        sorter.add(line)
for line in sorter:
    print(line, end="")
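Run against the demo.txt above, this prints the lines in sorted order:

earthlings
greetings
hello
world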
btilly
  • My problem is that if I go with for example a mergesort, because I must assume that input data is too big to fit into main memory (can only read one line at the time), my "chunks" would be made of one line right? Which is the reason I didn't go with mergesort or any other classic algorithm. – meraki_1 Oct 13 '22 at 09:15
  • @meraki_1 You need to be able to hold 2 lines in memory at once for the merge operation. But that's it. With the logic I gave your chunks will be of size 1, 2, 4, 8, 16, etc. But they can be written to files, then re-read when you merge. – btilly Oct 13 '22 at 10:18
  • Thank you. I am new to this kind of problems, so can I ask you please to explain how they "can be written to files, then re-read when you merge"? If you can maybe give an example like in your answer, that would be great. – meraki_1 Oct 13 '22 at 11:40
  • @meraki_1 I added a concrete implementation of an external mergesort. It is not very efficient, but it is a working demo. – btilly Oct 13 '22 at 17:30