First and foremost, I'd like to stress that this code would never be run in production (I am fully aware that there are dedicated solutions called time-series databases). I'm just doing this to keep my brain active by solving a fun project. Here's the problem I'm trying to solve.
Imagine I'm gathering a list of samples. Each of these samples is actually a structure (more on which later) that I can serialize and deserialize into a stream of bytes. The idea is as follows: I start with an empty file and record the first samples. Each portion of the samples can easily fit into memory (let's assume that there are 1000 of them each time I'd like to insert them). So because I start with an empty file, I would sort the new samples in memory and then simply write them to the file. On the second iteration, I'd do the same, but I'd append the new samples to the end of the file. Here comes the tricky part: I need to sort the file. I thought about a couple of ideas:
- reading the file back, sorting it in memory and writing it back to disk (which I would not like to do, given that the file can grow in size)
- chunking the file, so that even if there are, say, 1_000_000 samples recorded, I'd end up with 10 files with 100_000 samples in each chunk (which could be beneficial, I suppose, because I could very easily read a whole chunk back into memory, sort it there and save it back to disk, and I could introduce some modulo function to determine which chunk I should use; a rough sketch of what I mean is just below this list)
- sorting the file in place. That's the path I went down, since it seemed like the least resource-intensive solution.
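For the chunking idea, the modulo function I had in mind would be something as simple as this (just a sketch; the chunk count and file naming are arbitrary placeholders, nothing is implemented):

NUM_CHUNKS = 10

def chunk_path(timestamp: int) -> str:
    # pick one of NUM_CHUNKS files based on the sample's timestamp;
    # each chunk then holds roughly 1/NUM_CHUNKS of the samples and can be sorted on its own
    return f"chunk_{timestamp % NUM_CHUNKS:02d}.bin"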
Assumptions:
- the file on disk, prior to inserting new samples, is already sorted
- I have total control over the new samples, therefore I can sort them in memory as well.
Here's a piece of code I wrote for treating a file like a pseudo-array:
import os
import struct
from datetime import datetime
from typing import BinaryIO


class FileWrapper:
    """Expose a binary file of fixed-size records as a random-access sequence."""

    def __init__(self, file: BinaryIO, fmt: str):
        self.file = file
        self.fmt = fmt
        self.sizeof_struct = struct.calcsize(fmt)

    def __getitem__(self, index: int) -> int:
        buffer = self.read_at(index)
        _data = struct.unpack(self.fmt, buffer)
        # timestamp is always the first value
        return _data[0]

    def __len__(self) -> int:
        self.file.seek(0, os.SEEK_END)
        return self.file.tell() // self.sizeof_struct

    def __setitem__(self, index: int, content: bytes) -> None:
        self.file.seek(index * self.sizeof_struct)
        self.file.write(content)

    ## helper method
    def read_at(self, index: int) -> bytes:
        self.file.seek(index * self.sizeof_struct)
        buffer = self.file.read(self.sizeof_struct)
        return buffer

    def swap(self, i: int, j: int) -> None:
        # exchange two records on disk: two reads followed by two writes
        if i == j:
            return
        buffer_i = self.read_at(i)
        buffer_j = self.read_at(j)
        self[i] = buffer_j
        self[j] = buffer_i
I can use this class in the following way:
with open("input.bin", "r+b") as input_file:
    fileWrapper = FileWrapper(input_file, "<IIdI")
    ...
Because of how that's implemented, I can obtain the timestamp of a given sample via __getitem__
and treat the file as a sort of array (a random-access file, for lack of a better word).
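For reference, a single sample packed with that "<IIdI" format would be produced and appended roughly like this (the field names other than the timestamp are placeholders of mine; only the timestamp-first layout matters for ordering):

import struct

FMT = "<IIdI"   # little-endian; the timestamp is the first field

# placeholder samples: (timestamp, sensor_id, value, flags) - names are illustrative only
new_samples = [
    (1_700_000_060, 1, 21.7, 0),
    (1_700_000_000, 1, 21.5, 0),
]

# sort the in-memory batch by timestamp, then append the packed records to the file
new_samples.sort(key=lambda s: s[0])
with open("input.bin", "ab") as f:
    for sample in new_samples:
        f.write(struct.pack(FMT, *sample))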
So before I started the tests on an actual file, I wanted to try this out on a "pure" array containing a list of integers. Needless to say, there is a multitude of options for sorting algorithms, however most of them don't sort in place. What I wanted to avoid was a situation where I'd copy huge chunks of the file around. Here's the algorithm I came up with using a piece of paper (though I'm afraid it's a simple bubble sort):
def merge(arr, start, mid, end):
    # arr[start..mid] is assumed to be sorted ascending (the old data),
    # arr[mid+1..end] sorted descending (the new batch, sorted in reverse)
    start2 = end
    last_swap = 0
    swaps = 0
    lookups = 0
    while start2 > mid:
        lookups += 1
        if arr[start] <= arr[start2]:
            start += 1
        else:
            swaps += 1
            arr[start], arr[start2] = arr[start2], arr[start]
            if last_swap == 0:
                last_swap = start
            start += 1
        if start >= start2:
            start = last_swap
            start2 -= 1
    n = len(arr)
    print(f"lookups: {lookups}; swaps: {swaps} for n={n}")
What worries me about this is the number of lookups and swaps. Here's the example invocation and some stats:
data = [0, 1, 1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 13, 13, 13, 13, 13, 40]
new_items = [12, 5, 5, 5, 1, 1, 1, 0, 1, 18, 18, 8, 20]
new_items.sort(reverse=True)
num_new_items = len(new_items)
data.extend(new_items)
# pa() is a small helper of mine (not shown here) that pretty-prints the array
pa(data, num_new_items, "[input]")
size = len(data) - 1
merge(data, 0, size - num_new_items, size)
pa(data, num_new_items, "[final]")
[ ... ]
lookups: 326; swaps: 90 for n=33
[final]: [ 0 0 1 1 1 1 1 1 2 3 5 5 5 5 6 7 8 8 9 10] | [ 11 12 12 13 13 13 13 13 13 18 18 20 40] | start=None; start2=None
For 100_000 random integers and 1_000 "new" points, the stats are:
lookups: 95742262; swaps: 9716 for n=101000
which is really, really bad.
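Just to make the disk cost concrete, running the same pass against the file rather than a plain list would look roughly like this (a sketch only; the tuple swap becomes FileWrapper.swap(), because __setitem__ takes raw bytes while __getitem__ only returns the timestamp):

def merge_on_file(fw: FileWrapper, start: int, mid: int, end: int) -> None:
    # same walk as merge() above, but every comparison is a seek + read,
    # and every swap is two reads plus two writes through FileWrapper
    start2 = end
    last_swap = 0
    while start2 > mid:
        if fw[start] <= fw[start2]:
            start += 1
        else:
            fw.swap(start, start2)
            if last_swap == 0:
                last_swap = start
            start += 1
        if start >= start2:
            start = last_swap
            start2 -= 1


num_new_items = 1_000  # however many samples were just appended (sorted in reverse)
with open("input.bin", "r+b") as input_file:
    fw = FileWrapper(input_file, "<IIdI")
    size = len(fw) - 1
    merge_on_file(fw, 0, size - num_new_items, size)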
Is it possible that it could damage the hard drive? Or would it just be desperately slow? There's one more approach I thought of, which would be the following:
1/ iterate over the already sorted file, and iterate over the samples
2/ if the n-th sample is less than the n-th sample from the file, then
3a/ create a new file, copy bytes from the input file from 0 to the offset at position n
3b/ write samples until the value is greater than the n+1'th position in the original file
4/ repeat
Of course it violates the basic assumption that everything must be done in place. Though maybe it would be better for the hardware? I was basically worried that if the file were fairly large, say, 4 gigabytes worth of binary data, then I'd end up copying huge blocks of the file around the hard drive.
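In code, I imagine steps 1-4 boil down to a plain two-way merge into a temporary file, record by record rather than in block copies, roughly like this (a sketch only; the temporary file name and the assumption that the new samples arrive already packed and sorted by timestamp are mine):

import os
import struct

FMT = "<IIdI"
REC_SIZE = struct.calcsize(FMT)


def merge_into_new_file(path: str, new_records: list[bytes]) -> None:
    # new_records: packed samples, already sorted by timestamp (ascending)
    tmp_path = path + ".tmp"
    with open(path, "rb") as src, open(tmp_path, "wb") as dst:
        i = 0
        record = src.read(REC_SIZE)
        while record and i < len(new_records):
            # the timestamp is the first field of both the file record and the new sample
            if struct.unpack(FMT, record)[0] <= struct.unpack(FMT, new_records[i])[0]:
                dst.write(record)
                record = src.read(REC_SIZE)
            else:
                dst.write(new_records[i])
                i += 1
        # drain whichever side still has records left
        while record:
            dst.write(record)
            record = src.read(REC_SIZE)
        for leftover in new_records[i:]:
            dst.write(leftover)
    os.replace(tmp_path, path)  # swap the merged file into place

That would be one sequential read of the original file plus one sequential write, at the cost of temporarily needing disk space for a second copy.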
Initially I tried using radix sort, however I quickly realized that it would obliterate the hard drive, as it moves the data around a lot - that's the nature of it. I also tried insertion sort, merge sort and quicksort.
I'd like to stress that (to the best of my knowledge) this is not a typical sorting question. Basically, when we sort an array of integers in memory, both lookups and swaps are essentially "free", however once the disk is involved, I believe the cost of each lookup and swap is very different.