First and foremost, I'd like to stress that this code would never be run in production (I am fully aware that there are dedicated solutions called time-series databases). I'm just doing this to keep my brain active by solving a fun project. Here's the problem I'm trying to solve.
Imagine I'm gathering a list of samples. Each of these samples is actually a structure (more on which later) that I can serialize and deserialize into a stream of bytes. The idea is as follows: I start with an empty file and record the first samples. Each portion of the samples can easily fit into memory (let's assume that there are 1000 of them each time I'd like to insert them). So because I start with an empty file, I would sort the new samples in memory and then simply write them to the file. On the second iteration, I'd do the same, but I'd append the new samples to the end of the file. Here comes the tricky part: I need to sort the file. I thought about a couple of ideas:
- reading the file back, sorting it in memory and writing it back to disk (which I would not like to do, given that the file can grow in size)
- chunking the file, so that even if there are, say, 1_000_000 samples recorded, I'd end up with 10 files with 100_000 samples in each chunk (which could be beneficial, I suppose, because I could very easily read a whole chunk back into memory, sort it there and save it back to disk, and I could introduce some modulo function to determine which chunk I should use; a rough sketch of what I mean is just below this list)
- sorting the file in place. That's the path I went down, since it seemed like the least resource-intensive solution.
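For the chunking idea, the modulo function I had in mind would be something as simple as this (just a sketch; the chunk count and file naming are arbitrary placeholders, nothing is implemented):

NUM_CHUNKS = 10

def chunk_path(timestamp: int) -> str:
    # pick one of NUM_CHUNKS files based on the sample's timestamp;
    # each chunk then holds roughly 1/NUM_CHUNKS of the samples and can be sorted on its own
    return f"chunk_{timestamp % NUM_CHUNKS:02d}.bin"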
Assumptions:
- the file on disk, prior to inserting new samples, is already sorted
- I have total control over the new samples, therefore I can sort them in memory as well.
Here's a piece of code I wrote for treating a file like a pseudo-array:
import os
import struct
from datetime import datetime
from typing import BinaryIO


class FileWrapper:
    """Expose a binary file of fixed-size records as a random-access sequence."""

    def __init__(self, file: BinaryIO, fmt: str):
        self.file = file
        self.fmt = fmt
        self.sizeof_struct = struct.calcsize(fmt)

    def __getitem__(self, index: int) -> int:
        buffer = self.read_at(index)
        _data = struct.unpack(self.fmt, buffer)
        # timestamp is always the first value
        return _data[0]

    def __len__(self) -> int:
        self.file.seek(0, os.SEEK_END)
        return self.file.tell() // self.sizeof_struct

    def __setitem__(self, index: int, content: bytes) -> None:
        self.file.seek(index * self.sizeof_struct)
        self.file.write(content)

    ## helper method
    def read_at(self, index: int) -> bytes:
        self.file.seek(index * self.sizeof_struct)
        buffer = self.file.read(self.sizeof_struct)
        return buffer

    def swap(self, i: int, j: int) -> None:
        # exchange two records on disk: two reads followed by two writes
        if i == j:
            return
        buffer_i = self.read_at(i)
        buffer_j = self.read_at(j)
        self[i] = buffer_j
        self[j] = buffer_i
I can use this class in the following way:
with open("input.bin", "r+b") as input_file:
    fileWrapper = FileWrapper(input_file, "<IIdI")
    ...
Because of how that's implemented, I can obtain the timestamp of a given sample via __getitem__
and treat the file as a sort of array (a random-access file, for lack of a better word).
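For reference, a single sample packed with that "<IIdI" format would be produced and appended roughly like this (the field names other than the timestamp are placeholders of mine; only the timestamp-first layout matters for ordering):

import struct

FMT = "<IIdI"   # little-endian; the timestamp is the first field

# placeholder samples: (timestamp, sensor_id, value, flags) - names are illustrative only
new_samples = [
    (1_700_000_060, 1, 21.7, 0),
    (1_700_000_000, 1, 21.5, 0),
]

# sort the in-memory batch by timestamp, then append the packed records to the file
new_samples.sort(key=lambda s: s[0])
with open("input.bin", "ab") as f:
    for sample in new_samples:
        f.write(struct.pack(FMT, *sample))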
So before I started the tests on an actual file, I wanted to try this out on a "pure" array containing a list of integers. Needless to say, there is a multitude of options for sorting algorithms, however most of them don't sort in place. What I wanted to avoid was a situation where I'd copy huge chunks of the file around. Here's the algorithm I came up with using a piece of paper (though I'm afraid it's a simple bubble sort):
def merge(arr, start, mid, end):
    # arr[start..mid] is assumed to be sorted ascending (the old data),
    # arr[mid+1..end] sorted descending (the new batch, sorted in reverse)
    start2 = end
    last_swap = 0
    swaps = 0
    lookups = 0
    while start2 > mid:
        lookups += 1
        if arr[start] <= arr[start2]:
            start += 1
        else:
            swaps += 1
            arr[start], arr[start2] = arr[start2], arr[start]
            if last_swap == 0:
                last_swap = start
            start += 1
        if start >= start2:
            start = last_swap
            start2 -= 1
    n = len(arr)
    print(f"lookups: {lookups}; swaps: {swaps} for n={n}")
What worries me about this is the number of lookups and swaps. Here's the example invocation and some stats:
data = [0, 1, 1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 13, 13, 13, 13, 13, 40]
new_items = [12, 5, 5, 5, 1, 1, 1, 0, 1, 18, 18, 8, 20]
new_items.sort(reverse=True)
num_new_items = len(new_items)
data.extend(new_items)
# pa() is a small helper of mine (not shown here) that pretty-prints the array
pa(data, num_new_items, "[input]")
size = len(data) - 1
merge(data, 0, size - num_new_items, size)
pa(data, num_new_items, "[final]")
[ ... ]
lookups: 326; swaps: 90 for n=33
[final]: [ 0 0 1 1 1 1 1 1 2 3 5 5 5 5 6 7 8 8 9 10] | [ 11 12 12 13 13 13 13 13 13 18 18 20 40] | start=None; start2=None
For 100_000 random integers and 1_000 "new" points, the stats are:
lookups: 95742262; swaps: 9716 for n=101000
which is really, really bad.
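Just to make the disk cost concrete, running the same pass against the file rather than a plain list would look roughly like this (a sketch only; the tuple swap becomes FileWrapper.swap(), because __setitem__ takes raw bytes while __getitem__ only returns the timestamp):

def merge_on_file(fw: FileWrapper, start: int, mid: int, end: int) -> None:
    # same walk as merge() above, but every comparison is a seek + read,
    # and every swap is two reads plus two writes through FileWrapper
    start2 = end
    last_swap = 0
    while start2 > mid:
        if fw[start] <= fw[start2]:
            start += 1
        else:
            fw.swap(start, start2)
            if last_swap == 0:
                last_swap = start
            start += 1
        if start >= start2:
            start = last_swap
            start2 -= 1


num_new_items = 1_000  # however many samples were just appended (sorted in reverse)
with open("input.bin", "r+b") as input_file:
    fw = FileWrapper(input_file, "<IIdI")
    size = len(fw) - 1
    merge_on_file(fw, 0, size - num_new_items, size)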
Is it possible that it could damage the hard drive? Or would it just be desperately slow? There's one more approach I thought of, which would be the following:
1/ iterate over the already sorted file, and iterate over the samples
2/ if the n-th sample is less than the n-th sample from the file, then
3a/ create a new file, copy bytes from the input file from 0 to the offset at position n
3b/ write samples until the value is greater than the n+1'th position in the original file
4/ repeat
Of course it violates the basic assumption that everything must be done in place. Though maybe it would be better for the hardware? I was basically worried that if the file were fairly large, say, 4 gigabytes worth of binary data, then I'd end up copying huge blocks of the file around the hard drive.
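In code, I imagine steps 1-4 boil down to a plain two-way merge into a temporary file, record by record rather than in block copies, roughly like this (a sketch only; the temporary file name and the assumption that the new samples arrive already packed and sorted by timestamp are mine):

import os
import struct

FMT = "<IIdI"
REC_SIZE = struct.calcsize(FMT)


def merge_into_new_file(path: str, new_records: list[bytes]) -> None:
    # new_records: packed samples, already sorted by timestamp (ascending)
    tmp_path = path + ".tmp"
    with open(path, "rb") as src, open(tmp_path, "wb") as dst:
        i = 0
        record = src.read(REC_SIZE)
        while record and i < len(new_records):
            # the timestamp is the first field of both the file record and the new sample
            if struct.unpack(FMT, record)[0] <= struct.unpack(FMT, new_records[i])[0]:
                dst.write(record)
                record = src.read(REC_SIZE)
            else:
                dst.write(new_records[i])
                i += 1
        # drain whichever side still has records left
        while record:
            dst.write(record)
            record = src.read(REC_SIZE)
        for leftover in new_records[i:]:
            dst.write(leftover)
    os.replace(tmp_path, path)  # swap the merged file into place

That would be one sequential read of the original file plus one sequential write, at the cost of temporarily needing disk space for a second copy.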
Initially I tried using radix sort, however I quickly realized that it would obliterate the hard drive, as it moves the data around a lot - that's the nature of it. I also tried insertion sort, merge sort and quicksort.
I'd like to stress that (to the best of my knowledge) this is not a typical sorting question. Basically, when we sort an array of integers in memory, both lookups and swaps are essentially "free", however once the disk is involved, I believe the cost of each lookup and swap is very different.