I have a small custom database and am wondering whether I should handle data updates differently:
Currently, the structure I write to a file on the HD looks like this:
Header(uniqueID,lengthOfDataInBytes,HeaderChecksum)
data
In the file there are thousands of these structures, and the data part is a few hundred KB on average.
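To make the format concrete, here is a minimal sketch of how I read and write one such structure (the exact field sizes and the CRC32 choice are illustrative; my real code differs in details):

```python
import struct
import zlib

# Illustrative header layout: uniqueID (8 bytes), data length (4 bytes),
# CRC32 of the first 12 bytes (4 bytes). Field sizes are an assumption.
HEADER = struct.Struct("<QLL")

def read_record(f):
    """Read one header+data structure at the current file position."""
    raw = f.read(HEADER.size)
    if len(raw) < HEADER.size:
        return None                        # end of file
    unique_id, length, checksum = HEADER.unpack(raw)
    if zlib.crc32(raw[:12]) != checksum:
        raise IOError("corrupt header")
    return unique_id, f.read(length)

def write_record(f, unique_id, data):
    """Write one structure at the current file position."""
    head = struct.pack("<QL", unique_id, len(data))
    f.write(head + struct.pack("<L", zlib.crc32(head)) + data)
```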
If I want to update or remove a structure, I read all the following structures into memory, write them back to the file starting at the position of the structure I want to update/remove, and clear my indexer dictionary. For an update I then append the updated structure to the end of the file; for a removal I do nothing further. Either way, I let my indexer run over the whole file again.
This works quite well: the usual file size is ~2 GB, and structures that have been updated are the most likely candidates to be updated again. Since updates append to the end, frequently updated structures accumulate there, which makes constant updates on structures at the very start of the file very unlikely.
However, I am not prepared for the case where a user's file is bigger than their RAM, and I assume that scenario would break my current way of updating/removing parts.
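This is roughly what one update does today, using the helpers from above (simplified; buffering the whole tail in one piece is exactly the part that breaks once the tail no longer fits in RAM):

```python
def update_record(f, offset, unique_id, new_data, file_end):
    """Replace the structure at `offset` (file opened in 'r+b' mode).
    `file_end` is the current file size."""
    f.seek(offset)
    read_record(f)                        # step over the old structure
    tail = f.read(file_end - f.tell())    # every following structure -> RAM
    f.seek(offset)
    f.write(tail)                         # close the gap
    write_record(f, unique_id, new_data)  # append the updated version
    f.truncate()                          # drop leftover bytes at the end
    # ...then clear the indexer dictionary and re-scan the whole file.
```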
Is there a common practice for how this should be solved? The alternatives I have in mind are:
- Overwrite the header of an updated/removed structure with a 'skip this sector' marker, keep the stale bytes in the file as dead space, and append the updated version to the end (see the first sketch after this list). Upside: I don't have to read and rewrite all the following sectors. Downside: I have to decide on a good time to run a cleanup routine.
- Split the database into multiple files of a fixed size and add a file pointer for the needed sector to my indexer, keeping my old way of updating/removing (second sketch after this list). Upside: needs no further cleanup work. Downside: adds another level of abstraction.
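For the first alternative, this is roughly what I picture; the reserved ID value as the tombstone is an illustrative choice, any flag my indexer can recognize would do. An update would then be `mark_dead` on the old offset plus one `write_record` at the end, and only the cleanup routine rewrites the file:

```python
TOMBSTONE_ID = 0xFFFFFFFFFFFFFFFF   # reserved uniqueID meaning "skip this sector"

def mark_dead(f, offset):
    """Overwrite the header in place so readers skip the stale data."""
    f.seek(offset)
    _, length, _ = HEADER.unpack(f.read(HEADER.size))
    head = struct.pack("<QL", TOMBSTONE_ID, length)   # keep length so readers can skip
    f.seek(offset)
    f.write(head + struct.pack("<L", zlib.crc32(head)))

def compact(src_path, dst_path):
    """The cleanup routine: stream only live structures into a fresh file."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while (rec := read_record(src)) is not None:
            unique_id, data = rec
            if unique_id != TOMBSTONE_ID:
                write_record(dst, unique_id, data)
```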
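For the second alternative, the only structural change is that the indexer entry grows a file component; the old shift-down update then never has to buffer more than one segment's tail, which is bounded by the fixed segment size (the size and naming scheme below are arbitrary):

```python
SEGMENT_SIZE = 256 * 1024 * 1024          # e.g. 256 MB per file: an assumption

# indexer now maps uniqueID -> (segment number, offset inside that segment)
index: dict[int, tuple[int, int]] = {}

def segment_path(segment: int) -> str:
    return f"db.{segment:04d}"            # hypothetical naming scheme
```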
How is this commonly handled? Are there any better solutions to this problem out there?
Edit: Please stop suggesting the use of SQL. I tried it, and it performs far worse than my currently working solution. If that is hard to believe, consider this:
- I have no redundant memory buffers on either side.
- I hold references to my buffered data.
- I don't need to waste extra cycles on query strings.
- I can fill the delays of HD reads/writes with de-/serialization work on data that has already been read or is about to be written, instead of waiting for the database to return my query result or having to do all of that before passing anything to SQL (this has by far the biggest impact; see the sketch below).
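To illustrate that last point, a minimal sketch of the overlap (a reader thread pulls raw structures off the disk while the consumer deserializes the previous ones; `deserialize` stands in for whatever format the data part uses, and the queue bound is arbitrary):

```python
import queue
import threading

def read_all_overlapped(path, deserialize):
    """Generator yielding deserialized records; disk I/O and CPU work overlap."""
    q = queue.Queue(maxsize=64)           # bounded so the reader can't run far ahead

    def reader():
        with open(path, "rb") as f:
            while (rec := read_record(f)) is not None:
                q.put(rec)
        q.put(None)                       # sentinel: reader is done

    threading.Thread(target=reader, daemon=True).start()
    while (rec := q.get()) is not None:
        unique_id, data = rec
        yield unique_id, deserialize(data)   # CPU work overlaps the next read
```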