
I have a small custom database and am curious whether I should handle data updates differently:

Currently, the structure I write to a file on the HD is laid out like this:

Header(uniqueID, lengthOfDataInBytes, HeaderChecksum)
data
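
A minimal sketch of how writing such a record could look in Java; the field order mirrors the layout above, but the checksum scheme (a CRC32 over the header fields) and all identifiers are only illustrative:

```java
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Illustrative sketch only; field names and the checksum scheme are assumptions.
final class RecordWriter {

    // Writes Header(uniqueID, lengthOfDataInBytes, HeaderChecksum) followed by the data.
    static void writeRecord(DataOutputStream out, long uniqueId, byte[] data) throws IOException {
        ByteBuffer header = ByteBuffer.allocate(Long.BYTES + Integer.BYTES);
        header.putLong(uniqueId).putInt(data.length);

        CRC32 crc = new CRC32();
        crc.update(header.array());       // checksum covers the header fields only

        out.writeLong(uniqueId);          // uniqueID
        out.writeInt(data.length);        // lengthOfDataInBytes
        out.writeLong(crc.getValue());    // HeaderChecksum
        out.write(data);                  // payload
    }
}
```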

The file contains thousands of these structures, and the data part is a few hundred KB on average.

If I want to update or remove a structure, I read all of the following structures into memory, write them back to the file starting at the position of the structure I want to update/remove, clear my indexer dictionary, then either append the updated structure to the end of the file (or do nothing in the case of a removal) and let my indexer run over the whole file again.
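
Roughly, that shift-and-append delete looks like the following sketch (Java with a `RandomAccessFile`; names are illustrative, not my actual code):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Rough sketch of the delete path described above (identifiers are illustrative).
final class TailShiftDelete {

    // Removes the record at recordOffset (recordSize bytes including its header) by
    // loading the entire tail of the file into memory and writing it back shifted.
    static void deleteRecord(RandomAccessFile file, long recordOffset, long recordSize) throws IOException {
        long tailStart = recordOffset + recordSize;
        long tailLength = file.length() - tailStart;

        byte[] tail = new byte[(int) tailLength];   // this buffer is what breaks once the tail exceeds RAM
        file.seek(tailStart);
        file.readFully(tail);

        file.seek(recordOffset);
        file.write(tail);
        file.setLength(recordOffset + tailLength);  // truncate the now-duplicated end

        // ...then clear the indexer dictionary and re-scan the file to rebuild it.
    }
}
```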

This works quite well, as the usual file size is ~2 GB and structures that have been updated are the most likely candidates to be updated again, making constant updates on structures at the very start of the file very unlikely.

However, I am not prepared for the case where a user's file is bigger than his RAM, and I guess that scenario would break my current setup for updating/removing parts?

Is there a common practice for how this should be solved? Alternatives I have in mind would be:

  • Overwrite the header of an updated/removed structure with a 'skip this sector' command, keep it in the file as junk data and append the updated version to the end (see the sketch after this list). The upside is that I don't have to read all the following sectors. The downside is that I have to decide on a good time to run a cleanup routine.

  • Split the database into multiple files of a fixed size and add a file pointer for the needed sector to my indexer, keeping my old way of updating/removing. Upside: doesn't need further cleanup work. Downside: adds another level of abstraction.
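
To make the first alternative concrete, here is a rough sketch of the tombstone idea, assuming the header gains a one-byte status flag at the start of each record (that flag and all names are purely illustrative):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Sketch of the "skip this sector" alternative. It assumes each record starts with a
// one-byte status flag; a later compaction pass copies only LIVE records to a new file.
final class TombstoneUpdate {
    static final byte LIVE = 0;
    static final byte DEAD = 1;   // "skip this sector"

    // Marks the old copy as dead and appends the new version at the end of the file.
    // Returns the offset of the appended record so the indexer entry can be updated.
    static long replaceRecord(RandomAccessFile file, long oldRecordOffset, byte[] newRecordBytes) throws IOException {
        file.seek(oldRecordOffset);
        file.writeByte(DEAD);                 // overwrite just the flag; the rest stays as junk

        long newOffset = file.length();
        file.seek(newOffset);
        file.write(newRecordBytes);           // append the updated record (flag + header + data)
        return newOffset;
    }
}
```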

How is this commonly handled? Are there any better solutions to this problem out there?

Edit: Please stop suggesting the use of SQL. I tried it and it performs far worse than my currently working solution. If that is hard to believe, consider this:

  • I have no redundant memory buffers on either side.
  • I hold the references to my buffered data.
  • I don't need to waste extra cycles on query strings.
  • I can fill the delays in HD read/write time by already doing some de-/serialization work on the data that has just been read or is about to be written, and don't have to wait for the database to return my query result / have to do all of that before I pass it to SQL. (This has by far the biggest impact.)
user3488765
  • What's the motivation to load everything into RAM? The most performant option for this kind of operation is a memory-mapped file (`FileChannel#map`). – Marko Topolnik Nov 15 '16 at 08:58
  • Out of curiosity: why aren't you using a 'real' database engine? Something like SQLite seems like what you would want to use. – Steffen Winkler Nov 15 '16 at 09:01
  • @SteffenWinkler because I tried it and it was slower by about a factor of 100 – user3488765 Nov 15 '16 at 09:05
  • @MarkoTopolnik I am not loading everything into RAM. When I read, I only read the wanted sectors; when I write, I only read the sectors following the one I want to update – user3488765 Nov 15 '16 at 09:07
  • That's still O(n) if you load the entire suffix following the deleted/updated sector. There is no need for that. Otherwise, if you're not doing that, then I don't understand your question, which is about using too much RAM. – Marko Topolnik Nov 15 '16 at 09:09
  • @user3488765 That seems very unlikely unless you write data with multiple threads. That's about the only case where I ever saw SQLite being the slower one, as, at least at the time, it didn't support that. – Steffen Winkler Nov 15 '16 at 09:09
  • @user3488765 I am very much convinced: if your experiments showed you that using a DB slows you down by a factor of 100 ... then you did something wrong. I find your approach of dealing with such enormous amounts of data "manually" by re-inventing your **own** database layer to be, well, "one interesting idea". – GhostCat Nov 15 '16 at 09:13
  • @GhostCat My database uses spatial indexing; getting a sector (edit: and the contained data into the needed format) is an O(1) operation for me. I don't know what relational databases like SQL do to find the data, but for me it was slower – user3488765 Nov 15 '16 at 09:26

1 Answer


Consider replacing a custom file format with an actual database such as SQLite. (Or maybe even a client/server database such as MySQL or SQL Server.)

At the cost of additional implementation effort you get the following benefits:

  • Tested and proven code handling your data.
  • Random access to data (the db does the indexing of records for you) meaning fast inserts/updates/deletes.

In your case, the uniqueID would become the table's primary key; you could drop the checksum and length-of-data columns and make the data column a BLOB or TEXT (depending on content).
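
For illustration only, a minimal JDBC sketch of that mapping (table and column names are placeholders, and it assumes a SQLite JDBC driver such as org.xerial's sqlite-jdbc on the classpath):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

// Placeholder schema: unique_id takes over the role of the uniqueID header field,
// data holds the payload; length and integrity are handled by the engine.
public class SqliteSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:records.db")) {
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE TABLE IF NOT EXISTS records (" +
                           "  unique_id INTEGER PRIMARY KEY," +
                           "  data      BLOB NOT NULL)");
            }
            // Insert or update a record; the engine takes care of placement and indexing.
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT OR REPLACE INTO records (unique_id, data) VALUES (?, ?)")) {
                ps.setLong(1, 42L);
                ps.setBytes(2, new byte[]{1, 2, 3});
                ps.executeUpdate();
            }
        }
    }
}
```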

Misza
  • Tried it; it is about a hundred times slower. Furthermore, it would require customers to also set up the DB – user3488765 Nov 15 '16 at 09:06
  • @user3488765 I can't imagine the slowness is caused by the engine itself (especially not for client/server engines) but more by a specific use case (we'd need more details and it's a candidate for a SQLite - specific question). Also, SQLite does not require a setup. If you distribute the driver along with your app, it [can create databases on its own](http://stackoverflow.com/questions/24178930/programmatically-create-sqlite-db-if-it-doesnt-exist). – Misza Nov 15 '16 at 09:11
  • @user3488765 There are tons and tons of applications that use databases internally, without asking their users to install or set up anything related to databases. You can hide all that complexity from your user. Of course, that is work, but I think: worth the effort. – GhostCat Nov 15 '16 at 09:19
  • In reference to what @GhostCat wrote, you might want to review [a whole question related to SQLite performance](http://stackoverflow.com/questions/1711631/improve-insert-per-second-performance-of-sqlite). Its examples are written in C but general concepts (like using transactions or some SQLite options) are language-agnostic. – Misza Nov 15 '16 at 09:24
  • @Misza The average update times in your link are ~10 seconds; I'm in the ms range for the same amount of data. Why is it hard to believe that a custom solution performs better than a generic one? – user3488765 Nov 15 '16 at 09:33
  • @user3488765 Yes, 10 seconds total for nearly 900k rows. Not 10 seconds per row. :) Why is it hard to believe that thousands of people have already been in your spot and instead of reinventing the wheel they took a tested solution off the shelf? – Misza Nov 15 '16 at 09:38
  • @Misza Yes, but consider this: in the end these databases do the same as me: they store bits on the HD. They are bound by the same rules as I am when it comes to reading/writing to it. Why I am faster with my custom solution: I know what the data I stored represents, thus I can fill the gaps where the HD is working on fetching the data with transforming what I already got into the form I need. If I use e.g. SQL, I have to query the raw data first, and can start the transformation only once I have got the whole request. That is where that enormous speed boost comes from – user3488765 Nov 15 '16 at 09:46
  • @user3488765 True, and I can imagine your solution has better performance *for certain operations*, but you're paying with RAM for it. And now you've hit a wall (otherwise you wouldn't ask this question) and realized a limitation. Additionally, it's been pointed out that your delete operation is O(n) while databases usually perform them in O(log n) and with an O(1) memory footprint. The only thing I can imagine if you intend to stick to your solution is to (once a record is deleted) copy the subsequent ones in small chunks several bytes back and forward. But that'd be a lot of file seeking. – Misza Nov 15 '16 at 12:55
  • @Misza Yes, it's optimized for reading. And I use less RAM on reads than I would with an additional database, as I only have one buffer on the data. With SQL there would be the database-internal buffer and mine with the serialized objects. My post was more about the issue with writing, and I just wanted to know how DBs handle it; they are also bound by the same rules as I am when writing, so I was just curious about the techniques. I posted two possible solutions in my OP: one is O(1), the other is O(n/split) with split being the number of subfiles. But I am curious how it's actually done – user3488765 Nov 15 '16 at 13:11
  • @user3488765 Internally, databases organize data in "pages". A page is a fixed-size segment of data (e.g. 4 KB) that the engine writes/reads from disk as one unit (a page can contain several data rows/records, but the database never accesses them individually). Pages can be marked as used or unused (so inserting data causes new pages to be allocated to the db file and deleting causes them to be marked as freed). Pages are organized in B-Trees. That being the gist of it, I recommend googling for "database page" - you will find how different engines do it in detail. (A toy sketch of the page idea follows this thread.) – Misza Nov 15 '16 at 13:40
  • @Misza Thanks, that clears it up. Isn't that what I stated in my second bullet point? As I use 3D vectors for indexing, extending that logic to a multi-file solution should be rather straightforward – user3488765 Nov 15 '16 at 13:55
  • @user3488765 It's a mix of both, I'd say. Databases use the idea of marking pages as unused (and ready for reuse), as in your first bullet point, but the concept of an index pointing to entire pages and not individual file positions is more common with the second bullet point. However, I don't know how many records there are in your database, but if you're going to keep each one in a separate file, be aware of [file system limitations](http://stackoverflow.com/questions/466521/how-many-files-can-i-put-in-a-directory). – Misza Nov 15 '16 at 14:09
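
A toy sketch of the page/free-list idea from the comment thread above (everything here is simplified and assumed; real engines add B-Trees, caching and crash safety on top):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayDeque;
import java.util.Deque;

// Fixed-size pages, whole-page reads/writes, and a free list so deletes never shift data.
final class PagedFile {
    static final int PAGE_SIZE = 4096;

    private final RandomAccessFile file;
    private final Deque<Long> freePages = new ArrayDeque<>(); // page numbers marked unused

    PagedFile(RandomAccessFile file) {
        this.file = file;
    }

    // Writes one full page, reusing a freed page if available; returns its page number.
    long writePage(byte[] page) throws IOException {
        if (page.length != PAGE_SIZE) throw new IllegalArgumentException("must write full pages");
        long pageNo = freePages.isEmpty() ? file.length() / PAGE_SIZE : freePages.pop();
        file.seek(pageNo * PAGE_SIZE);
        file.write(page);
        return pageNo;
    }

    // Deleting just marks the page reusable; no data after it is moved.
    void freePage(long pageNo) {
        freePages.push(pageNo);
    }

    byte[] readPage(long pageNo) throws IOException {
        byte[] page = new byte[PAGE_SIZE];
        file.seek(pageNo * PAGE_SIZE);
        file.readFully(page);
        return page;
    }
}
```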