
I'm confused about the advantage of embedded key-value databases over the naive solution of just storing one file on disk per key. For example, databases like RocksDB, Badger, and SQLite use fancy data structures like B+ trees and LSM trees, but they seem to get roughly the same performance as this simple solution.

For example, Badger (which is the fastest embedded DB for Go) takes about 800 microseconds to write an entry. In comparison, creating a new file from scratch and writing some data to it takes about 150 microseconds, with no optimization at all.

EDIT: To clarify, here's the simple key-value store implementation I'm comparing against the state-of-the-art embedded DBs: hash each key to a string filename, and store the associated value as a byte array in that file. Reads and writes are ~150 microseconds each, which is faster than Badger for single operations and comparable for batched operations. Furthermore, disk usage is minimal, since we don't store any extra structure besides the actual values.
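Concretely, the whole "database" is something along these lines (a rough sketch; the SHA-256-to-hex filename scheme and the flat directory are just my choices, nothing standard):

```go
package filestore

import (
	"crypto/sha256"
	"encoding/hex"
	"io/ioutil"
	"os"
	"path/filepath"
)

// Store keeps one file per key: the filename is a hash of the key and the
// file contents are the raw value bytes. No index, no log, no sorting.
type Store struct {
	dir string
}

func New(dir string) (*Store, error) {
	if err := os.MkdirAll(dir, 0755); err != nil {
		return nil, err
	}
	return &Store{dir: dir}, nil
}

func (s *Store) path(key []byte) string {
	sum := sha256.Sum256(key)
	return filepath.Join(s.dir, hex.EncodeToString(sum[:]))
}

// Put overwrites the key's file with the value.
func (s *Store) Put(key, value []byte) error {
	return ioutil.WriteFile(s.path(key), value, 0644)
}

// Get reads the key's file back; a missing file means the key is absent.
func (s *Store) Get(key []byte) ([]byte, error) {
	return ioutil.ReadFile(s.path(key))
}
```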

I must be missing something here, because the solutions people actually use are heavily engineered, with optimizations like Bloom filters and B+ trees.

rampatowl
  • A database is not just a matter of writing content to disk, but of managing it (transactions, queries, updates, sorting, etc.). If you store your content in raw files (e.g. one file per key), you don't get those features unless you implement them yourself. Furthermore, every file has metadata that occupies disk space; if your content is smaller than the metadata, the "one file per key" solution will be inefficient. – putu Apr 10 '18 at 05:04
  • Just a wild guess here: I think you will eventually hit file system bottlenecks as more and more small files are added. Also, batched writes and reads will probably outpace single-file open/write/read operations as the number of keys in each batch grows. – Aaron Qian Apr 10 '18 at 05:38

3 Answers


But Badger is not about writing "an" entry:

My writes are really slow. Why?

Are you creating a new transaction for every single key update? This will lead to very low throughput.

To get best write performance, batch up multiple writes inside a transaction using single DB.Update() call.
You could also have multiple such DB.Update() calls being made concurrently from multiple goroutines.

That leads to issue 396:

I was looking for fast storage in Go and so my first try was BoltDB. I need a lot of single-write transactions. Bolt was able to do about 240 rq/s.

I just tested Badger and I got a crazy 10k rq/s. I am just baffled

That is because:

LSM tree has an advantage compared to B+ tree when it comes to writes.
Also, values are stored separately in value log files so writes are much faster.

You can read more about the design here.
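To make the FAQ's advice concrete, the batched pattern looks roughly like this (a sketch against the v1 API that was current at the time; newer releases take the directory via badger.DefaultOptions(dir), and the path and key count here are just placeholders):

```go
package main

import (
	"fmt"
	"log"

	"github.com/dgraph-io/badger"
)

func main() {
	// Directory and key count are made up for illustration.
	opts := badger.DefaultOptions
	opts.Dir = "/tmp/badger-demo"
	opts.ValueDir = "/tmp/badger-demo"

	db, err := badger.Open(opts)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// One transaction, many writes: all the Set calls share a single
	// commit instead of paying the commit cost once per key.
	err = db.Update(func(txn *badger.Txn) error {
		for i := 0; i < 1000; i++ {
			key := []byte(fmt.Sprintf("key-%06d", i))
			val := []byte(fmt.Sprintf("value-%06d", i))
			if err := txn.Set(key, val); err != nil {
				return err
			}
		}
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```

That single shared commit is where the throughput difference against one-transaction-per-key comes from; very large batches would need to be split to stay under the transaction size limit.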


One of the main points (hard to replicate with simple reads/writes of files) is:

Key-Value separation

The major performance cost of LSM-trees is the compaction process. During compactions, multiple files are read into memory, sorted, and written back. Sorting is essential for efficient retrieval, for both key lookups and range iterations. With sorting, the key lookups would only require accessing at most one file per level (excluding level zero, where we’d need to check all the files). Iterations would result in sequential access to multiple files.

Each file is of fixed size, to enhance caching. Values tend to be larger than keys. When you store values along with the keys, the amount of data that needs to be compacted grows significantly.

In Badger, only a pointer to the value in the value log is stored alongside the key. Badger employs delta encoding for keys to reduce the effective size even further. Assuming 16 bytes per key and 16 bytes per value pointer, a single 64MB file can store two million key-value pairs.

VonC
  • Thanks for the reply! It makes sense that batching the disk writes would improve speed, but even these batched writes are slower than I would imagine. The 10k/sec you quoted is still 100 microseconds per write, which is about as slow as opening a separate file for each write. – rampatowl Apr 10 '18 at 04:46
  • @rampatowl The design document shows other advantages than just writing, though. – VonC Apr 10 '18 at 04:46
  • Still, it seems like read/write time should be the main way to compare embedded key-value databases. I'm just surprised that the state-of-the-art data structures are approximately as fast as opening up your own file for each key (100 microseconds per read/write). Am I missing something? – rampatowl Apr 10 '18 at 04:51
  • @rampatowl This is not about reading/writing files, but about entry lookup optimization. – VonC Apr 10 '18 at 04:53
  • I edited my original post with a clarification. The reason I am comparing Badger's operation time to file read/write times is that you can implement a minimal key-value store with one file per key, where each entry lookup takes as long as a single file access (with no additional data structures). – rampatowl Apr 10 '18 at 05:01

Your question assumes that the only operations needed are single random reads and writes. Those are the worst-case scenarios for log-structured merge (LSM) approaches like Badger or RocksDB. A range query, where all keys or key-value pairs in a range get returned, leverages sequential reads (thanks to the adjacency of sorted key-value pairs within files) to read data at very high speeds. With Badger, you mostly get that benefit for key-only or small-value range queries, since keys are stored in the LSM tree while large values are appended to a not-necessarily-sorted value log. With RocksDB, you get fast range queries over full key-value pairs.

The previous answer somewhat addresses the advantage on writes: buffering. Rather than storing each of many key-value pairs in a separate file, LSM approaches hold them in memory and eventually flush them out in a single file write. There's no free lunch, though: asynchronous compaction must then remove overwritten data and keep queries from having to check too many files.
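For a sense of what the key-only case looks like in code, a prefix scan in Badger is roughly the following (a sketch; the function name and prefix are made up, and the import path is the v1 one from that era):

```go
package example

import (
	"fmt"

	"github.com/dgraph-io/badger"
)

// scanKeys prints every key under a prefix without reading its value.
// With PrefetchValues = false the iterator only walks the sorted LSM
// files, never the value log, so the scan is (mostly) sequential I/O.
func scanKeys(db *badger.DB, prefix []byte) error {
	return db.View(func(txn *badger.Txn) error {
		opts := badger.DefaultIteratorOptions
		opts.PrefetchValues = false // key-only iteration
		it := txn.NewIterator(opts)
		defer it.Close()

		for it.Seek(prefix); it.ValidForPrefix(prefix); it.Next() {
			fmt.Printf("%s\n", it.Item().Key())
		}
		return nil
	})
}
```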

Bill Katz

Previously answered here. It's mostly similar to the other answers provided but makes one important additional point: two files in a filesystem can't share the same block on disk. If your records are, on average, significantly smaller than the typical disk block size (4-16 KiB), storing them as separate files will incur substantial storage overhead.
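To put rough numbers on it (assuming 4 KiB blocks and 100-byte values, both picked purely for illustration): one million records stored one per file occupy at least 1,000,000 × 4 KiB ≈ 3.8 GiB on disk, plus an inode per file, versus roughly 100 MB of actual data, i.e. about a 40× space overhead.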

Jason DeMorrow