
For some time I have been working on creating an index for a very large data set (around 190 million records). I have a B-tree that can insert data sets (typically an object) and search for a key, and while researching how to persist the data into files on disk I came across this great article (http://www.javaworld.com/article/2076333/java-web-development/use-a-randomaccessfile-to-build-a-low-level-database.html#resources). It pretty much gives me the starting point.

There they index a String key to a binary object (blob). They use a file format divided into 3 regions: a header (stores the starting point of the indexes), an index region (stores each index and its corresponding location) and a data region (stores the data). They use a RandomAccessFile to get at the data.

How do I define a similar file format for a B-tree? All I know is that for every read made to disk, I have to fetch one node (typically one block of 512 bytes). There are many similar questions on how to persist a B-tree, but it is a little difficult to understand the big picture of why a particular design was chosen, as in this question (Persisting B-Tree nodes to RandomAccessFile - [SOLVED]). Please share your thoughts.

Anandan

2 Answers


There are plenty of open source key/value stores and full database engines - take a week off and start Googling. Even if you end up using none of them, you still need to study a representative cross section (architecture, design histories, key implementation details) to get enough of an overview of the subject matter so that you can make informed decisions and ask intelligent questions. For a brief overview, try to Google details on index file formats, both historic ones like IDX or NTX, and current ones used in various database engines.

If you want to roll your own then you might consider hitching yourself to the bandwagon of an existing format, like the dBASE variants Clipper and Visual FoxPro (my favourite). This gives you the ability to work your data with existing tools, including Total Commander plugins and whatnot. You don't need to support the full formats, just the single binary instance of the format that you choose for your project. Great for debugging, reindexing, ad hoc queries and so on. The format itself is dead simple and easy to generate even if you don't use any of the existing libraries. The index file formats aren't quite as trivial but still manageable.

If you want to roll your own from scratch then you've got quite a road ahead of you, since the basics of intra-node (intra-page) design and practice are poorly represented on the Internet and in literature. For example, some old DDJ issues contained articles about efficient key matching in connection with prefix truncation (a.k.a. 'prefix compression') and so on but I found nothing comparable out there on the 'net at the moment, except buried deeply in some research papers or source code repositories.

The single most important item here is the algorithm for searching prefix-truncated keys efficiently. Once you've got that, the rest more or less falls into place. I have found only one resource on the 'net, which is this DDJ (Dr Dobb's Journal) article:

A lot of tricks can also be gleaned from papers like

For more details and pretty much everything else you could do a lot worse than reading the following two books cover to cover (both of them!):

An alternative to the latter might be

It covers a similar spectrum and it seems to be a bit more hands-on, but it does not seem to have quite the same depth. I cannot say for certain, though (I've ordered a copy but haven't got it yet).

These books give you a complete overview over all that's involved, and they are virtually free of fat - i.e. you need to know almost everything that's in there. They will answer gazillions of questions that you didn't know you had, or that you should have asked yourself. And they cover the whole ground - from B-tree (and B+tree) basics to detailed implementation issues like concurrency, locking, page replacement strategies and so forth. And they enable you to utilise the information that is scattered over the 'net, like articles, papers, implementation notes and source code.

Having said that, I'd recommend matching the node size to the architecture's RAM page size (4 KB or 8 KB), because then you can utilise the paging infrastructure of your OS instead of running afoul of it. And you're probably better off keeping index and blob data in separate files. Otherwise you couldn't put them on different volumes, and the blob data would break the caching of the index pages in subsystems that are not part of your program (hardware, OS and so forth).

I'd definitely go with a B+tree structure instead of watering down the index pages with data as in a normal B-tree. I'd also recommend using an indirection vector (Graefe has some interesting details there) in connection with length-prefixed keys. Treat the keys as raw bytes and keep all the collation/normalisation/upper-lower nonsense out of your core engine. Users can feed you UTF8 if they want - you don't want to have to care about that, trust me.
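To make the indirection vector idea concrete, here is a minimal sketch of such a node layout in Java: a slot vector of 16-bit offsets grows up from a small header, each offset pointing into a heap of length-prefixed raw-byte keys that grows down from the end of the buffer. The header size, field widths and all names are assumptions made up for this sketch, not a prescribed format.

    import java.nio.ByteBuffer;

    // Minimal sketch of a slotted node with an indirection vector and
    // length-prefixed raw-byte keys. Illustrative only - a real node would
    // keep the slot vector sorted and handle splits, deletes and so on.
    final class SlottedNode {
        static final int NODE_SIZE = 4096;   // matches a RAM/disk page
        static final int HEADER    = 8;      // e.g. node type + key count; layout is up to you

        final ByteBuffer buf = ByteBuffer.allocate(NODE_SIZE);
        int slotCount = 0;                   // slot vector grows upward from the header
        int heapTop   = NODE_SIZE;           // key heap grows downward from the end

        // Appends a key to the heap and records its offset in the slot vector.
        // Returns false when the node is full (i.e. it needs to be split).
        boolean add(byte[] key) {
            int need    = 2 + key.length;            // 16-bit length prefix + key bytes
            int slotPos = HEADER + slotCount * 2;
            if (slotPos + 2 > heapTop - need) return false;
            heapTop -= need;
            buf.putShort(heapTop, (short) key.length);
            buf.position(heapTop + 2);
            buf.put(key);
            buf.putShort(slotPos, (short) heapTop);  // indirection: slot -> key offset
            slotCount++;
            return true;
        }

        // Reads the key referenced by a slot back out of the heap.
        byte[] keyAt(int slot) {
            int off = buf.getShort(HEADER + slot * 2) & 0xFFFF;
            int len = buf.getShort(off) & 0xFFFF;
            byte[] key = new byte[len];
            buf.position(off + 2);
            buf.get(key);
            return key;
        }
    }

Because the slots are just offsets, keeping the keys in sorted order (for binary search) only ever moves 2-byte slot entries around, never the key bytes themselves.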

There is something to be said for using only suffix truncation in internal nodes (i.e. for distinguishing between 'John Smith' and 'Lucky Luke', 'K' or 'L' work just as well as the given keys) and only prefix truncation in leaves (i.e. instead of 'John Smith' and 'John Smythe' you store 'John Smith' and 7+'ythe').

It simplifies the implementation, and gives you most of the bang that could be got. I.e. shared prefixes tend to be very common at the leaf level (between neighbouring records in index order) but not so much in internal nodes, i.e. at higher index levels. Conversely, the leaves need to store the full keys anyway and so there's nothing to truncate and throw away there, but internal nodes only need to route traffic and you can fit a lot more truncated keys in a page than non-truncated ones.
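As a rough illustration of the two schemes (assuming raw-byte keys in sorted order; the helper names are made up for this sketch):

    import java.util.Arrays;

    // Illustrative helpers for the two truncation schemes described above.
    final class KeyTruncation {

        // Suffix truncation for internal nodes: the shortest prefix of 'high'
        // that still sorts strictly above 'low', usable as a separator key.
        static byte[] shortestSeparator(byte[] low, byte[] high) {
            int i = 0;
            while (i < low.length && i < high.length && low[i] == high[i]) i++;
            return Arrays.copyOf(high, Math.min(i + 1, high.length));
        }

        // Prefix truncation for leaves: the number of leading bytes shared with
        // the preceding key in index order; store skipCount + the remaining suffix.
        static int sharedPrefixLength(byte[] previous, byte[] current) {
            int i = 0;
            while (i < previous.length && i < current.length && previous[i] == current[i]) i++;
            return i;
        }
    }

For the examples above, shortestSeparator('John Smith', 'Lucky Luke') yields 'L', and sharedPrefixLength('John Smith', 'John Smythe') yields 7, so the second leaf key can be stored as 7+'ythe'.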

Key matching against a page full of prefix-truncated keys is extremely efficient - on average you compare a lot less than one character per key - but it's still a linear scan, even with all the hopping forward based on skip counts. This limits effective page sizes somewhat, since binary search is more complicated in the face of truncated keys. Graefe has a lot of details on that. One workaround for enabling bigger node sizes (many thousands of keys instead of hundreds) is to lay out the node like a mini B-tree with two or three levels. It can make things lightning-fast (especially if you respect magic thresholds like 64-byte cache line size), but it also makes the code hugely more complicated.
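To make the skip-count hopping concrete, here is a sketch of such a linear scan. It illustrates the general idea only (it is not the DDJ algorithm verbatim), and the Entry layout is an assumption of this sketch.

    // Sketch: linear scan over a page of prefix-truncated keys. Each entry stores
    // the number of bytes it shares with its predecessor (skip) plus the remaining
    // suffix; entries are in sorted key order. (Uses a Java 16+ record.)
    final class PrefixTruncatedScan {

        record Entry(int skip, byte[] suffix) {}

        // Returns the index of the matching entry, or -1 if the key is not present.
        static int find(Entry[] entries, byte[] key) {
            int matched = 0;  // bytes of 'key' known to match the previous stored key
            for (int i = 0; i < entries.length; i++) {
                Entry e = entries[i];
                if (e.skip() > matched) continue;  // stored key still below 'key': hop over it
                if (e.skip() < matched) return -1; // stored key already above 'key': not present
                // e.skip() == matched: compare the remaining bytes
                int k = matched, s = 0;
                byte[] suf = e.suffix();
                while (s < suf.length && k < key.length && suf[s] == key[k]) { s++; k++; }
                if (s == suf.length && k == key.length) return i;            // exact match
                if (s == suf.length
                        || (k < key.length && (key[k] & 0xFF) > (suf[s] & 0xFF))) {
                    matched = k;                   // 'key' is still greater: remember progress
                    continue;
                }
                return -1;                         // stored key is greater: not present
            }
            return -1;
        }
    }

Note that matched only ever grows, which is why the scan compares far fewer than one byte per key on average - but it cannot jump around, which is exactly the binary search limitation mentioned above.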

I'd go with a simple lean and mean design (similar in scope to IDA's key/value store), or use an existing product/library, unless you are in search of a new hobby...

DarthGizka
  • Thanks so much for such a detailed discussion. Now I see why Google couldn't help me. Yes, a B+tree is much better; I am also not storing blobs because that complicates everything (so just key/value pairs). Thanks to you, I will look into the dBASE file format (it will greatly help me in debugging; I was struggling with a .dat file), and truncated keys seem like a great idea. – Anandan Aug 06 '16 at 02:07
  • @Anandan: I've refound the DDJ article and added a link, and also a link to an interesting paper about indexes in DB2. Search for information about compact Visual FoxPro indexes (CDX) to get started, and maybe get yourself a copy of Visual FoxPro 8 or later on Ebay. Have fun! – DarthGizka Aug 06 '16 at 04:44
  • @Anandan: If you are really interested then I could whip up something next weekend (I've been meaning to do that for a long time anyway). Just something basic with indirection vector, heap and prefix truncation, as a baseline for experimenting with node layouts etc. At least then I get to find out how well binary search really fares in connection with truncated key reassembly... I've conjectured that it could be beneficial for the large key counts enabled by large node sizes (e.g. 4 KByte) but guessing and knowing are two different things.;-) – DarthGizka Aug 15 '16 at 06:51
  • I would really love to get pointers like that and I really appreciate your interest. My approach is quite simple: I am using a B+tree and RandomAccessFile (maybe with a bulk load initially). One approach is on this site (https://people.cs.clemson.edu/~wayne/cpsc241/spring01/assignments/assignment3.html). There are many approaches, but your link (on using Zip - the DDJ link) is quite helpful. I haven't started coding yet (except of course some basic B+tree with files), because this topic is very much a rabbit hole: I keep digging. – Anandan Aug 16 '16 at 23:18
  • And my key size is 64 bytes; with some calculations I found I can accommodate around 58 keys (for a 4K page) if I were to use the logic given in the link I gave. – Anandan Aug 16 '16 at 23:21
  • @Anandan: should your keys be hashes (e.g. SHA256) then you won't get much mileage out of prefix truncation in leaves - better make those fixed-size arrays (binary searchable). OTOH you will get a lot of mileage out of suffix truncation in internal nodes (hundreds of shortened keys per node); you might even consider a couple levels of a trie-like structure, i.e. dispatching on byte 0, then on byte 1 and so on with arrays of 256 links (google trie and radix sorting). Below the virtual trie-like 'super root' you could have B+tree-ish internal nodes with fixed-size shortened keys, like 32 bits. – DarthGizka Aug 17 '16 at 09:01
  • @Anandan: could you add a couple of specifics about your application to your question? Like what kind of data is to be associated with the keys (fixed or variable size, small or big) and the expected operating size range (e.g. record counts between a few thousand and a billion, or between 200 million and half a billion). And also whether your keys are a fixed size or not, and whether they are hashes (hence with perfect random spread) or not. Another important thing to know is the relative frequency of updates vs queries after the initial building of the table. – DarthGizka Aug 17 '16 at 09:11
  • @Anandan: if your keys are hashes and you have sufficient hard disk space and record count of reasonable fixedness (like e.g 190 million) then you could use hashing directly, i.e. take the first 23 bits as page number and organise each page as a sorted array. Make some allowance for chaining (perhaps in a second file) but that will occur only rarely. This way you'll need no more than one disk read per lookup most of the time - it's hard to do better than that. And B-trees (including B+, B*, Blink and whatnot) all have some sparsity, so hashing looks pretty good. – DarthGizka Aug 17 '16 at 10:36
  • Thanks, and sorry, I got held up with other things. The key is the hash value of an image and the value is the name of the image, so the value is not fixed length. The idea of truncating keys seems viable, but I don't know how the hash is going to be generated, so I don't know how I can truncate in a way that fits the maximum number of keys in a node (because some might be bigger). – Anandan Aug 20 '16 at 03:09
  • And hashing looks great. But don't you think I have a better chance of reaching the hash value by traversing a tree? I still don't clearly understand how hashing gets to the keys faster than suffix-truncated keys do. I don't have enough memory (RAM) to store all the hashes (maybe not even the suffix-truncated keys), so I will have to replace pages every time. Rather than doing it ad hoc (from the top byte to the page length every loop), I thought trees would be better. But my professor said the same thing you said just now, and I don't understand how. Can you help me understand? – Anandan Aug 20 '16 at 03:18

Here is an alternative take on the question, based on problem specifics that have become known in the meantime. This post is based on the following assumptions:

  • record count about 190 million, fixed
  • keys are 64-byte hashes, like SHA-256
  • values are filenames: variable length, but sensible (average length < 64 bytes, max < page)
  • page size 4 KiByte

Efficient representation of filenames in a database is a different topic that cannot be addressed here. Should the filenames be awkward - longish on average and/or Unicode - then the hashing solution will punish you with increased disk read counts (more overflows, more chaining) or reduced average occupancy (more wasted space). A B-tree solution reacts somewhat more benignly, though, since an optimum tree can be constructed in any case.

The most efficient solution in this situation - and the simplest to implement by a wide margin - is hashing, since your keys are perfect hashes already. Take the first 23 bits of the hash as the page number, and lay out the pages like this:

page header
    uint32_t next_page;
    uint16_t key_count;
key/offset vector (one entry per key, kept sorted on the key)
    uint16_t value_offset;
    byte     key[64];

... unallocated space ...

last arrived filename
...
2nd arrived filename
1st arrived filename

Values (filenames) are stored from the end of the page downwards, prefixed with their 16-bit length, and the key/offset vector grows upwards. That way neither low/high key counts nor short/long values can cause unnecessary waste of space, as would be the case with fixed-size structures. Nor do you have to parse variable-length structures during key searches. Apart from that I've aimed for the greatest possible simplicity - no premature optimisation. The bottom of the heap can be stored in the page header, in KO[PH.key_count].value_offset (my preference), or computed as KO.Take(PH.key_count).Select(r => r.value_offset).Min(), whatever pleases you most.

The key/offset vector needs to be kept sorted on the keys so that you can use binary search but the values can be written as they arrive, they do not need to be in any particular order. If the page overflows, allocate a new one just like it at the current end of the file (growing the file by one page) and stash its page number in the appropriate header slot. This means that you can binary search within a page but all chained pages need to be read and searched one by one. Also, you do not need any kind of file header, since the file size is otherwise available and that's the only piece of global management information that needs to be maintained.
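A sketch of the corresponding lookup in Java, in the spirit of the RandomAccessFile approach from the question. The big-endian field encoding and the convention that next_page == 0 means 'no overflow page' (safe because overflow pages are appended beyond the fixed bucket range, so page 0 can never be a chain target) are assumptions of this sketch.

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.charset.StandardCharsets;

    // Sketch: look up a 64-byte hash key in the bucket file described above and
    // return the stored filename, or null if the key is absent.
    final class HashFileLookup {
        static final int PAGE_SIZE = 4096;
        static final int KEY_SIZE  = 64;
        static final int HEADER    = 4 + 2;        // next_page + key_count
        static final int SLOT_SIZE = 2 + KEY_SIZE; // value_offset + key

        static String lookup(RandomAccessFile file, byte[] key) throws IOException {
            long pageNo = firstBits(key, 23);      // bucket number = first 23 bits of the hash
            byte[] page = new byte[PAGE_SIZE];
            while (true) {
                file.seek(pageNo * PAGE_SIZE);
                file.readFully(page);
                int nextPage = readInt(page, 0);
                int keyCount = readShort(page, 4);
                int lo = 0, hi = keyCount - 1;     // binary search the sorted key/offset vector
                while (lo <= hi) {
                    int mid  = (lo + hi) >>> 1;
                    int slot = HEADER + mid * SLOT_SIZE;
                    int cmp  = compare(page, slot + 2, key);
                    if (cmp == 0) {
                        int off = readShort(page, slot);
                        int len = readShort(page, off);              // 16-bit length prefix
                        return new String(page, off + 2, len, StandardCharsets.UTF_8);
                    }
                    if (cmp < 0) lo = mid + 1; else hi = mid - 1;
                }
                if (nextPage == 0) return null;    // end of the (usually empty) overflow chain
                pageNo = nextPage;                 // rare case: search the overflow page too
            }
        }

        static long firstBits(byte[] key, int bits) {
            long v = 0;
            for (int i = 0; i < 4; i++) v = (v << 8) | (key[i] & 0xFF);
            return v >>> (32 - bits);
        }
        static int readInt(byte[] b, int p) {
            return ((b[p] & 0xFF) << 24) | ((b[p + 1] & 0xFF) << 16)
                 | ((b[p + 2] & 0xFF) << 8) | (b[p + 3] & 0xFF);
        }
        static int readShort(byte[] b, int p) { return ((b[p] & 0xFF) << 8) | (b[p + 1] & 0xFF); }
        static int compare(byte[] page, int at, byte[] key) {
            for (int i = 0; i < KEY_SIZE; i++) {
                int d = (page[at + i] & 0xFF) - (key[i] & 0xFF);
                if (d != 0) return d;
            }
            return 0;
        }
    }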

Create the file as a sparse file with the number of pages as indicated by your chosen number of hash key bits (e.g. 8388608 pages for 23 bits). Empty pages in a sparse file don't take up any disk space and read as all 0s, which works perfectly fine with our page layout/semantics. Extend the file by one page whenever you need to allocate an overflow page. Note: the 'sparse file' thing isn't very important here since almost all pages will have been written to when you're done building the file.
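One way to pre-size the bucket file from Java is sketched below. StandardOpenOption.SPARSE is only a hint (filesystems may ignore it), and as noted above sparseness hardly matters here since nearly every page gets written during the build anyway; the file name is a placeholder.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.SeekableByteChannel;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    // Sketch: create the bucket file with room for 2^23 pages of 4 KiB each.
    final class CreateBucketFile {
        public static void main(String[] args) throws IOException {
            final long PAGE_SIZE = 4096;
            final long PAGES     = 1L << 23;          // 8388608 buckets for 23 hash bits
            Path path = Paths.get("index.buckets");   // placeholder name
            try (SeekableByteChannel ch = Files.newByteChannel(path,
                    StandardOpenOption.CREATE_NEW,
                    StandardOpenOption.WRITE,
                    StandardOpenOption.SPARSE)) {
                // writing one byte at the last position fixes the file size;
                // all unwritten pages read back as zeros (key_count == 0)
                ch.position(PAGES * PAGE_SIZE - 1);
                ch.write(ByteBuffer.wrap(new byte[] { 0 }));
            }
        }
    }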

For maximum efficiency you need to run some analyses on your data. In my simulation - with random numbers as stand-ins for the hashes, and on the assumption that the average filename size is 62 bytes or less - the optimum turned out to be making 2^23 = 8388608 buckets/pages. This means that you take the first 23 bits of the hash as the page number to load. Here are the details:

# bucket statistics for K = 23 and N = 190000000 ... 7336.5 ms

average occupancy 22.6 records
0 empty buckets (min: 3 records)
310101/8388608 buckets with 32+ records (3.7%)

That keeps the chaining to a minimum: on average you need to read just 1.04 pages per search. Increasing the hash key size by one single bit to 24 reduces the expected number of overflowing pages to 3 but doubles the file size and reduces average occupancy to 11.3 records per page/bucket. Reducing the key to 22 bits means that almost all pages (98.4%) can be expected to overflow - meaning the file is virtually the same size as that for 23 bits but you have to do twice as many disk reads per search.

Hence you see how important it is to run a detailed analysis on the data to decide on the proper number of bits to use for hash addressing. You should run an analysis that uses the actual filename sizes and tracks the per-page overhead, to see what the actual picture looks like for 22 bits to 24 bits. It'll take a while to run but that's still way faster than building a multi-gigabyte file blindly and then finding that you have wasted 70% of space or that searches take significantly more than 1.05 page reads on average.
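For reference, a bare-bones version of such a simulation is sketched below: it throws N random keys into 2^K buckets and prints occupancy figures in the same shape as the statistics above. The threshold of 32 records per 4 KiB page is a rough assumption; a real analysis would use your actual filename lengths and per-page overhead instead.

    import java.util.Random;

    // Sketch: bucket occupancy simulation with random numbers standing in for hashes.
    final class BucketSimulation {
        public static void main(String[] args) {
            final int  K = 23;                   // number of hash bits used for addressing
            final long N = 190_000_000L;         // record count
            final int  OVERFLOW_AT = 32;         // assumed capacity of a 4 KiB page

            int   buckets = 1 << K;
            int[] counts  = new int[buckets];
            Random rnd = new Random(42);
            for (long i = 0; i < N; i++) counts[rnd.nextInt(buckets)]++;

            long empty = 0, overflowing = 0, min = Long.MAX_VALUE;
            for (int c : counts) {
                if (c == 0) empty++;
                if (c >= OVERFLOW_AT) overflowing++;
                min = Math.min(min, c);
            }
            System.out.printf("average occupancy %.1f records%n", (double) N / buckets);
            System.out.printf("%d empty buckets (min: %d records)%n", empty, min);
            System.out.printf("%d/%d buckets with %d+ records (%.1f%%)%n",
                    overflowing, buckets, OVERFLOW_AT, 100.0 * overflowing / buckets);
        }
    }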

Any B-tree based solution would be much more involved (read: complicated) but could not reduce the page read count per search below 1.000, for obvious reasons, and even that only on the assumption that a sufficient number of internal nodes can be kept cached in memory. If your system has such humongous amounts of RAM that data pages can be cached to a significant degree then the hashing solution will benefit just as much as one that is based on some kind of B-tree.

As much as I would like an excuse for building a screamingly fast hybrid radix/B+tree, the hashing solution delivers essentially the same performance for a tiny fraction of the effort. The only thing where B-treeish solutions can outdo hashing here is space efficiency, since it is trivial to construct an optimum tree for existing pre-sorted data.

DarthGizka
  • Thank you for such a great explanation. The metrics are really helpful. There are still certain things not clear to me. What do you mean by the bottom of the heap? Is it the unused space (to help insert the next key)? On the analysis part, if my understanding is correct, about 22 records (key, value) can be stored in a bucket or page (since the bucket id is the page number) and only 7% have more than 30 records. So on average (8388608-310101) buckets need 1 page to store their records and 310101 need two pages to store theirs, hence 1.04 pages/search. – Anandan Aug 26 '16 at 21:52
  • Those are great numbers; I can't thank you enough. I was held up with a couple of academic things, but now it's getting very interesting thanks to you, and I will start working on it. Also, I know I am asking for more, but how did you figure out the metrics? Did you use any byte reader tools? It would be great if you could point me in the right direction so that I can do some analysis myself. – Anandan Aug 26 '16 at 22:06
  • Also, there are 8 million pages; how do we locate the exact page? Does it have anything to do with the sparse file? – Anandan Aug 26 '16 at 22:46
  • Should I construct a hash table for the keys in RAM (that sounds reasonable to me)? Or is there an alternative, or is that the best way? – Anandan Aug 27 '16 at 00:33
  • @Anandan: no, not a hash table in RAM. You simply use the first 23 bits of your SHA-256, look at them sternly and call them 'page/bucket number'. ;-) Multiply by the page size and you have the file offset of the page. To get the numbers I ran simulations using random numbers as stand-ins for the hashes, which is less error-prone than trying to do a Knuth on the numbers (at least for me). With 'bottom of the heap' I meant 'bottom of the used portion of the heap', i.e. the end of the unused space between the end of the vector and the values (filenames), which are written downwards from the end of the page. – DarthGizka Aug 27 '16 at 08:07
  • Thank you. I wanted to mark it as correct after completing the coding, but I can't think of a better way. – Anandan Aug 31 '16 at 14:09
  • I have started implementing. What will the next page be if the current page overflows? Will it be 8388608 + 1 (total pages 2^23)? So I need to keep track of the total page count as global information instead of the file size. Is that right? Is there a better way? – Anandan Sep 01 '16 at 15:02
  • @Anandan: the file size is a function of the total page count, and it can be queried via suitable operating system functions. This means that there is no need to store the total page count separately on disk. Of course, you can cache the page count in a suitable global variable while you have the file open and are working on it. But the value does not need to be persisted or tracked in any way. Simply compute it from the file size whenever you need it and it isn't cached. – DarthGizka Sep 02 '16 at 07:36
  • @Anandan: please note that building the table by running 190 million updates on an empty file is not very efficient. One easy way is to determine the number of pages that can be kept in memory at the same time. Then split the input into separate piles (files) such that all records written to a file belong to a contiguous set of hash buckets that can be cached in memory. Then process the piles one by one. This avoids the '1 record, 1 page write' behaviour of the naive approach which would read-modify-write each output page dozens of times on average. – DarthGizka Sep 02 '16 at 07:43
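
A sketch of the partitioning phase of the pile-based bulk load described in the last comment: each record is routed to a pile so that all records in one pile hit a contiguous range of buckets, and each pile can later be applied with its whole page range cached in memory (so every page is written once instead of dozens of times). The input record format, pile count and file names are assumptions of this sketch.

    import java.io.BufferedInputStream;
    import java.io.BufferedOutputStream;
    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.EOFException;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;

    // Sketch: split the input into piles by bucket range before building the file.
    final class PartitionIntoPiles {
        static final int KEY_BITS   = 23;   // same addressing bits as the lookups use
        static final int KEY_SIZE   = 64;
        static final int PILE_COUNT = 64;   // choose so that one pile's pages fit in RAM

        public static void main(String[] args) throws IOException {
            DataOutputStream[] piles = new DataOutputStream[PILE_COUNT];
            for (int i = 0; i < PILE_COUNT; i++) {
                piles[i] = new DataOutputStream(new BufferedOutputStream(
                        new FileOutputStream("pile-" + i + ".rec")));
            }
            long bucketsPerPile = (1L << KEY_BITS) / PILE_COUNT;
            try (DataInputStream in = new DataInputStream(new BufferedInputStream(
                    new FileInputStream(args[0])))) {
                byte[] key = new byte[KEY_SIZE];
                while (true) {
                    try {
                        in.readFully(key);                   // assumed record: 64-byte hash ...
                    } catch (EOFException end) {
                        break;
                    }
                    String filename = in.readUTF();          // ... followed by the filename
                    long bucket = firstBits(key, KEY_BITS);
                    int  pile   = (int) (bucket / bucketsPerPile);
                    piles[pile].write(key);
                    piles[pile].writeUTF(filename);
                }
            }
            for (DataOutputStream p : piles) p.close();
        }

        static long firstBits(byte[] key, int bits) {
            long v = 0;
            for (int i = 0; i < 4; i++) v = (v << 8) | (key[i] & 0xFF);
            return v >>> (32 - bits);
        }
    }

Processing a pile is then the normal per-page insertion logic, just run against pages held in memory and flushed to their slots in the bucket file once the pile is done.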