
I am looking for a simple way to store and retrieve millions of XML files. Currently everything is stored directly on the filesystem, which has some performance issues.

Our requirements are:

  1. Ability to store millions of XML files in a batch process. XML files may be up to a few megabytes in size, most in the 100 KB range.
  2. Very fast random lookup by id (e.g. document URL)
  3. Accessible by both Java and Perl
  4. Available on the major Linux distros and on Windows

I did have a look at several NoSQL platforms (e.g. CouchDB, Riak and others), and while those systems look great, they seem almost like overkill:

  1. No clustering required
  2. No daemon ("service") required
  3. No clever search functionality required

Having delved deeper into Riak, I found Bitcask (see intro), which seems like exactly what I want. The basics described in the intro are really intriguing. Unfortunately, there seems to be no way to access a Bitcask repo from Java (or is there?).

So my question boils down to:

  • is the following assumption right: the Bitcask model (append-only writes, in-memory key management) is the right way to store/retrieve millions of documents? (See the sketch after this list.)
  • are there any viable alternatives to Bitcask accessible from Java? (BerkeleyDB comes to mind...)
  • (for Riak specialists) Is Riak much overhead implementation-, management- and resource-wise compared to "naked" Bitcask?
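To make that assumption concrete, here is a minimal sketch of the Bitcask model in Java. All names here are mine, not Bitcask's actual API: one append-only data file plus an in-memory "keydir" mapping each key to the offset and length of its latest value.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the Bitcask model, not Bitcask's real API:
// an append-only data file plus an in-memory keydir.
public class BitcaskSketch {
    private final RandomAccessFile dataFile;                     // append-only log
    private final Map<String, long[]> keyDir = new HashMap<>(); // key -> {offset, length}

    public BitcaskSketch(String path) throws IOException {
        dataFile = new RandomAccessFile(path, "rw");
    }

    // Writes are pure appends; older values for the same key become dead space.
    public void put(String key, byte[] value) throws IOException {
        long offset = dataFile.length();
        dataFile.seek(offset);
        dataFile.write(value);
        keyDir.put(key, new long[] { offset, value.length });
    }

    // Reads cost one in-memory lookup plus at most one disk seek.
    public byte[] get(String key) throws IOException {
        long[] entry = keyDir.get(key);
        if (entry == null) return null;
        byte[] value = new byte[(int) entry[1]];
        dataFile.seek(entry[0]);
        dataFile.readFully(value);
        return value;
    }
}
```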
KoW
    Regarding the overkill: if they are simple to use, and can perhaps be embedded, they can be a good fit too... Whatever can do more can also do less. – Nicolas Bousquet May 15 '11 at 13:46

2 Answers


I don't think that Bitcask is going to work well for your use-case. It looks like the Bitcask model is designed for use-cases where the size of each value is relatively small.

The problem is Bitcask's data file merging process. Merging involves copying all of the live values from a number of "older data files" into the "merged data file". If you've got millions of values in the region of 100 KB each, that is an insane amount of data copying: five million live values of 100 KB each already means copying roughly 500 GB per merge pass.


Note that the above assumes the XML documents are updated relatively frequently. If updates are rare and/or you can cope with a significant amount of "wasted" space, then merging may only need to be done rarely, or not at all.
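As a rough illustration of where the copying comes from, here is what a merge has to do under the model sketched in the question (hypothetical code, not Riak's actual implementation): every live value is physically read from an old data file and rewritten into the merged file.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Map;

// Hypothetical merge step: every live value gets read and rewritten in full,
// which is why large values make merging so expensive.
class MergeSketch {
    // Keydir entry: the data file holding the live value, plus its offset and length.
    record Entry(String file, long offset, int length) {}

    static void merge(Map<String, Entry> keyDir, String mergedPath) throws IOException {
        try (RandomAccessFile merged = new RandomAccessFile(mergedPath, "rw")) {
            for (Map.Entry<String, Entry> e : keyDir.entrySet()) {
                Entry live = e.getValue();
                byte[] value = new byte[live.length()];
                // Re-opening per entry is wasteful; a real merge would batch by file.
                try (RandomAccessFile old = new RandomAccessFile(live.file(), "r")) {
                    old.seek(live.offset());
                    old.readFully(value);            // read the live value...
                }
                long newOffset = merged.length();
                merged.seek(newOffset);
                merged.write(value);                 // ...and copy it wholesale
                e.setValue(new Entry(mergedPath, newOffset, live.length()));
            }
        }
    }
}
```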

Stephen C
  • Thanks for the input. Will have to test it – KoW May 16 '11 at 09:33
  • @kindofwhat - good idea. My Answer is purely based on my reading of the paper that describes how it works. Another idea would be to ask the authors ... – Stephen C May 17 '11 at 00:10
  • @StephenC I think the Bitcask model is designed for the case where the value is much bigger than the key, because Bitcask keeps all the keys in a hashtable held in memory. If the values are relatively small, you may have so many keys that they no longer all fit in memory. As you said, Bitcask has a compaction operation, and there is a trade-off between disk space and write amplification: if you are worried about write amplification, you can skip compaction or run it only when needed. And in this use case, I think changing the XML data is a rare operation. – baotiao Apr 04 '16 at 17:54

Bitcask can be appropriate for this case (large values) depending on whether or not there is a great deal of overwriting. In particular, there is no reason to merge files unless there is a great deal of wasted space, which only occurs when new values arrive with the same key as old values.

Bitcask is particularly good for this batch load case as it will sequentially write the incoming data stream straight to disk. Lookups will take one seek in most cases, although the file cache will help you if there is any temporal locality.
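For the batch-load case, that might look like the following sketch, assuming a store with a simple put(key, bytes) method like the hypothetical BitcaskSketch in the question above: walk the directory tree and stream every XML file into the log, so the disk only ever sees sequential appends.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Hypothetical batch loader: streams every XML file into an append-only store,
// so the bulk load turns into one long sequential write.
public class BatchLoad {
    public static void main(String[] args) throws IOException {
        BitcaskSketch store = new BitcaskSketch("data.bitcask");
        try (Stream<Path> paths = Files.walk(Path.of(args[0]))) {
            for (Path p : paths.filter(Files::isRegularFile).toList()) {
                // Use the document URL/path as the lookup key.
                store.put(p.toString(), Files.readAllBytes(p));
            }
        }
    }
}
```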

I am not sure on the status of a Java version/wrapper.

  • There seems to be a "native" Java [implementation](https://github.com/krestenkrab/bitcask-java) of the Bitcask API. Merging is not yet implemented, though, so it is hard to test this case with that implementation. – KoW May 17 '11 at 10:01