I am looking for a simple way to store and retrieve millions of XML files. Currently everything is stored as plain files in the filesystem, which has some performance issues.
Our requirements are:
- Ability to store millions of XML files in a batch process. Files may be up to a few megabytes in size, but most are in the 100 KB range.
- Very fast random lookup by id (e.g. document URL)
- Accessible by both Java and Perl
- Available on the major Linux distributions and on Windows
I had a look at several NoSQL platforms (e.g. CouchDB, Riak and others), and while those systems look great, they seem like overkill for our needs:
- No clustering required
- No daemon ("service") required
- No clever search functionality required
Having delved deeper into Riak, I found Bitcask (see intro), which seems like exactly what I want. The basics described in the intro are really intriguing. Unfortunately, there seems to be no way to access a Bitcask repo from Java (or is there?)
So my question boils down to:
- Is the following assumption correct: the Bitcask model (append-only writes, in-memory key management) is the right way to store and retrieve millions of documents?
- Are there any viable alternatives to Bitcask that are available from Java? (Berkeley DB comes to mind...)
- (For Riak specialists) Does Riak add much overhead in terms of implementation, management and resources compared to "naked" Bitcask?
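
To make the first assumption concrete, this is roughly the model I have in mind, as a minimal Java sketch. The class name AppendOnlyStore and the record layout (key length, value length, key, value) are made up for illustration and are not Bitcask's actual on-disk format. It uses a single data file and has no compaction, checksums or concurrency handling.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of "append-only writes + in-memory key index"; not Bitcask itself. */
class AppendOnlyStore implements AutoCloseable {
    // In-memory index: document id -> {offset of the value in the file, length of the value}
    private final Map<String, long[]> index = new HashMap<>();
    private final RandomAccessFile file;

    AppendOnlyStore(Path path) throws IOException {
        file = new RandomAccessFile(path.toFile(), "rw");
        rebuildIndex();
    }

    /** Scans all existing records once at startup to rebuild the in-memory index. */
    private void rebuildIndex() throws IOException {
        long pos = 0;
        long end = file.length();
        while (pos < end) {
            file.seek(pos);
            int keyLen = file.readInt();
            int valLen = file.readInt();
            byte[] key = new byte[keyLen];
            file.readFully(key);
            long valueOffset = pos + 8 + keyLen; // 8 bytes = the two int length fields
            // A later record for the same id replaces the earlier index entry.
            index.put(new String(key, StandardCharsets.UTF_8), new long[] {valueOffset, valLen});
            pos = valueOffset + valLen;
        }
    }

    /** Appends a new record at the end of the file; existing bytes are never rewritten. */
    void put(String id, byte[] value) throws IOException {
        byte[] key = id.getBytes(StandardCharsets.UTF_8);
        long pos = file.length();
        file.seek(pos);
        file.writeInt(key.length);
        file.writeInt(value.length);
        file.write(key);
        file.write(value);
        index.put(id, new long[] {pos + 8 + key.length, value.length});
    }

    /** One hash lookup, then one seek and one read; returns null for an unknown id. */
    byte[] get(String id) throws IOException {
        long[] entry = index.get(id);
        if (entry == null) {
            return null;
        }
        byte[] value = new byte[(int) entry[1]];
        file.seek(entry[0]);
        file.readFully(value);
        return value;
    }

    @Override
    public void close() throws IOException {
        file.close();
    }
}
```

In this sketch only the ids and two longs per document are held in memory, so the documents themselves stay on disk, and a lookup by URL costs a single seek and read. Overwritten documents leave dead records behind in the file, which a real implementation would have to reclaim somehow.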