I need to read a set of files and break them into key-value pairs, then save each key together with the list of values collected for it on disc, much like the map-reduce paradigm. Everything is on one computer though.

I could, for example, write the different lists to different files and name each file after its key. That seems like a very poor way of doing things: to begin with, if you have a billion keys, you end up with a billion files. So obviously that is not going to work, and I will need some sort of memory mapping. I will also have different threads doing the map job, so if they write to the same buffer there will have to be some synchronization between them. If I have a key-to-buffer mapping and synchronize over the buffers, the threads shouldn't be stepping on each other's toes, so I think that part should work.

The question is how to do the mapping of the values to disc. How do I write buffers that correspond to different keys into the same file? If someone could point me in the right direction, it would be much appreciated; my knowledge of this area is quite limited. Thanks.
-
What is it you're actually trying to do? What's the application? You talk about buffering and concurrency, but haven't really given a reason to take the problem in that direction (which makes it hard to understand why you're raising those issues.) And do you actually have a billion keys? Or a million? Is that a realistic requirement for what you're trying to do? – Marvo Aug 03 '11 at 18:51
-
I can't say what the application exactly is, but it is very similar to writing a MapReduce in a single machine. Read an input from files, break it into key-value pairs, and gather the values for particular keys. All the steps will have to be done on disc, as the data is in the billions of keys. Billions, not millions. – delmet Aug 03 '11 at 19:17
-
@delmet, I don't understand why you need "key-value buffer mapping"? Values are already mapped to keys... how would a buffer be mapped? Anyway, see my answer for more details. – Kiril Aug 03 '11 at 19:43
4 Answers
From a practical standpoint, it would be easy to do this with BerkeleyDB, as Lirik suggested.
If you are more interested in theory than practice, I'd suggest that you approach this as an "external sort" operation. That is, read as much input as you can into memory, then sort by key. Write the sorted chunk out as a single file. The sorted files can then be easily merged into a single file.
Among other applications, this is the approach used by Lucene to build "inverted indexes" for searching text. The "keys" are words in documents, and the "values" are a list of documents in which the word appears. Lucene reads documents, and for each word, creates a term-to-document entry in memory. When memory is full, it writes the index segment to disk. When there are a lot of index segments on disk, they are merged into a single segment. In fact, you could also adapt Lucene's index writer to your task.
The work can be partitioned into multiple threads. However, you have to be sensitive to disk contention. Skipping around to read and write many files concurrently will slow a conventional drive down a lot. There may be opportunities to schedule some activities concurrently. You could probably read in new data from one file while you are writing the previous sorted chunk to disk, especially if the machine has two disk drives. Of course, using an SSD for temporary storage of some of the sorted segments would help immensely.
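To make the external-sort approach concrete, here is a rough sketch under some simplifying assumptions: keys and values are plain strings containing no tabs or newlines, each sorted chunk is written as a tab-separated run file, and a k-way merge then emits each key once with its comma-joined value list. The file names and record layout are placeholders, not a recommendation.

import java.io.*;
import java.nio.charset.StandardCharsets;
import java.nio.file.*;
import java.util.*;

public class ExternalSortSketch {

    /** One open run file plus its current, not yet consumed line ("key<TAB>value"). */
    static final class Run {
        final BufferedReader reader;
        String line;
        Run(Path path) throws IOException {
            reader = Files.newBufferedReader(path, StandardCharsets.UTF_8);
            line = reader.readLine();
        }
        String key()   { return line.substring(0, line.indexOf('\t')); }
        String value() { return line.substring(line.indexOf('\t') + 1); }
        boolean advance() throws IOException { return (line = reader.readLine()) != null; }
    }

    /** Sort an in-memory chunk of (key, value) pairs by key and write it out as one run file. */
    static Path writeSortedRun(List<String[]> chunk, int runId) throws IOException {
        chunk.sort(Comparator.comparing((String[] kv) -> kv[0]));
        Path run = Paths.get("run-" + runId + ".tmp"); // placeholder location for temporary runs
        try (BufferedWriter out = Files.newBufferedWriter(run, StandardCharsets.UTF_8)) {
            for (String[] kv : chunk) {
                out.write(kv[0] + "\t" + kv[1]);
                out.newLine();
            }
        }
        return run;
    }

    /** k-way merge of sorted runs: all values of a key come out together as one output line. */
    static void mergeRuns(List<Path> runPaths, Path output) throws IOException {
        PriorityQueue<Run> heap = new PriorityQueue<>(Comparator.comparing(Run::key));
        for (Path p : runPaths) {
            Run r = new Run(p);
            if (r.line != null) heap.add(r); else r.reader.close();
        }
        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            String currentKey = null;
            StringJoiner values = null;
            while (!heap.isEmpty()) {
                Run r = heap.poll();
                if (!r.key().equals(currentKey)) {
                    if (currentKey != null) { out.write(currentKey + "\t" + values); out.newLine(); }
                    currentKey = r.key();
                    values = new StringJoiner(",");
                }
                values.add(r.value());
                if (r.advance()) heap.add(r); else r.reader.close();
            }
            if (currentKey != null) { out.write(currentKey + "\t" + values); out.newLine(); }
        }
    }
}

A mapper thread would accumulate pairs into a chunk, call writeSortedRun whenever memory gets tight, and a final mergeRuns pass produces the (key, list of values) file; as noted above, reading the next chunk can overlap with writing the previous run if the drives allow it.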
-
Is that what he was asking when he said "how key-value mapping on disc can be accomplished"? In any case, yes, that's exactly how you would do it! The other thing to mention is that once you do the external sorting, then you can update the database with the contents of the segment and get rid of the segment. – Kiril Aug 03 '11 at 20:15
-
Yes, this is what I was talking about. Sorry for not being clear enough. This is also the way things are done in Hadoop, I believe, and why the keys need to be comparable. I was wondering whether one could accomplish this in only one step, but I don't think it is possible. You need to sort and merge. – delmet Aug 03 '11 at 20:32
I think Oracle's Berkeley DB might be just the thing for you:
Berkeley DB is designed to store data as opaque byte arrays of data in key/value pairs indexed in one of the available access methods.
Berkeley is very robust, mature and fast, but if you want to go with a more lightweight approach then use SQLite.
Another option is to use Google's LevelDB; it's written in C++ but there are Java wrappers around it. LevelDB is mind-numbingly fast and very lightweight!
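To illustrate how the (key, list of values) layout can be delegated to the store, here is a rough sketch using Berkeley DB Java Edition with sorted duplicates, so that many values can live under one key in a single set of files; the environment path, database name, and sample data are placeholders:

import com.sleepycat.je.*;

import java.io.File;
import java.nio.charset.StandardCharsets;

public class BdbMultiValueSketch {

    public static void main(String[] args) throws Exception {
        // Open (or create) an on-disk environment; the directory must exist.
        File home = new File("/tmp/bdb-env"); // placeholder path
        home.mkdirs();
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        Environment env = new Environment(home, envConfig);

        // Sorted duplicates let one key hold many values, clustered together on disc.
        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);
        dbConfig.setSortedDuplicates(true);
        Database db = env.openDatabase(null, "key-values", dbConfig);

        // "Map" side: every emitted pair is a put; duplicates accumulate under the key.
        put(db, "apple", "doc1");
        put(db, "apple", "doc7");
        put(db, "banana", "doc2");

        // "Reduce" side: walk all values stored under one key with a cursor.
        DatabaseEntry key = new DatabaseEntry("apple".getBytes(StandardCharsets.UTF_8));
        DatabaseEntry value = new DatabaseEntry();
        Cursor cursor = db.openCursor(null, null);
        try {
            OperationStatus status = cursor.getSearchKey(key, value, LockMode.DEFAULT);
            while (status == OperationStatus.SUCCESS) {
                System.out.println("apple -> " + new String(value.getData(), StandardCharsets.UTF_8));
                status = cursor.getNextDup(key, value, LockMode.DEFAULT);
            }
        } finally {
            cursor.close();
        }

        db.close();
        env.close();
    }

    private static void put(Database db, String k, String v) {
        db.put(null,
               new DatabaseEntry(k.getBytes(StandardCharsets.UTF_8)),
               new DatabaseEntry(v.getBytes(StandardCharsets.UTF_8)));
    }
}

The point of the sketch is that the store, not your code, decides how keys and their value lists are laid out inside one set of files; SQLite or LevelDB would look similar in spirit.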
Without having any more details on your project, I can only say:
- With all of these solutions the key/value pairs will be stored in the same file (multiple instances can store to separate files if necessary, but I don't see why that would be needed).
- BerkeleyDB and LevelDB have really good caching and mapping capabilities.
- BDB and LDB also allow for compression (not sure if SQLite does too).
- Depending on your key distribution (e.g. if you use a good hash function like Google's CityHash), you may achieve really good data locality and so reduce table scans.
- You should probably write your own thread safe buffer(s) and you should avoid having multiple threads write to BDB/LDB since these solutions are disk-based and you generally don't want multi-threaded disk I/O operations.
Critique: I'm not sure what you mean by "key-value buffer mapping"... are you mapping a buffer to each key? Why do you need that?

-
This would definitely work, but it would be nice to know how the key-value mapping on disc can be accomplished. I suppose this is not a trivial issue. – delmet Aug 03 '11 at 19:39
-
@delmet I'm not sure I understand what you're asking here: the values are mapped to the keys as part of the underlying structure of the database and the database is stored on disk (not in memory). Inserts, updates, reads, etc, are all done on a given key(s). The key-value mapping on disc is accomplished when you map a value to a key. Does that make sense? – Kiril Aug 03 '11 at 19:48
-
I was talking about what Ericson above was talking about. See my comment there. Thanks for the help. – delmet Aug 03 '11 at 20:34
Chronicle Map should be a good solution for this problem.
It is generally very efficient in terms of both operation speed and memory consumption, and in particular it is much faster than BerkeleyDB, suggested above.
Chronicle Map is a segmented store and allows parallel processing of the segments, e.g.:
for (int i = 0; i < chronicleMap.segments(); i++) {
    int segmentIndex = i;
    executor.submit(() -> {
        chronicleMap.segmentContext(segmentIndex).forEachSegmentEntry(entry -> {
            // do processing with entry.key() and entry.value(),
            // value() could be a List or some Iterator-like abstraction
        });
    });
}
See the MapSegmentContext Javadocs.
However, having (logically) multiple values per key cannot always be handled efficiently with Chronicle Map. In your case, though, if you just need to process a static set of values for each key rather than add and remove values, it could work well.
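For completeness, the loop above assumes a persisted map and an executor already exist. A minimal setup might look roughly like this; the entry count, average key/value samples, file name, and pool size are placeholders, and the value is modelled as a flat CharSequence rather than a real list type:

import net.openhft.chronicle.map.ChronicleMap;

import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ChronicleMapSetupSketch {

    public static void main(String[] args) throws Exception {
        // Persisted, off-heap map; Chronicle Map needs the entry count and average
        // key/value sizes up front to size its segments (the values here are placeholders).
        ChronicleMap<CharSequence, CharSequence> chronicleMap = ChronicleMap
                .of(CharSequence.class, CharSequence.class)
                .entries(1_000_000_000L)
                .averageKey("some-representative-key")
                .averageValue("value1,value2,value3")
                .createPersistedTo(new File("key-to-values.dat"));

        // Worker pool for the per-segment processing loop shown above.
        ExecutorService executor = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());

        // ... submit the per-segment tasks here, then shut down:
        executor.shutdown();
        chronicleMap.close();
    }
}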