
I need to index a large number of Java properties and manifest files.

The data in the files is just key-value pairs.

I am thinking of using Lucene for this.

However, I do not need any real full-text search capabilities, as the data is quite structured. I only need to search for exact matches of property values, and the property key is always known. There is no need for tokenizing, and there is also no "default" field. The number of unique property keys could be quite large.

I should also add that I hope to be able to hold the index entirely in memory (in Lucene that would be a RAMDirectory).

So, is Lucene (primarily a full-text search engine) still a good match, or does something else fit better?

Update: A simple HashMap will not do, because I want to find the files that define property A as value B. It would need to be at least a nested HashMap to hold the triples (Key, Value, Filename).
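Roughly what I have in mind is something like this (just a sketch to illustrate the shape of the data; the class and method names are made up):

    import java.util.*;

    public class PropertyTripleIndex {
        // key -> value -> filenames that define that key=value pair
        private final Map<String, Map<String, Set<String>>> index = new HashMap<>();

        public void add(String filename, String key, String value) {
            index.computeIfAbsent(key, k -> new HashMap<>())
                 .computeIfAbsent(value, v -> new HashSet<>())
                 .add(filename);
        }

        // all files that define property 'key' as 'value'
        public Set<String> filesWith(String key, String value) {
            return index.getOrDefault(key, Collections.emptyMap())
                        .getOrDefault(value, Collections.emptySet());
        }
    }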

Thilo
  • Is the number of key-value pairs large enough to rule out in-memory hashmap? You could index them into an embedded SQL server. – akarnokd Jun 23 '09 at 10:49
  • Yeah, I thought about embedded SQL. The problem is that I cannot use keys as columns, because they are numerous and not known in advance, so it would need to be a key-value-mapping table. – Thilo Jun 23 '09 at 13:25
  • Why would you use keys as columns? Just have a (file, key, value) triplet. – akarnokd Jun 24 '09 at 21:17
  • A key-value mapping table with just triplets makes it very inefficient to do a query like "select filename where a = ? and b = ?". You have to use self-joins and you cannot build per-property indexes. – Thilo Jun 24 '09 at 22:59

4 Answers


Yes, a Lucene index with a non-tokenized field per key will do the trick. It is a bit of overkill, though; some sort of Map structure would probably be enough for what you are describing.

The main benefit of using Lucene here would be that it abstracts away the details into a fairly simple API.
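A rough sketch of what that could look like (assuming a recent Lucene version with StringField for non-tokenized fields, plus the RAMDirectory mentioned in the question; the class and field names here are just illustrative):

    import java.util.*;
    import org.apache.lucene.analysis.core.KeywordAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.*;
    import org.apache.lucene.search.*;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.RAMDirectory;

    public class PropertyIndex {
        private final Directory dir = new RAMDirectory(); // entirely in memory

        // one document per file, with one non-tokenized field per property key
        public void indexFile(String filename, Map<String, String> props) throws Exception {
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new KeywordAnalyzer()))) {
                Document doc = new Document();
                doc.add(new StringField("filename", filename, Field.Store.YES));
                for (Map.Entry<String, String> e : props.entrySet()) {
                    doc.add(new StringField(e.getKey(), e.getValue(), Field.Store.NO));
                }
                writer.addDocument(doc);
            }
        }

        // find the files that define a=b, c=d, ... (exact matches, no tokenizing)
        public List<String> findFiles(Map<String, String> criteria) throws Exception {
            BooleanQuery.Builder query = new BooleanQuery.Builder();
            for (Map.Entry<String, String> e : criteria.entrySet()) {
                query.add(new TermQuery(new Term(e.getKey(), e.getValue())), BooleanClause.Occur.MUST);
            }
            List<String> result = new ArrayList<>();
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                for (ScoreDoc sd : searcher.search(query.build(), 1000).scoreDocs) {
                    result.add(searcher.doc(sd.doc).get("filename"));
                }
            }
            return result;
        }
    }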

Sindri Traustason

I would start with a simple HashMap, and if you run into memory problems then move to something more complicated like Lucene. You'd be surprised how efficient a HashMap can be.

If you want to start really simple, just use the Properties object itself - it's a subclass of Hashtable (see HashMap vs Hashtable). You can easily use load(InputStream) to load multiple properties files into a single object, and then, if you decide to try a HashMap, switch to it using new HashMap(propertiesObject).
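Something like this, for example (just a sketch; the class and method names are made up):

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;

    public class MergedProperties {
        // load several properties files into one Properties object;
        // later files override keys defined in earlier ones
        public static Map<Object, Object> loadMerged(String... filenames) throws IOException {
            Properties props = new Properties();
            for (String name : filenames) {
                try (InputStream in = new FileInputStream(name)) {
                    props.load(in);
                }
            }
            // Properties extends Hashtable, so it can be copied straight into a HashMap
            return new HashMap<>(props);
        }
    }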

Spyder
  • The thing is that I have many Properties objects, not just one, and I want to search across properties files, not within a property file. I need to get the list of files that say "a=b". – Thilo Jun 23 '09 at 13:31
  • Ah that's a bit clearer. I guess if you can guarantee uniqueness you could read it in as text and then make each line the key and the filename the value. But I think at that point some kind of database-type solution would be better. – Spyder Jun 23 '09 at 13:58

If you don't need full-text searching, and only want to represent a large key-value map, then I suggest that Lucene is inappropriate.

I'd suggest something like EhCache, which allows you to hold a large chunk of the data in RAM, but can swap out to a disk file if it gets too large.
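Very roughly, with the Ehcache 2.x style of API that might look something like this (a sketch only; the exact constructor arguments differ between versions, and the cache and key names are made up):

    import net.sf.ehcache.Cache;
    import net.sf.ehcache.CacheManager;
    import net.sf.ehcache.Element;

    public class PropertyCacheExample {
        public static void main(String[] args) {
            CacheManager manager = CacheManager.create();
            // up to 100,000 entries in memory, overflowing to a disk file beyond that, never expiring
            Cache cache = new Cache("properties", 100000, true, true, 0, 0);
            manager.addCache(cache);

            // e.g. key = "filename|propertyKey", value = property value
            cache.put(new Element("app1.properties|db.url", "jdbc:h2:mem:test"));

            Element hit = cache.get("app1.properties|db.url");
            if (hit != null) {
                System.out.println(hit.getObjectValue());
            }
            manager.shutdown();
        }
    }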

skaffman
  • It is not really one large key-value map, more like a large collection of small key-value maps that I want to search across (not within). I want to find all the maps that say "a=b" and "c=d" for example. – Thilo Jun 23 '09 at 13:36

Take a look at jdbm - it is a lightweight, open-source object database that has a fast B+Tree implementation that should work for you. If you don't need high reliability, you can turn off the log part of the database (this makes inserts much faster, at the risk of corrupting the database if you have a power failure in the middle of a write).

We've been using jdbm in several production projects for 4 or 5 years now with some really, really big data sets.

If you can hold the entire index in memory, though, you'd probably be better off using a TreeMap (or multiple TreeMaps if you need to also do reverse indexing), and just serialize it if you need to save to disk.
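For the serialization part, something like this would do (just a sketch; the names are illustrative, and the map's contents need to be Serializable):

    import java.io.*;
    import java.util.TreeMap;

    public class IndexPersistence {
        // write an in-memory TreeMap-based index to disk using plain Java serialization
        public static void save(TreeMap<String, ?> index, File file) throws IOException {
            try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
                out.writeObject(index);
            }
        }

        @SuppressWarnings("unchecked")
        public static <V> TreeMap<String, V> load(File file) throws IOException, ClassNotFoundException {
            try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
                return (TreeMap<String, V>) in.readObject();
            }
        }
    }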

Kevin Day
  • +1 for an embedded db. From the comments it seems that a classical back-forth indexed table is enough for the use cases. – akarnokd Jun 24 '09 at 21:31