
I'm trying to implement a simplified in-memory cached "table" where there are 2 types of indexes: primary and secondary.

  • Primary index maps a single key (primary key) to a unique value (Map interface)

  • Secondary index maps a single key to a Collection of values (Multimap fits the bill)

Very similar to a table in the RDBMS world where one has several lookup columns. Sometimes you want to search by PK, sometimes return a list of rows based on a common property. Right now there is no need for operations other than equality (=) (i.e. no range queries or pattern matching).

Add cache semantics to the above data structure (eviction, data population/cache loader, refresh etc.) and that's pretty much what is needed.
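For reference, without any cache semantics the two indexes might be sketched as below; the `User` row type with `userId` and `group` fields is hypothetical, borrowed from the example in the comments:

```java
import java.util.Collection;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import com.google.common.collect.HashMultimap;
import com.google.common.collect.Multimap;
import com.google.common.collect.Multimaps;

// Hypothetical row type: a primary key (userId) and a lookup column (group)
class User {
    final long userId;
    final String group;

    User(long userId, String group) {
        this.userId = userId;
        this.group = group;
    }
}

class UserTable {
    // primary index: userId -> unique row
    private final Map<Long, User> byId = new ConcurrentHashMap<Long, User>();
    // secondary index: group -> all rows sharing that group
    private final Multimap<String, User> byGroup =
            Multimaps.synchronizedMultimap(HashMultimap.<String, User>create());

    // every write must touch both indexes
    void put(User u) {
        byId.put(u.userId, u);
        byGroup.put(u.group, u);
    }

    User getById(long id) {
        return byId.get(id);
    }

    Collection<User> getByGroup(String group) {
        return byGroup.get(group);
    }
}
```

The open question is then how to layer eviction, loading, and refresh on top of this pair of structures.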

I would like to ask your advice on how best to approach the given problem. Should it be one Cache per index, or a Cache (for the PK) plus a (synchronized) Multimap for secondary indexes?

Any help is much appreciated.

Regards.

Andrei
  • Andrei, how helpful has the answer given been? – Kevin Bourrillion Jun 13 '12 at 05:02
  • sorry, but I cannot see how you intend to cache your data structure. As far as I can see, if you search by PK it will access your map and find the correct element in O(1) (if everything is ok), right? What do you want to cache? – Plínio Pantaleão Jun 13 '12 at 17:20
  • Hi Plinio. Let's say I try to cache users. They might have `userId` as well as `group` attributes. When I search by `userId` only 1 element will be returned since it's PK. Look-up by `group` would return a `Collection` (eg. all users in admin group). Adding a User{userId:123, group:'admin'} would update both indexes – Andrei Jun 13 '12 at 23:41
  • @kevin-bourrillion don't have a clear answer yet. – Andrei Jun 13 '12 at 23:57

3 Answers


You can replace a Map with a Guava com.google.common.cache.Cache. It doesn't support Multimap-style semantics, so you'd have to use

Cache<K, List<V>>

in that case.

For the sake of simplicity I would make the 'primary index' a subset of the secondary index - i.e. you have a single index that returns a list of values for a given key, and primary keys just return a single-element list.
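A rough sketch of that single-index approach, assuming a hypothetical key scheme of `"column:value"` strings and a stand-in loader in place of the real datastore query:

```java
import java.util.Arrays;
import java.util.List;

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

class SingleIndexSketch {
    // One cache for all lookups, keyed by "column:value"; a PK lookup
    // simply comes back as a single-element list.
    static LoadingCache<String, List<String>> newIndex() {
        return CacheBuilder.newBuilder()
                .maximumSize(1000)
                .build(new CacheLoader<String, List<String>>() {
                    @Override
                    public List<String> load(String key) {
                        // stand-in for the real datastore query
                        return Arrays.asList("row-for-" + key);
                    }
                });
    }
}
```

With this shape, `newIndex().getUnchecked("pk:123")` behaves like a primary-key lookup, while a key such as `"group:admin"` would load the whole list for that group.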

vladimir e.
  • Hi Vladimir, the question is more on how to keep those structures synchronized. A change in one Cache should imply a modification in another. Adding element by primary key should update secondary indexes etc. Since I don't need ACID properties I was hoping that something faster than in-memory RDBMS can be achieved. – Andrei Jun 13 '12 at 23:26
  • @Andrei But what kind of consistency and synchronization do you need? If you query your cache by a secondary index, do you want all corresponding rows? Probably yes. If a row gets loaded by a primary key, should it be put in the matching secondary indexes somehow? If a row gets accessed using one index, should its expiration from all others be prevented? – maaartinus Jun 14 '12 at 05:03
  • @maaartinus probably the expiration should be set only on PK + RemovalListener which does it for secondary indexes. Now I'm thinking if it makes more sense to have secondaryIndex contain Collection as values and perform a second lookup based on PK to update eviction statistics. What's your opinion ? – Andrei Jun 14 '12 at 14:05
  • My opinion is "maybe". This means two accesses and maybe twice as many cache misses. In case you have a secondary index with 100 PKs and half of them are missing, you may end up issuing 50 separate queries to the DB (sort of [the n+1 selects problem](http://stackoverflow.com/questions/97197/what-is-the-n1-selects-problem)). OTOH, storing whole rows might take much more memory and I don't see how to handle eviction. Or maybe I do... I'll check back later. – maaartinus Jun 15 '12 at 09:24
  • If you're careful enough, one can keep all indexes synchronized (with RemovalListener) ie no cache misses from secondary key look-ups – Andrei Jun 15 '12 at 14:31
  • On eviction from the primary index you could remove all corresponding entries from all secondary indexes, but is it what you want? – maaartinus Jun 16 '12 at 01:47

The challenge here is to maintain the integrity of the two indexes, regardless of whether you use two caches or one cache for the PK plus a multimap.

Maybe you should create a new cache class (say TableCache) that implements com.google.common.cache.Cache (since Cache is an interface, the natural way is to extend ForwardingCache); internally this class can maintain a multimap instance variable for the secondary index (which can be backed by a ConcurrentHashMap).

Then you can override Cache methods (put, get, invalidate etc) to keep the secondary index in sync.

Of course, you have to provide a get function to retrieve values based on secondary index.

This approach gives you the ability to maintain the integrity of primary and secondary indexes.

public class TableCache<K, V> extends ForwardingCache<K, V> {

    private final Cache<K, V> delegate;

    Map<K, List<V>> secondaryIndex = new ConcurrentHashMap<K, List<V>>();

    public TableCache(Cache<K, V> delegate) {
        this.delegate = delegate;
    }

    @Override
    protected Cache<K, V> delegate() {
        return delegate;
    }

    @Override
    public void put(K key, V value) {
        super.put(key, value);
        // Update secondaryIndex here to keep it in sync
    }

}
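The eviction propagation debated in the comments below could be wired up with a RemovalListener. A minimal sketch with hypothetical names, in which removing a row from the primary cache also drops it from the secondary index (for brevity, the cached "value" here is just the group; a real row type would need to expose its secondary key):

```java
import com.google.common.cache.Cache;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.RemovalListener;
import com.google.common.cache.RemovalNotification;
import com.google.common.collect.HashMultimap;
import com.google.common.collect.Multimap;
import com.google.common.collect.Multimaps;

class SyncedIndexes {
    // secondary index: group -> primary keys
    final Multimap<String, Long> groupToIds =
            Multimaps.synchronizedMultimap(HashMultimap.<String, Long>create());

    // primary cache: userId -> group; removals propagate to the secondary index
    final Cache<Long, String> byId = CacheBuilder.newBuilder()
            .maximumSize(1000)
            .removalListener(new RemovalListener<Long, String>() {
                @Override
                public void onRemoval(RemovalNotification<Long, String> n) {
                    // drop this primary key from its group's entry
                    groupToIds.remove(n.getValue(), n.getKey());
                }
            })
            .build();

    void put(long userId, String group) {
        byId.put(userId, group);
        groupToIds.put(group, userId);
    }
}
```

This keeps the secondary index consistent with the primary cache's contents, at the cost of making eviction decisions purely by PK access.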
sperumal
  • Sperumal, this is closer to what I need, however there are still some issues to be solved. What to do when I want to keep only a subset of elements in memory? Probably use a CacheLoader. The problem is that a CacheLoader loads only by PK. What happens for searches like `group=admin` on elements which are not in the LocalCache? Does it mean I have to have a CacheLoader per index (including secondary indexes)? – Andrei Jun 13 '12 at 23:56
  • In this data structure, we are trying to simulate a database algorithm. Based on the search key, whether it's a primary key or a secondary key, the algorithm should load the data into memory in order to complete the requested query. – sperumal Jun 14 '12 at 03:07
  • Comment continued ... So we need two loaders, one to load data based on the primary key and another to load based on the secondary key. With a primary key it's straightforward: if it's not in memory, it loads the key from the datastore. If the query is based on a secondary key, then the loader should be smart enough to load the data from the datastore. – sperumal Jun 14 '12 at 03:15
  • Sperumal, yes it seems that we need several CacheLoaders (Caches), however there is still one issue with the eviction policy. As @maaartinus questioned, do we remove elements from primary/secondary indexes (caches) independently, or enforce eviction just from the PK? Probably the latter. Now the implementation becomes trickier and I'm not sure if I'm abusing the Guava cache (or a K/V store in general) for such a use case – Andrei Jun 14 '12 at 14:20
  • Simpler solution is to create a CompositeCache which can maintain two caches one for primary and one for secondary. CompositeCache can listen to the changes made to individual cache and update the other. In this approach, you can keep subset of the data in memory without any complicated algorithm. – sperumal Jun 14 '12 at 14:43
  • But I still need to query by a custom attribute (PK or other). Example: get all users where `group=admin`. Does it mean that CompositeCache implements `Cache<K, Collection<V>>`? – Andrei Jun 14 '12 at 15:09
  • CompositeCache will not extend or implement Cache, but have instance variables for Primary Cache and Secondary Cache. PK queries will be forwarded to Primary Cache and 'group' queries will be forwarded to Secondary cache. – sperumal Jun 14 '12 at 15:19
  • Now the problem is with loaders for secondary indexes. I can't tell for sure if my local cache has all elements where `group=admin` (perhaps just a subset). Does it mean that I always have to hit data store for secondary look-ups ? It somewhat defeats the purpose of the cache. – Andrei Jun 15 '12 at 14:39

I have had this problem many times myself.

What would fix this problem is if Java had better STM support. It is very difficult to make non-blocking atomic data structures. The best I have seen is Multiverse.

Thus @vladimir's answer is probably the best, but I would say that the stored collections should be immutable, and you will have to retrieve the whole collection on refresh/cache miss, etc. Also, if you change one of the members of the multiset, you're going to have a tough time knowing how to update its parent and invalidate the cache.

Otherwise I would consider something like Redis for larger data sets which does support atomic operations on maps and list combinations.

Adam Gent