Apache Geode - Query performance on joins

Question

I am using Apache Geode as a caching solution. I have a requirement to store data within 2 different regions and retrieve them with a simple join query.

I have tried both replicated and partitioned regions, but the query takes a long time to return results. Adding indexes on both regions improved the performance, but it is still not fast enough. How can I improve the performance of this query?

Here is what I have tried:

Example 1 - PARTITIONED REGIONS

Time taken to retrieve about 7300 records from the cache was 36 seconds

Configuration in cache.xml

<region name="Department">
    <region-attributes>
        <partition-attributes redundant-copies="1">
        </partition-attributes>
    </region-attributes>
    <index name="deptIndex" from-clause="/Department" expression="deptId"/>
</region>

<region name="Employee">
    <region-attributes>
        <partition-attributes redundant-copies="1" colocated-with="Department">
        </partition-attributes>
    </region-attributes>
    <index name="empIndex" from-clause="/Employee" expression="deptId"/>
</region>

QueryFunction

@Override
public void execute(FunctionContext context) {
    // TODO Auto-generated method stub
    Cache cache = CacheFactory.getAnyInstance();
    QueryService queryService = cache.getQueryService();

    ArrayList arguments = (ArrayList)context.getArguments();
    String queryStr = (String)arguments.get(0);

    Query query = queryService.newQuery(queryStr);

    try {
        SelectResults result = (SelectResults)query.execute((RegionFunctionContext)context);

        ArrayList arrayResult = (ArrayList)result.asList();
        context.getResultSender().sendResult(arrayResult);
        context.getResultSender().lastResult(null);
    } catch (FunctionDomainException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (TypeMismatchException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (NameResolutionException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (QueryInvocationTargetException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}

Executing the function

Function function = new QueryFunction();
String queryStr = "SELECT * FROM /Department d, /Employee e WHERE d.deptId=e.deptId";
ArrayList argList = new ArrayList();
argList.add(queryStr);
Object result = FunctionService.onRegion(CacheFactory.getAnyInstance().getRegion("Department")).withArgs(argList).execute(function).getResult();

ArrayList resultList = (ArrayList)result;
ArrayList<StructImpl> finalList = (ArrayList)resultList.get(0);

Example 2 - REPLICATED REGIONS

Time taken to retrieve about 7300 records from cache was 29 seconds

Configuration in cache.xml

<region name="Department">
    <region-attributes refid="REPLICATE">
    </region-attributes>
    <index name="deptIndex" from-clause="/Department" expression="deptId"/>
</region>

<region name="Employee">
    <region-attributes refid="REPLICATE">
    </region-attributes>
    <index name="empIndex" from-clause="/Employee" expression="deptId"/>
</region>

Query

@Override
public SelectResults fetchJoinedDataForIndex() {
    QueryService queryService = getClientcache().getQueryService();
    Query query = queryService.newQuery("SELECT * FROM /Department d, /Employee e WHERE d.deptId=e.deptId");
    SelectResults result = null;
    try {
        result = (SelectResults)query.execute();
        System.out.println(result.size());
    } catch (FunctionDomainException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (TypeMismatchException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (NameResolutionException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (QueryInvocationTargetException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return result;
}

Swapnil · Accepted Answer · 2016-06-30T02:45:54.207

Can you please describe your domain objects? What are the keys and values in the Employee and Department regions? Are you using PDX?

One simple approach is to make deptId the key of the Department region. Then, in your function, you can iterate over the Employee region and do a get(deptId) on the Department region. To reduce latency further, you can send chunks of results back to the client while the server keeps running the function. Since you mention that the result has more than 7000 entries, you can send batches of 500 at a time from the server. Something like this:

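@Override
public void execute(FunctionContext context) {
  RegionFunctionContext rfc = (RegionFunctionContext) context;
  // Assumes the function is invoked with onRegion(Employee), so getDataSet() is the Employee region.
  Region<EmpId, PdxInstance> employee = PartitionRegionHelper.getLocalPrimaryData(rfc.getDataSet());
  // Department is fetched by name and is assumed to be keyed by deptId.
  Region<DeptId, PdxInstance> department = CacheFactory.getAnyInstance().getRegion("Department");
  int count = 0;
  Map<PdxInstance, PdxInstance> results = new HashMap<>();
  for (Map.Entry<EmpId, PdxInstance> e : employee.entrySet()) {
    // Key lookup replaces the join: one get() per employee.
    PdxInstance dept = department.get(e.getValue().getField("deptId"));
    results.put(e.getValue(), dept);
    count++;
    if (count == 500) {
      // Send a full batch, then start a fresh map so the sent batch is not modified.
      context.getResultSender().sendResult(results);
      results = new HashMap<>();
      count = 0;
    }
  }
  // The remaining entries (possibly none) go out as the last result.
  context.getResultSender().lastResult(results);
}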
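
The batching logic can also be tried outside a cluster. Below is a minimal sketch that uses plain java.util maps in place of the two regions. BatchedLookupJoin, join, and the Consumer standing in for the result sender are hypothetical names for illustration, not Geode APIs, and the sample data is made up.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

public class BatchedLookupJoin {

    /**
     * Joins employees to departments by key lookup and hands the result
     * to the sender in batches of at most batchSize pairs.
     * employeeDeptIds maps empId -> deptId; departments maps deptId -> name.
     */
    public static void join(Map<Integer, Integer> employeeDeptIds,
                            Map<Integer, String> departments,
                            int batchSize,
                            Consumer<Map<Integer, String>> sender) {
        Map<Integer, String> batch = new HashMap<>();
        for (Map.Entry<Integer, Integer> e : employeeDeptIds.entrySet()) {
            // One hash lookup per employee instead of a join query.
            batch.put(e.getKey(), departments.get(e.getValue()));
            if (batch.size() == batchSize) {
                sender.accept(batch);
                batch = new HashMap<>(); // fresh map; the sent one is not reused
            }
        }
        sender.accept(batch); // final, possibly smaller (or empty), batch
    }

    public static void main(String[] args) {
        Map<Integer, String> departments = new HashMap<>();
        departments.put(1, "Sales");
        departments.put(2, "Engineering");

        Map<Integer, Integer> employees = new HashMap<>();
        for (int empId = 0; empId < 1200; empId++) {
            employees.put(empId, 1 + empId % 2);
        }

        // Record only the batch sizes to show how the result is split up.
        List<Integer> batchSizes = new ArrayList<>();
        join(employees, departments, 500, b -> batchSizes.add(b.size()));
        System.out.println(batchSizes);
    }
}
```

Starting a new map after each send, rather than clearing the old one, keeps a batch that has already been handed over from changing underneath whoever received it.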

Then on the client you can use a custom result collector that will be able to process the results chunk-by-chunk as they arrive from the server.
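
Here is a minimal sketch of what such a collector does, in plain Java so it runs without a cluster. ChunkCollector and its onChunk callback are hypothetical names. A real client would implement Geode's ResultCollector interface and do this work in its addResult method, which the sketch only mimics in simplified form.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

public class ChunkCollector<K, V> {
    private final Map<K, V> merged = new HashMap<>();
    private final Consumer<Map<K, V>> onChunk;
    private final CountDownLatch done = new CountDownLatch(1);

    public ChunkCollector(Consumer<Map<K, V>> onChunk) {
        this.onChunk = onChunk;
    }

    /** Called once per chunk, as soon as it arrives. */
    public synchronized void addResult(Map<K, V> chunk) {
        onChunk.accept(chunk);  // process early, before the function has finished
        merged.putAll(chunk);   // also keep everything for getResult()
    }

    /** Called after the last chunk has been added. */
    public void endResults() {
        done.countDown();
    }

    /** Waits for endResults() or the timeout, then returns a copy of all pairs seen so far. */
    public Map<K, V> getResult(long timeout, TimeUnit unit) {
        try {
            done.await(timeout, unit);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        }
        synchronized (this) {
            return new HashMap<>(merged);
        }
    }

    public static void main(String[] args) {
        ChunkCollector<String, String> collector =
            new ChunkCollector<>(chunk -> System.out.println("got chunk of " + chunk.size()));

        Map<String, String> first = new HashMap<>();
        first.put("emp1", "Sales");
        first.put("emp2", "Engineering");
        Map<String, String> second = new HashMap<>();
        second.put("emp3", "Sales");

        collector.addResult(first);
        collector.addResult(second);
        collector.endResults();
        System.out.println(collector.getResult(1, TimeUnit.SECONDS).size());
    }
}
```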

The results above were obtained with Java serialization. I changed this to use PDX and the results were considerably faster: the query on replicated regions now takes 470 milliseconds with PDX. So it does look like serialization has a huge impact on performance. - Pratibha, Jun 29 '16 at 11:41
The key of my Employee objects is empId and the key of my Department objects is deptId, and I did create indexes on deptId in both the Employee and Department regions. Is there anything else I can do to improve performance? Are there other serialization methods that have been found to give better results? - Pratibha, Jun 29 '16 at 11:42
My comment was too big to fit here, so I ended up editing my original answer. - Swapnil, Jun 30 '16 at 02:46
