6

I have tagged objects in a Jackrabbit repository (actually Adobe/Day CQ's CRX, but I think this is the Jackrabbit code):

  • asset: tags = A, B
    • child asset data 1: tags = A, C, E
    • child asset data 2: tags = D, E

I want to query against the union of the parent asset's set of tags and one child, i.e. "B C" would match the asset because we have those in the parent and in child 1, but "C D" would not match because there's no combination of parent and one child that matches that because C and D are split across separate child data nodes.

Is there a way to do this in Jackrabbit? We can write an XPath query

\\element(*, dam:Asset)[(@tags = 'C' or *\@tags='C')
                        and (@tags = 'D' or *\@tags='D')]

but that won't work because XPath doesn't seem to guarantee that the * joined child assets are the same, i.e. this means "any child has C/D" and so will match my asset because 1+ children have a C and 1+ children have a D. Instead I could use JCR-SQL2

SELECT * FROM dam:Asset as asset
  LEFT OUTER JOIN nt:unstructured as child ON ISCHILDNODE(child,asset)
  WHERE (asset.tags = 'C' or child.tags = 'C')
    AND (asset.tags = 'D' or child.tags = 'D')

but there's no SELECT DISTINCT in JCR-SQL2: if instead I search for "B E" I will get this asset returned twice because this matches both asset+child1 and asset+child2.

I could postprocess either query result in Java, i.e. filter out false-positive matches for the first case or filter out duplicate results for the second case, but I'm nervous how this would affect paging performance: I'd need to scan more nodes than necessary to weed out bad nodes, and I'd need to scan the lot to compute the correct result size for paging. This ought to be cheaper for the second SQL2 case because if my search is ordered I can spot duplicates based on the node path alone and all duplicates will be consecutive, so I can find a given page's worth of data with cheap scanning only hopefully without reading the whole node for each result, but I don't know the cost of scanning all results for the paging count too even for the simple path-only case.

Yet another option we've considered is denormalising the tags into a single node. In this case, to keep the search accurate, that would have to mean creating a new combined_tags attribute in each child node and perform all searches against the set of child nodes only. However this still suffers from the distinct problem should we match two child nodes beneath the same asset.

Thanks for any suggestions. This is a large instance already and will need to scale further. I've seen other questions which say ModeShape is a JCR implementation that does have SELECT DISTINCT but I think switching to ModeShape just for that would have to be the last resort, if indeed it's possible to host CQ on ModeShape.


One idea we've come up with now is to compute each union of the asset tags and child tags and combine the tags into a single string then write each value as a multivalued property of the asset, i.e. asset + child1 = "A B C E" and asset + child2 = "A B D E", so we get

  • asset: tags = A, B; tagUnions = "A B C E", "A B D E"

As long as we define a fixed order for combining tags into a string (e.g. alphabetical) we can the search for any combination using tagUnions LIKE '%B%C%' (except I'd use proper delimiters between tags in the real case). Whilst this will work as far as we can see I don't really like it: there's potentially large numbers of tags per asset+child, all with longer names than single letters meaning we'll end up with long strings executing LIKE queries on all of them which likely can't be indexed efficiently.

Another take on this is to make a bitmask: define A=1, B=2 etc. and so store a multivalued integer array here then carry out a bitwise comparison. However that's likely limited to 64 different tags and since we have 1,000+ I don't think we can do this - even if JCR supports bitwise operations, which I'd expect it won't.

So I'm still on the lookout for a clean database-like solution for this. You've missed the bounty I put up but there's still ticks, votes and gratitude for any help.

Rup
  • 33,765
  • 9
  • 83
  • 112

1 Answers1

1

From the Apache Jackrabbit mailing list:

Yes, unfortunately union queries are not supported. Any work on that area would be much appreciated.

Meanwhile the best workaround is probably to do two separate queries and to explicitly perform the union in the application code by combining the two result sets.

So, that's out as an option. Looking at the SQL you've provided:

but there's no SELECT DISTINCT in JCR-SQL2: if instead I search for "B E" I will get this asset returned twice because this matches both asset+child1 and asset+child2.

I looked at the possible solutions supported by Jackrabbit and came up empty handed. However, I agree with the solution presented here:

What I did is to do a simple SELECT with appropriated ORDER BYs... then each time I used a row, I veried that it isn't the same as the previous :-)

(Sics preserved.)

While the ORDER BY is potentially dubious unless you require database-backed sorting, is there anything preventing you from constructing a hashset in your controller to limit your results to only unique values using the JCR API?

MrGomez
  • 23,788
  • 45
  • 72
  • Thanks. It's not actually a SQL `UNION` I need in terms of a set union across two queries but I'm computing a match against a logical union of two properties across from different nodes so it's a SQL `JOIN` and `SELECT DISTINCT` that I need. The solution you link - order by and remove consecutive duplicates - is one of the ideas I mentioned in the paragraph about post-processing results, and the issue with that is getting paging right: I'd need to scan all records up to the current page to work out where the page actually starts, and scan everything to get the accurate number of total pages. – Rup Mar 29 '12 at 00:27
  • ... and the system I'm working with has millions of assets, so 10,000+ results from a simple query isn't unheard of - I can't assume I have a small number of results as the guy says he has in that linked solution. I do require database-backed sorting in order to get efficient paging I think. In any case the Jackrabbit docs recommend you use an `ORDER BY` anyway since the JCR default order (unless disabled in repository.xml) is potentially expensive to compute. – Rup Mar 29 '12 at 00:31
  • @Rup Thank you for the update. As you mention, post-processing the results in Java is possible, but may be potentially expensive as you traverse additional nodes that you otherwise have already visited. So, this comes down to a question of efficient traversal through your data structure. Hmn. I'll have to look at this later and get back to you. :) – MrGomez Mar 29 '12 at 00:39
  • @Rup Thinking about this in terms of the shape of your graph, I realize that the most efficient solutions are to construct an index specifically for your query [using tools tailored to it](http://lucene.apache.org/core/features.html), or to otherwise collapse child nodes in with the parent as a form of hash that JCR doesn't seem to expose to its query layer. So, I recommend going with Lucene. Have you looked into it already? – MrGomez Mar 29 '12 at 09:08
  • Yes, one of the alternative solutions we've got is to build a separate index for the tags in a regular database and run the search on that. That works well except that it relies on update events from CQ's publishing mechanism to know when to re-index the tags on a given asset and we're having reliability and performance problems with the Sling event queues. However technical leadership here want to reduce the number of parts we need to maintain and want a Jackrabbit-only solution for search if at all possible. (It's meant as a framework for CMSes - I'm astounded it can't do complex searches!) – Rup Mar 29 '12 at 09:17
  • And Jackrabbit already has an embedded Lucene for searches. I don't know if it's possible to access directly to do more complicated things with but I'd assumed all Lucene function would already be exposed through Jackrabbit. (If not, we may be able to implement that ourselves I guess but that might be a tough sell to management and to Adobe if we want a support contact for our customised instance.) – Rup Mar 29 '12 at 09:21
  • @Rup So were we. We use Lucene pretty heavily internally, because it's effectively the missing search layer for JCR. I respect the need to reduce moving parts... but for code maintainability reasons, it may be worth further exploring if at all possible. (And from what I understand, it still needs to be stapled to JCR, excepting a few components as [here](http://wiki.apache.org/jackrabbit/Search). I imagine ripping additional components out may be plausible, though!) – MrGomez Mar 29 '12 at 09:21
  • Thanks for your help. I still don't have a solution but the bounty's yours rather than wasting it. – Rup Apr 04 '12 at 15:26
  • @Rup I feel kind of bad about that. I think what I'm going to do here is escalate this to coworkers that've worked with Jackrabbit more heavily than I have, to see if they can find a solution that doesn't rely on Lucene. I think you have the right idea here, of satisfying some mathematical property when a _DISTINCT_ clause isn't present and your search tools are limited. But, I suspect that will reduce itself to optimizing a [unique paths depth-first search](http://stackoverflow.com/questions/58306/graph-algorithm-to-find-all-connections-between-two-arbitrary-vertices) in your current DB. – MrGomez Apr 04 '12 at 17:04
  • 1
    No, no problem - thanks for trying! We're exploring other options than Jackrabbit here so I may not need a Jackrabbit-based solution anyway. – Rup Apr 04 '12 at 17:40