We have a Neo4J (5.2.0) database containing two types of nodes Topic
and Document
.
Topics and documents are linked by OCCURS_IN
relationships (:Topic)-[:OCCURS_IN]->(:Document)
and OCCURS_IN_RANDOM_SAMPLE
relationships (:Topic)-[:OCCURS_IN_RANDOM_SAMPLE]->(:Document)
where OCCURS_IN_RANDOM_SAMPLE
is a precomputed subset of OCCURS_IN
limited to 10000 per Topic/Document pair.
Topics are indexed by keyphrases.
The goal of our queries is to compute similarities between Topics based on their relationships with the documents. Our initial query was of the form:
UNWIND $topics as topic
MATCH (left:Topic {keyphrase: $left})-[:OCCURS_IN_RANDOM_SAMPLE]
->(document:Document)<-[:OCCURS_IN]
-(right:Topic {keyphrase: topic})
WITH right.keyphrase as right,
left.support as lsupport,
right.support as rsupport,
toFloat(count(document)) as intersection
...
with the constraint that the support of left
is lower than the support of every Topic in topics
.
Our fastest iteration of this query is:
UNWIND $topics as topic
MATCH (right:Topic {keyphrase: topic})
WITH right
MATCH (left:Topic {keyphrase: $left})-[:OCCURS_IN_RANDOM_SAMPLE]
->(document:Document)
WHERE EXISTS((document)<-[:OCCURS_IN]-(right))
WITH right.keyphrase as right,
left.support as lsupport,
right.support as rsupport,
toFloat(count(document)) as intersection
...
But even with it and the precomputed samples, it is still slow when comparing topics with supports of thousands of documents.
The bottleneck is oviously to compute this intersection count.
Is there a more efficient way to improve it? Subsidiary question: Does it normally make sense to use EXISTS
instead of matching the whole pattern or do we face a weird edge case where it works?
Thanks in advance :)