I'm defining the relationship between two entities, Gene and Chromosome, in what I think is the simple and normal way, after importing the data from CSV:

MATCH (g:Gene),(c:Chromosome)
WHERE g.chromosomeID = c.chromosomeID
CREATE (g)-[:PART_OF]->(c);

Yet, when I do so, neo4j (browser UI) complains:

This query builds a cartesian product between disconnected patterns. If a part of a query contains multiple disconnected patterns, this will build a cartesian product between all those parts. This may produce a large amount of data and slow down query processing. While occasionally intended, it may often be possible to reformulate the query that avoids the use of this cross product, perhaps by adding a relationship between the different parts or by using OPTIONAL MATCH (identifier is: (c)).

I don't see what the issue is. chromosomeID is a very straightforward foreign key.

cybersam
Sam Hokin
  • Could you give an example of a particular match and their IDs? I'm trying to visualize the relationships you're creating. – jgloves Oct 26 '15 at 18:13
  • Also, do you have any other types of relationships besides [:PART_OF]? – jgloves Oct 26 '15 at 18:17
  • This is just a warning, and in your case there is nothing to do (the nodes have no relationship between them yet, and that is exactly what this query creates!). Warnings were added to the neo4j browser in 2.3 to notify the user of a query that is likely to perform badly. – logisima Oct 26 '15 at 19:06

2 Answers

The browser is telling you that:

  1. It is handling your query by doing a comparison between every Gene instance and every Chromosome instance. If your DB has G genes and C chromosomes, then the complexity of the query is O(GC). For instance, if we are working with the human genome, there are 46 chromosomes and maybe 25000 genes, so the DB would have to do 1150000 comparisons.
  2. You might be able to improve the complexity (and performance) by altering your query. For example, if we created an index on :Gene(chromosomeID), and altered the query so that we initially matched just on the node with the smallest cardinality (the 46 chromosomes), we would only do O(G) (or 25000) "comparisons", and those comparisons would actually be quick index lookups! This approach should be much faster.

    Once we have created the index, we can use this query:

    MATCH (c:Chromosome)
    WITH c
    MATCH (g:Gene) 
    WHERE g.chromosomeID = c.chromosomeID
    CREATE (g)-[:PART_OF]->(c);
    

    It uses a WITH clause to force the first MATCH clause to execute first, avoiding the cartesian product. The second MATCH (and WHERE) clause uses the results of the first MATCH clause and the index to quickly get the exact genes that belong to each chromosome.
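
For reference, a minimal sketch of creating the index mentioned above. The exact syntax depends on the neo4j version, so run only the form your version accepts; the index name gene_chromosome_id is made up for this example.

```cypher
// Older syntax (neo4j 2.x / 3.x)
CREATE INDEX ON :Gene(chromosomeID);

// Newer syntax (neo4j 4.x and later); gene_chromosome_id is a hypothetical name
CREATE INDEX gene_chromosome_id FOR (g:Gene) ON (g.chromosomeID);
```

The index should be in place before the relationship-creating query runs, so that the planner can use it for the lookup of genes by chromosomeID.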

[UPDATE]

The WITH clause was helpful when this answer was originally written. The Cypher planner in newer versions of neo4j (like 4.0.3) now generates the same plan even if the WITH is omitted, and without creating a cartesian product. You can always PROFILE both versions of your query to see the effect with/without the WITH.
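
A sketch of that comparison, using EXPLAIN rather than PROFILE: PROFILE actually executes the query (and so would create the relationships), while EXPLAIN only shows the plan. In the output, a CartesianProduct operator would indicate the cross product, and an index seek on :Gene(chromosomeID) would indicate that the index is being used; what you see will depend on your neo4j version and data.

```cypher
// Version with the WITH clause
EXPLAIN
MATCH (c:Chromosome)
WITH c
MATCH (g:Gene)
WHERE g.chromosomeID = c.chromosomeID
CREATE (g)-[:PART_OF]->(c);

// Version without the WITH clause
EXPLAIN
MATCH (c:Chromosome)
MATCH (g:Gene)
WHERE g.chromosomeID = c.chromosomeID
CREATE (g)-[:PART_OF]->(c);
```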

cybersam
  • Thanks! I get it! I do have indexes on all the IDs to speed things up, both primary keys and foreign keys, and there are some massive large x large queries, like ProteinMatch on Polypeptide, so the two-step match you suggest will help enormously in those cases. – Sam Hokin Oct 27 '15 at 12:41
  • For the record, this was the solution, and sped up the query massively. I guess I'm spoiled by automatic query optimization in DBs like PostgreSQL; looks like one needs to be a bit more careful in neo4j. – Sam Hokin Oct 27 '15 at 18:29
  • @cybersam - Is the keyword WITH really needed in your query? Can we just write consecutive MATCH keywords in separate lines? – CleanBold May 07 '20 at 15:25
  • @CleanBold No. But to double-check, use [PROFILE or EXPLAIN](https://neo4j.com/docs/cypher-manual/current/query-tuning/query-profile/) to compare the operations generated for alternate Cypher scripts. – cybersam May 01 '23 at 15:57

As logisima mentions in the comments, this is just a warning. Matching a cartesian product is slow. In your case it should be OK, since you want to connect previously unconnected Gene and Chromosome nodes and you know the size of the cartesian product: there are not too many chromosomes and a smallish number of genes. If you were to MATCH, e.g., genes against proteins, the query might blow up.

I think the warning is intended for other problematic queries:

  • if you MATCH a cartesian product but you don't know if there is a relationship you could use OPTIONAL MATCH
  • if you want to MATCH both a Gene and a Chromosome without any relationships, you should split up the query
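
As an illustration of the first case, here is a minimal sketch that assumes the PART_OF relationships from the question already exist for some genes. It returns every gene, together with its chromosome where one is connected.

```cypher
// c is null for genes that have no PART_OF relationship to a chromosome
MATCH (g:Gene)
OPTIONAL MATCH (g)-[:PART_OF]->(c:Chromosome)
RETURN g, c;
```

Because the second pattern is connected to g, no cross product between all genes and all chromosomes is built.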

In case your query takes too long or does not finish, here is another question with some hints on how to optimize cartesian products: How to optimize Neo4j Cypher queries with multiple node matches (Cartesian Product)

Martin Preusse