What is the difference between these two Cypher queries?

Question

I'm a bit stumped.

In my database, I have a relationship like this:

(u:User)-[r1:LISTENS_TO]->(a:Artist)<-[r2:LISTENS_TO]-(u2:User)

I want to perform a query where for a given user, I find the common artists between that user and every other user.

To give an idea of size of my database, I have about 600 users, 47,546 artists, and 184,211 relationships between users and artists.

The first query I was trying was the following:

START me=node(553314), other=node:userLocations("withinDistance:[38.89037,-77.03196,80.467]")

OPTIONAL MATCH 
    pMutualArtists=(me:User)-[ar1:LISTENS_TO]->(a:Artist)<-[ar2:LISTENS_TO]-(other:User)
WHERE
    other:User

WITH other, COUNT(DISTINCT pMutualArtists) AS mutualArtists

ORDER BY mutualArtists DESC
LIMIT 10
RETURN other.username, mutualArtists

This was taking around 20 seconds to return. The profile for this query is as follows:

+----------------------+-------+--------+------------------------+------------------------------------------------------------------------------------------------+
|             Operator |  Rows | DbHits |            Identifiers |                                                                                          Other |
+----------------------+-------+--------+------------------------+------------------------------------------------------------------------------------------------+
|      ColumnFilter(0) |    10 |      0 |                        |                                                     keep columns other.username, mutualArtists |
|              Extract |    10 |     20 |                        |                                                                                 other.username |
|      ColumnFilter(1) |    10 |      0 |                        |                                                              keep columns other, mutualArtists |
|                  Top |    10 |      0 |                        | {  AUTOINT0}; Cached(  INTERNAL_AGGREGATEb6facb18-1c5d-45a6-83bf-a75c25ba6baf of type Integer) |
|     EagerAggregation |   563 |      0 |                        |                                                                                          other |
|        OptionalMatch | 52806 |      0 |                        |                                                                                                |
|             Eager(0) |   563 |      0 |                        |                                                                                                |
|  NodeByIndexQuery(1) |   563 |    564 |           other, other |                             Literal(withinDistance:[38.89037,-77.03196,80.467]); userLocations |
|          NodeById(1) |     1 |      1 |                 me, me |                                                                          Literal(List(553314)) |
|             Eager(1) |    82 |      0 |                        |                                                                                                |
|          ExtractPath |    82 |      0 |         pMutualArtists |                                                                                                |
|            Filter(0) |    82 |     82 |                        |                                                    (hasLabel(a:Artist(1)) AND NOT(ar1 == ar2)) |
| SimplePatternMatcher |    82 |     82 | a, me, ar2, ar1, other |                                                                                                |
|            Filter(1) |     1 |      3 |                        |               ((hasLabel(me:User(3)) AND hasLabel(other:User(3))) AND hasLabel(other:User(3))) |
|  NodeByIndexQuery(1) |   563 |    564 |           other, other |                             Literal(withinDistance:[38.89037,-77.03196,80.467]); userLocations |
|          NodeById(1) |     1 |      1 |                 me, me |                                                                          Literal(List(553314)) |
+----------------------+-------+--------+------------------------+------------------------------------------------------------------------------------------------+

I was frustrated. It didn't seem like this should take 20 seconds.

I came back to the problem later on, and tried debugging it from the start.

I started to break down the query, and I noticed I was getting much faster results. Without the Neo4J Spatial query, I was getting results in about 1.5 seconds.

I finally added things back, and ended up with the following query:

START u=node(553314), u2=node:userLocations("withinDistance:[38.89037,-77.03196,80.467]")

OPTIONAL MATCH 
    pMutualArtists=(u:User)-[ar1:LISTENS_TO]->(a:Artist)<-[ar2:LISTENS_TO]-(u2:User)
WHERE
    u2:User

WITH u2, COUNT(DISTINCT pMutualArtists) AS mutualArtists

ORDER BY mutualArtists DESC
LIMIT 10
RETURN u2.username, mutualArtists

This query returns in 4240 ms. A 5X improvement! The profile for this query is as follows:

+----------------------+-------+--------+--------------------+------------------------------------------------------------------------------------------------+
|             Operator |  Rows | DbHits |        Identifiers |                                                                                          Other |
+----------------------+-------+--------+--------------------+------------------------------------------------------------------------------------------------+
|      ColumnFilter(0) |    10 |      0 |                    |                                                        keep columns u2.username, mutualArtists |
|              Extract |    10 |     20 |                    |                                                                                    u2.username |
|      ColumnFilter(1) |    10 |      0 |                    |                                                                 keep columns u2, mutualArtists |
|                  Top |    10 |      0 |                    | {  AUTOINT0}; Cached(  INTERNAL_AGGREGATEbdf86ac1-8677-4d45-967f-c2dd594aba49 of type Integer) |
|     EagerAggregation |   563 |      0 |                    |                                                                                             u2 |
|        OptionalMatch | 52806 |      0 |                    |                                                                                                |
|             Eager(0) |   563 |      0 |                    |                                                                                                |
|  NodeByIndexQuery(1) |   563 |    564 |             u2, u2 |                             Literal(withinDistance:[38.89037,-77.03196,80.467]); userLocations |
|          NodeById(1) |     1 |      1 |               u, u |                                                                          Literal(List(553314)) |
|             Eager(1) |    82 |      0 |                    |                                                                                                |
|          ExtractPath |    82 |      0 |     pMutualArtists |                                                                                                |
|            Filter(0) |    82 |     82 |                    |                                                    (hasLabel(a:Artist(1)) AND NOT(ar1 == ar2)) |
| SimplePatternMatcher |    82 |     82 | a, u2, u, ar2, ar1 |                                                                                                |
|            Filter(1) |     1 |      3 |                    |                      ((hasLabel(u:User(3)) AND hasLabel(u2:User(3))) AND hasLabel(u2:User(3))) |
|  NodeByIndexQuery(1) |   563 |    564 |             u2, u2 |                             Literal(withinDistance:[38.89037,-77.03196,80.467]); userLocations |
|          NodeById(1) |     1 |      1 |               u, u |                                                                          Literal(List(553314)) |
+----------------------+-------+--------+--------------------+------------------------------------------------------------------------------------------------+

And, to prove that I ran them both in a row and got very different results:

neo4j-sh (?)$ START u=node(553314), u2=node:userLocations("withinDistance:[38.89037,-77.03196,80.467]")
>
> OPTIONAL MATCH
>     pMutualArtists=(u:User)-[ar1:LISTENS_TO]->(a:Artist)<-[ar2:LISTENS_TO]-(u2:User)
> WHERE
>     u2:User
>
> WITH u2, COUNT(DISTINCT pMutualArtists) AS mutualArtists
> ORDER BY mutualArtists DESC
> LIMIT 10
> RETURN u2.username, mutualArtists
> ;
+------------------------------+
| u2.username  | mutualArtists |
+------------------------------+
| "573904765"  | 644           |
| "28600291"   | 601           |
| "1092510304" | 558           |
| "1367963461" | 521           |
| "1508790199" | 455           |
| "1335360028" | 447           |
| "18200866"   | 444           |
| "1229430376" | 435           |
| "748318333"  | 434           |
| "5612902"    | 431           |
+------------------------------+
10 rows
4240 ms
neo4j-sh (?)$ START me=node(553314), other=node:userLocations("withinDistance:[38.89037,-77.03196,80.467]")
>
> OPTIONAL MATCH
>     pMutualArtists=(me:User)-[ar1:LISTENS_TO]->(a:Artist)<-[ar2:LISTENS_TO]-(other:User)
> WHERE
>     other:User
>
> WITH other, COUNT(DISTINCT pMutualArtists) AS mutualArtists
> ORDER BY mutualArtists DESC
> LIMIT 10
> RETURN other.username, mutualArtists;
+--------------------------------+
| other.username | mutualArtists |
+--------------------------------+
| "573904765"    | 644           |
| "28600291"     | 601           |
| "1092510304"   | 558           |
| "1367963461"   | 521           |
| "1508790199"   | 455           |
| "1335360028"   | 447           |
| "18200866"     | 444           |
| "1229430376"   | 435           |
| "748318333"    | 434           |
| "5612902"      | 431           |
+--------------------------------+
10 rows
20418 ms

Unless I have gone crazy, the only difference between these two queries is the names of the nodes (I've changed "me" to "u" and "other" to "u2").

Why does that cause a 5X improvement??!?!

If anyone has any insight into this, I would be eternally grateful.

Thanks,

-Adam

EDIT 8.1.14

Based on @ulkas's suggestion, I tried simplifying the query.

The results were:

START u=node(553314), u2=node:userLocations("withinDistance:[38.89037,-77.03196,80.467]")
OPTIONAL MATCH pMutualArtists=(u:User)-[ar1:LISTENS_TO]->(a:Artist)<-[ar2:LISTENS_TO]-(u2:User)
RETURN u2.username, COUNT(DISTINCT pMutualArtists) as mutualArtists
ORDER BY mutualArtists DESC
LIMIT 10

~4 seconds

START me=node(553314), other=node:userLocations("withinDistance:[38.89037,-77.03196,80.467]")
OPTIONAL MATCH pMutualArtists=(me:User)-[ar1:LISTENS_TO]->(a:Artist)<-[ar2:LISTENS_TO]-(other:User)
RETURN other.username, COUNT(DISTINCT pMutualArtists) as mutualArtists
ORDER BY mutualArtists DESC
LIMIT 10

~20 seconds

So bizarre. It seems as though literally the named nodes of "other" and "me" cause the query time to jump tremendously. I'm very confused.

Thanks, -Adam

score 1 · Answer 1 · answered Aug 01 '14 at 06:22

1

That sounds like you're seeing the effect of caching. Upon the first access the cache is not populated. Subsequent queries hitting the same graph will be much faster since the nodes/relationships are already available in the cache.

answered Aug 01 '14 at 06:22

Stefan Armbruster

39,465
6
87
97

1

Thanks for your response. This is exactly what I thought it was. But I can run the long query (the one with me and other) many times in a row and get the same results, and running the other query several times in a row yields the same results as well. Also, these results persist between reboots of the server. – Adam Miskiewicz Aug 01 '14 at 14:24

score 1 · Answer 2 · answered Aug 01 '14 at 10:31

1

working with OPTIONAL MATCH following WHERE other:User has no sense, since the end node other (u2) must be match. try to perform the queries without optional match and where and without the last with, simply

START me=node(553314), other=node:userLocations("withinDistance[38.89037,-77.03196,80.467]")
 MATCH
     pMutualArtists=(me:User)-[ar1:LISTENS_TO]->(a:Artist)<-[ar2:LISTENS_TO]-(other:User)   
 RETURN other.username, count(DISTINCT pMutualArtists) as mutualArtists
 ORDER BY mutualArtists DESC
 LIMIT 10

answered Aug 01 '14 at 10:31

ulkas

5,748
5
33
47

Thanks for your response. I agree that the "WHERE" and "WITH" clauses are unnecessary. They were things I was trying. However, for my specific use case, the "OPTIONAL MATCH" is necessary because I need to return the other users who have no artists in common, which "MATCH" does not return. I will edit my post above and tell you what I've tried based on your comment...TL;DR...same problem. – Adam Miskiewicz Aug 01 '14 at 14:31
i see, i'm also confused why is it like this. watching this question. – ulkas Aug 04 '14 at 07:51

What is the difference between these two Cypher queries?

2 Answers2