I'm a bit stumped.
In my database, I have a relationship like this:
(u:User)-[r1:LISTENS_TO]->(a:Artist)<-[r2:LISTENS_TO]-(u2:User)
I want to perform a query where for a given user, I find the common artists between that user and every other user.
To give an idea of size of my database, I have about 600 users, 47,546 artists, and 184,211 relationships between users and artists.
The first query I was trying was the following:
START me=node(553314), other=node:userLocations("withinDistance:[38.89037,-77.03196,80.467]")
OPTIONAL MATCH
pMutualArtists=(me:User)-[ar1:LISTENS_TO]->(a:Artist)<-[ar2:LISTENS_TO]-(other:User)
WHERE
other:User
WITH other, COUNT(DISTINCT pMutualArtists) AS mutualArtists
ORDER BY mutualArtists DESC
LIMIT 10
RETURN other.username, mutualArtists
This was taking around 20 seconds to return. The profile for this query is as follows:
+----------------------+-------+--------+------------------------+------------------------------------------------------------------------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+----------------------+-------+--------+------------------------+------------------------------------------------------------------------------------------------+
| ColumnFilter(0) | 10 | 0 | | keep columns other.username, mutualArtists |
| Extract | 10 | 20 | | other.username |
| ColumnFilter(1) | 10 | 0 | | keep columns other, mutualArtists |
| Top | 10 | 0 | | { AUTOINT0}; Cached( INTERNAL_AGGREGATEb6facb18-1c5d-45a6-83bf-a75c25ba6baf of type Integer) |
| EagerAggregation | 563 | 0 | | other |
| OptionalMatch | 52806 | 0 | | |
| Eager(0) | 563 | 0 | | |
| NodeByIndexQuery(1) | 563 | 564 | other, other | Literal(withinDistance:[38.89037,-77.03196,80.467]); userLocations |
| NodeById(1) | 1 | 1 | me, me | Literal(List(553314)) |
| Eager(1) | 82 | 0 | | |
| ExtractPath | 82 | 0 | pMutualArtists | |
| Filter(0) | 82 | 82 | | (hasLabel(a:Artist(1)) AND NOT(ar1 == ar2)) |
| SimplePatternMatcher | 82 | 82 | a, me, ar2, ar1, other | |
| Filter(1) | 1 | 3 | | ((hasLabel(me:User(3)) AND hasLabel(other:User(3))) AND hasLabel(other:User(3))) |
| NodeByIndexQuery(1) | 563 | 564 | other, other | Literal(withinDistance:[38.89037,-77.03196,80.467]); userLocations |
| NodeById(1) | 1 | 1 | me, me | Literal(List(553314)) |
+----------------------+-------+--------+------------------------+------------------------------------------------------------------------------------------------+
I was frustrated. It didn't seem like this should take 20 seconds.
I came back to the problem later on, and tried debugging it from the start.
I started to break down the query, and I noticed I was getting much faster results. Without the Neo4J Spatial query, I was getting results in about 1.5 seconds.
I finally added things back, and ended up with the following query:
START u=node(553314), u2=node:userLocations("withinDistance:[38.89037,-77.03196,80.467]")
OPTIONAL MATCH
pMutualArtists=(u:User)-[ar1:LISTENS_TO]->(a:Artist)<-[ar2:LISTENS_TO]-(u2:User)
WHERE
u2:User
WITH u2, COUNT(DISTINCT pMutualArtists) AS mutualArtists
ORDER BY mutualArtists DESC
LIMIT 10
RETURN u2.username, mutualArtists
This query returns in 4240 ms. A 5X improvement! The profile for this query is as follows:
+----------------------+-------+--------+--------------------+------------------------------------------------------------------------------------------------+
| Operator | Rows | DbHits | Identifiers | Other |
+----------------------+-------+--------+--------------------+------------------------------------------------------------------------------------------------+
| ColumnFilter(0) | 10 | 0 | | keep columns u2.username, mutualArtists |
| Extract | 10 | 20 | | u2.username |
| ColumnFilter(1) | 10 | 0 | | keep columns u2, mutualArtists |
| Top | 10 | 0 | | { AUTOINT0}; Cached( INTERNAL_AGGREGATEbdf86ac1-8677-4d45-967f-c2dd594aba49 of type Integer) |
| EagerAggregation | 563 | 0 | | u2 |
| OptionalMatch | 52806 | 0 | | |
| Eager(0) | 563 | 0 | | |
| NodeByIndexQuery(1) | 563 | 564 | u2, u2 | Literal(withinDistance:[38.89037,-77.03196,80.467]); userLocations |
| NodeById(1) | 1 | 1 | u, u | Literal(List(553314)) |
| Eager(1) | 82 | 0 | | |
| ExtractPath | 82 | 0 | pMutualArtists | |
| Filter(0) | 82 | 82 | | (hasLabel(a:Artist(1)) AND NOT(ar1 == ar2)) |
| SimplePatternMatcher | 82 | 82 | a, u2, u, ar2, ar1 | |
| Filter(1) | 1 | 3 | | ((hasLabel(u:User(3)) AND hasLabel(u2:User(3))) AND hasLabel(u2:User(3))) |
| NodeByIndexQuery(1) | 563 | 564 | u2, u2 | Literal(withinDistance:[38.89037,-77.03196,80.467]); userLocations |
| NodeById(1) | 1 | 1 | u, u | Literal(List(553314)) |
+----------------------+-------+--------+--------------------+------------------------------------------------------------------------------------------------+
And, to prove that I ran them both in a row and got very different results:
neo4j-sh (?)$ START u=node(553314), u2=node:userLocations("withinDistance:[38.89037,-77.03196,80.467]")
>
> OPTIONAL MATCH
> pMutualArtists=(u:User)-[ar1:LISTENS_TO]->(a:Artist)<-[ar2:LISTENS_TO]-(u2:User)
> WHERE
> u2:User
>
> WITH u2, COUNT(DISTINCT pMutualArtists) AS mutualArtists
> ORDER BY mutualArtists DESC
> LIMIT 10
> RETURN u2.username, mutualArtists
> ;
+------------------------------+
| u2.username | mutualArtists |
+------------------------------+
| "573904765" | 644 |
| "28600291" | 601 |
| "1092510304" | 558 |
| "1367963461" | 521 |
| "1508790199" | 455 |
| "1335360028" | 447 |
| "18200866" | 444 |
| "1229430376" | 435 |
| "748318333" | 434 |
| "5612902" | 431 |
+------------------------------+
10 rows
4240 ms
neo4j-sh (?)$ START me=node(553314), other=node:userLocations("withinDistance:[38.89037,-77.03196,80.467]")
>
> OPTIONAL MATCH
> pMutualArtists=(me:User)-[ar1:LISTENS_TO]->(a:Artist)<-[ar2:LISTENS_TO]-(other:User)
> WHERE
> other:User
>
> WITH other, COUNT(DISTINCT pMutualArtists) AS mutualArtists
> ORDER BY mutualArtists DESC
> LIMIT 10
> RETURN other.username, mutualArtists;
+--------------------------------+
| other.username | mutualArtists |
+--------------------------------+
| "573904765" | 644 |
| "28600291" | 601 |
| "1092510304" | 558 |
| "1367963461" | 521 |
| "1508790199" | 455 |
| "1335360028" | 447 |
| "18200866" | 444 |
| "1229430376" | 435 |
| "748318333" | 434 |
| "5612902" | 431 |
+--------------------------------+
10 rows
20418 ms
Unless I have gone crazy, the only difference between these two queries is the names of the nodes (I've changed "me" to "u" and "other" to "u2").
Why does that cause a 5X improvement??!?!
If anyone has any insight into this, I would be eternally grateful.
Thanks,
-Adam
EDIT 8.1.14
Based on @ulkas's suggestion, I tried simplifying the query.
The results were:
START u=node(553314), u2=node:userLocations("withinDistance:[38.89037,-77.03196,80.467]")
OPTIONAL MATCH pMutualArtists=(u:User)-[ar1:LISTENS_TO]->(a:Artist)<-[ar2:LISTENS_TO]-(u2:User)
RETURN u2.username, COUNT(DISTINCT pMutualArtists) as mutualArtists
ORDER BY mutualArtists DESC
LIMIT 10
~4 seconds
START me=node(553314), other=node:userLocations("withinDistance:[38.89037,-77.03196,80.467]")
OPTIONAL MATCH pMutualArtists=(me:User)-[ar1:LISTENS_TO]->(a:Artist)<-[ar2:LISTENS_TO]-(other:User)
RETURN other.username, COUNT(DISTINCT pMutualArtists) as mutualArtists
ORDER BY mutualArtists DESC
LIMIT 10
~20 seconds
So bizarre. It seems as though literally the named nodes of "other" and "me" cause the query time to jump tremendously. I'm very confused.
Thanks, -Adam