Returning a count of and all complete paths in Neo4j

Question

Being an absolute noob in neo4j and having had very generous help with a previous question, I thought I'd try my luck once again as I'm still struggling.

The example scenario is that of students that enters a house and walks from one room to another. The journey doesn't have to start or end at a particular room but the order of sequence that a student enters a room is important.

What I want to find out is all the complete paths that students have taken along with a count of how many times the path in question was taken. Below is the sample data and what I've tried (thanks to the answer of a previous question along with a series of blog posts):

the file dorm.csv

ID|SID|EID|ROOM|ENTERS|LEAVES
1|1|12|BLUE|1/01/2015 11:00|4/01/2015 10:19
2|2|18|GREEN|1/01/2015 12:11|1/01/2015 12:11
3|2|18|YELLOW|1/01/2015 12:11|1/01/2015 12:20
4|2|18|BLUE|1/01/2015 12:20|5/01/2015 10:48
5|3|28|GREEN|1/01/2015 18:41|1/01/2015 18:41
6|3|28|YELLOW|1/01/2015 18:41|1/01/2015 21:00
7|3|28|BLUE|1/01/2015 21:00|9/01/2015 9:30
8|4|36|BLUE|1/01/2015 19:30|3/01/2015 11:00
9|5|40|GREEN|2/01/2015 19:08|2/01/2015 19:08
10|5|40|ORANGE|2/01/2015 19:08|3/01/2015 2:43
11|5|40|PURPLE|3/01/2015 2:43|4/01/2015 16:44
12|6|48|GREEN|3/01/2015 11:52|3/01/2015 11:52
13|6|48|YELLOW|3/01/2015 11:52|3/01/2015 17:45
14|6|48|RED|3/01/2015 17:45|7/01/2015 10:00

creating nodes for Student, Room and Visit where Visit is the event of a student entering a room uniquely identified by the ID property

CREATE CONSTRAINT ON (student:Student) ASSERT student.studentID IS UNIQUE;
CREATE CONSTRAINT ON (room:Room) ASSERT room.roomID IS UNIQUE;
CREATE CONSTRAINT ON (visit:Visit) ASSERT visit.visitID IS UNIQUE;


USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///dorm.csv" as line fieldterminator '|'
MERGE (student:Student {studentID: line.SID})
MERGE (room:Room {roomID: line.ROOM})
MERGE (visit:Visit {visitID: line.ID, roomID: line.ROOM, studentID: line.SID, ticketID: line.EID})
create (student)-[:VERB]->(visit)-[:OBJECT]->(room)

Creating a PREV relationship allows the ordering or sequencing that the student travels in. This uses data in the file dormprev.csv. If a student has only visited a single room, this ID will not appear in the dormprev file as its purpose is to link/chain visits. Data as below

ID|PREV_ID|EID
3|2|18
4|3|18
6|5|28
7|6|28
10|9|40
11|10|40
13|12|48
14|13|48

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///dormprev.csv" as line fieldterminator '|'
MATCH (new:Visit {visitID: line.ID})
MATCH (old:Visit {visitID: line.PREV_ID})
MERGE (new)-[:PREV]->(old)

I can view all student journeys by the below query

MATCH (student:Student)-[:VERB]->(visit:Visit)-[:OBJECT]-(room:Room)
RETURN student, visit, room

However, I have no idea how to return all of the rooms in a complete path.

if I run this query

MATCH p = (:Visit)<-[:PREV]-(:Visit) return p

I can see that it, for example, for student ID 2 returns Green and Yellow and then Yellow and Blue as a separate pair - I want to view that as Green, Yellow, Blue

This also means that if I run the below query:

MATCH p = (:Visit)<-[:PREV]-(:Visit)
WITH p, EXTRACT(v IN NODES(p) | v.roomID) AS rooms
UNWIND rooms AS stays
WITH p, COUNT(DISTINCT stays) AS distinct_stays
WHERE distinct_stays = LENGTH(NODES(p))
RETURN EXTRACT(v in NODES(p) | v.roomID), count(p)
ORDER BY count(p) DESC

it will return a count of those pairings rather than count of "whole paths" if that makes sense.

For example, SID 2 and SID 3 both visit rooms GREEN, YELLOW, BLUE in that order. SID 5 visits GREEN, ORANGE, PURPLE in that order.

What I'm hoping to see is:

[GREEN, YELLOW, BLUE] 2
[GREEN, ORANGE, PURPLE] 1

etc. Is that possible with the above model and if so can anyone please help point me in the right direction? The number of rooms that are visited is not guaranteed and can be anything from one to *. However, if only one room is visited, that's not really of interest and so is the reason why I thought this model might make sense (again, stolen from a blog post series).

I don't know if the above makes sense but any help would be much appreciated - this makes for an excellent use case and would be really useful.

Thank you for your kind help.

score 0 · Accepted Answer · answered May 24 '17 at 05:54

0

What I think you are looking for is variable path length. And you can accomplish that by merely changing this in your query (note the asterisk) :

MATCH p = (:Visit)<-[:PREV*]-(:Visit)

Do allow me a couple of further remarks. Yes, I understand the convenience of having roomID and studentID in the Visit node (keeps this specific query quite a bit simpler), but you are ignoring the whole point of having relationships in the first place (in fact, if you do it this way there's currently actually no point in having the Student and Room nodes at all) and you are going to have trouble maintaining them. Secondly ... if we are going to be splitting the proverbial 3rd normal form hairs ;-), then the relations for a Visit should actually be created as follows (note the direction of the relationships) :

CREATE (student)-[:VERB]->(visit)<-[:OBJECT]-(room)

Other than that I must say you're moving very fast :-)

Hope this helps, Tom

answered May 24 '17 at 05:54

Tom Geudens

2,638
9
15

Thanks heaps Tom, much appreciated! There's a joke to be made about the point of relationships and I'm showing remarkable restraint by refraining ;-) I've had to side track onto something else at the moment but am hoping to look at this on the weekend. Regarding changing the query to variable path length, I guess that would almost solve the problem but wouldn't that also include sub-paths whereas I am interested in only the "complete" journey? That is, G Y B would actually be counted three times with this query, [{G, Y},{Y,B},{GYB}] instead of just GYB. Thanks heaps Tom. Will post back Monday – c95mbq May 25 '17 at 23:13
Yes, there definitely is "a point to relationships" :-). Anyway, you might want to add some extra requirements so the starting point does not have an incoming [:PREV] and the ending point does not have an outgoing [:PREV]. That would make sure you only get complete journeys. – Tom Geudens May 26 '17 at 04:31
I ended up changing things and created a NEXT instead of PREV. Your comment along with another SO post I came across resulted in the below query, which seems to return the expected result, at least with the small sample data: `MATCH p = (from:Visit)-[:NEXT*]->(to:Visit) where not((to)-[:NEXT]->()) and not(()-[:NEXT]->(from)) WITH p, EXTRACT(v IN NODES(p) | v.roomID) AS rooms UNWIND rooms AS stays WITH p, COUNT(DISTINCT stays) AS distinct_stays WHERE distinct_stays = LENGTH(NODES(p)) RETURN EXTRACT(v in NODES(p) | v.roomID), count(p)` Thanks for all your help! – c95mbq May 30 '17 at 04:06

score 0 · Answer 2 · answered May 24 '17 at 08:28

0

Building a bit on Tom's suggestions, you might consider an alternate model doing away with :Visit nodes completely, and making your relationship types a bit more focused, like this:

(:Student)-[:VISITED]->(:Room)

You can set entered and left properties on the :VISITED relationship, which will allow you to order the relationships (and corresponding :Rooms) in visited order.

Here's an alternate import that will do this, using APOC Procedures (you'll have to install the correct version corresponding with your Neo4j version) to parse out timestamps from your date strings.

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:///dorm.csv" as line fieldterminator '|'
MERGE (student:Student {studentID: line.SID})
MERGE (room:Room {roomID: line.ROOM})
WITH student, room, apoc.date.parse(line.ENTERS, 'ms', 'MM/dd/yyyy HH:mm') as entered, apoc.date.parse(line.LEAVES, 'ms', 'MM/dd/yyyy HH:mm') as left
CREATE (student)-[r:VISITED]->(room)
SET r.entered = entered, r.left = left

And now your query to get all paths and the number of students who have taken those paths becomes very easy:

MATCH (s:Student)-[v:VISITED]->(r:Room)
WHERE size((s)-[:VISITED]->()) > 1
WITH s, r
ORDER BY v.entered ASC
WITH s, collect(r.roomID) as rooms
RETURN rooms, count(s)

answered May 24 '17 at 08:28

InverseFalcon

29,576
4
38
51

Thanks for your help InverseFalcon, much appreciated. I haven't had a chance to test this yet but am hoping to do so in the next couple of days and will post back. Yours and Tom's help is unbelievably important for noobs such as myself. Having worked with SQL for such a long time, I'm struggling getting my head around this and am still waiting and hoping for the light bulb moment. Seems to be such undeniable value that it's worth persisting and such friendly and generous help goes a long way. Will post back as soon as I've had a chance to get back to this. Cheers – c95mbq May 25 '17 at 23:19
Glad to help! You may also want to consider the [neo4j-users slack](https://neo4j.com/developer/slack/) for additional resources and advice. Also make sure to go through some of the built-in tutorials offered through the neo4j browser itself, those are quite rich! – InverseFalcon May 25 '17 at 23:38
Thanks InverseFalcon. I've registered on slack and am sure it'll be of great use. I think I grabbed the wrong version apoc as I wasn't able to test your query but am definitely revisiting it when things die down a bit. Torn whether to accept yours or Tom's answer but went with Tom's, wish I could pick both! Thanks for all your help, again, will revisit apoc and your solution when things are quieter. Cheers – c95mbq May 30 '17 at 04:10
Belated comment but couldn't get apoc to work until I came across this post https://stackoverflow.com/questions/36897634/there-is-no-procedure-with-the-name-apoc-help-registered-for-this-database-ins/36934404#36934404 – c95mbq Jun 20 '17 at 05:39

Returning a count of and all complete paths in Neo4j

2 Answers2