
I am evaluating the Neo4j Community edition with data that has millions of nodes and relationships. I wrote a threaded application to write the data in parallel. With four threads running in parallel, the total write time dropped by a factor of 4, but so did the number of nodes actually written to the DB.

So it seems parallel writing is not really happening. There are no dependencies between the data handled by each thread, and no errors are thrown.

It behaves like this:

  1. Without threading, I write 100k nodes in 1 hour.
  2. With 4 threads, the run for 100k nodes finishes in 15 minutes, but only 25k nodes end up in the database.

I am using Python and call thread.join() to wait for each thread to finish.
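
Roughly, my threading setup looks like the sketch below. This is simplified and not my exact code: the connection details, the sample workload, and the way the writes are split into chunks are placeholders, and the real queries are the Cypher statements further down.

import threading
from neo4j import GraphDatabase

# Placeholder connection details.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Placeholder workload; in reality each item is one of the Cypher statements below.
all_writes = [("CREATE (s:Session {session_id: $sid})", {"sid": f"session_id{i}"})
              for i in range(100000)]

def write_chunk(chunk):
    # Each thread opens its own session; only the driver object is shared.
    with driver.session() as session:
        for query, params in chunk:
            session.run(query, params)

num_threads = 4
chunks = [all_writes[i::num_threads] for i in range(num_threads)]
threads = [threading.Thread(target=write_chunk, args=(chunk,)) for chunk in chunks]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait for every thread to finish
driver.close()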

Update: I am adding my queries below for reference.


create (session:Session {session_id: 'session_id1'})
;

match (s:Session) where s.session_id='session_id1' with s
create (e1:Event {insert_id: "insert_id1"}) set e1:SeenPage
create (s)-[:CONTAINS]->(e1)
create (s)-[:FIRST_EVENT]->(e1)
merge (pp1:Properties {value: "sample-url-1"}) set pp1:Page merge (e1)-[:RELATED_TO]->(pp1)
merge (pp2:Properties {value: "sp"}) set pp2:Type merge (e1)-[:RELATED_TO]->(pp2)
;


match (e1:SeenPage) where e1.insert_id='insert_id1' with e1
create (e2:Event {insert_id: "insert_id2"}) set e2:Show merge (e1)-[:NEXT]->(e2) with e2
match (s:Session) where s.session_id='session_id1' with s, e2
create (s)-[:CONTAINS]->(e2)
merge (pp1:Properties {value: "occasions"}) set pp1:Category merge (e2)-[:RELATED_TO]->(pp1)
merge (pp2:Properties {value: "sample-url-2"}) set pp2:Page merge (e2)-[:RELATED_TO]->(pp2)
merge (pp3:Properties {value: "pl"}) set pp3:Type merge (e2)-[:RELATED_TO]->(pp3)
merge (pp4:Properties {value: "child category"}) set pp4:Sub_Category merge (e2)-[:RELATED_TO]->(pp4)
;


match (e2:Show) where e2.insert_id='insert_id2' with e2
create (e3:Event {insert_id: "insert_id3"}) set e3:SeenPage merge (e2)-[:NEXT]->(e3) with e3
match (s:Session) where s.session_id='session_id1' with s, e3
create (s)-[:CONTAINS]->(e3)
merge (pp1:Properties {value: "/p-page-0"}) set pp1:Page merge (e3)-[:RELATED_TO]->(pp1)
merge (pp2:Properties {value: "sp"}) set pp2:Type merge (e3)-[:RELATED_TO]->(pp2)
;


match (e3:SeenPage) where e3.insert_id='insert_id3' with e3
create (e4:Event {insert_id: "insert_id4"}) set e4:Show merge (e3)-[:NEXT]->(e4) with e4
match (s:Session) where s.session_id='session_id1' with s, e4
create (s)-[:CONTAINS]->(e4)
merge (pp1:Properties {value: "rect1"}) set pp1:Category merge (e4)-[:RELATED_TO]->(pp1)
merge (pp2:Properties {value: "/p-page-1"}) set pp2:Page merge (e4)-[:RELATED_TO]->(pp2)
merge (pp3:Properties {value: "pl"}) set pp3:Type merge (e4)-[:RELATED_TO]->(pp3)
merge (pp4:Properties {value: "him"}) set pp4:Sub_Category merge (e4)-[:RELATED_TO]->(pp4)
;


match (e4:Show) where e4.insert_id='insert_id4' with e4
create (e5:Event {insert_id: "insert_id5"}) set e5:SeenPage merge (e4)-[:NEXT]->(e5) with e5
match (s:Session) where s.session_id='session_id1' with s, e5
create (s)-[:CONTAINS]->(e5)
merge (pp1:Properties {value: "/p-page-2"}) set pp1:Page merge (e5)-[:RELATED_TO]->(pp1)
merge (pp2:Properties {value: "sp"}) set pp2:Type merge (e5)-[:RELATED_TO]->(pp2)
;

This data represents a user's journey on a website. The user starts a session and browses pages. Each action the user performs is recorded as an event, and each event has its own unique id. The sequence of events is then connected with the :NEXT and :CONTAINS relationships. Events are not unique, which is why I had to use CREATE rather than MERGE. Event properties are unique; these are created as nodes and linked to the event with a :RELATED_TO relationship.

It's like this

#session contains events 
#events are connected with :next

(Session)-[:CONTAINS]->(Event1)-[:NEXT]->(Event2)<-[:CONTAINS]-(Session)

A session can contain hundreds of events. The current write speed is quite slow: it takes about 4 hours to write the data for 10k sessions, where each session contains 10 events on average. I am writing the data event by event using the Python Bolt driver.
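
To make "event by event" concrete: each event ends up as one driver call, roughly like the simplified, parameterized version of one of the queries above shown here (the parameter names are mine, and driver is the same object as in the threading sketch).

# Simplified version of a single event write, with the literal ids turned into
# parameters. One round trip like this is issued per event.
EVENT_QUERY = """
MATCH (s:Session) WHERE s.session_id = $session_id
CREATE (e:Event {insert_id: $insert_id}) SET e:SeenPage
CREATE (s)-[:CONTAINS]->(e)
MERGE (p:Properties {value: $page}) SET p:Page
MERGE (e)-[:RELATED_TO]->(p)
MERGE (t:Properties {value: $type}) SET t:Type
MERGE (e)-[:RELATED_TO]->(t)
"""

with driver.session() as session:
    session.run(EVENT_QUERY, session_id="session_id1", insert_id="insert_id1",
                page="sample-url-1", type="sp")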

Any help would be really appreciated.

Rheatey Bash
  • Can you provide your load query (or queries), and an EXPLAIN plan for it? – InverseFalcon Apr 28 '21 at 19:17
  • It contains some undisclosable data. I totally understand that you need a query to understand the flow as well. If it is super important, I will try to post it. But my question is on a number of parallel connections in the community version. – Rheatey Bash Apr 28 '21 at 19:42
  • 1
    I would doubt that the parallel connections are involved, but parallel executions of the query may cause lock contention, and possible deadlock, which may explain what you're seeing. As such understanding what the query is doing, and the locks it may be taking, is important. Also you may want to review your logs and look for deadlocks. If your client app isn't using transactional functions, then it's possible transactions are being dropped and never retried when deadlocks occur. – InverseFalcon Apr 29 '21 at 07:00
  • This makes sense. Maybe this is the cause. Let me see it. – Rheatey Bash Apr 29 '21 at 07:53
  • @InverseFalcon I have updated the question. please see. – Rheatey Bash Apr 30 '21 at 18:34
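
A minimal sketch of the transactional-function approach InverseFalcon mentions, reusing driver and EVENT_QUERY from the sketches in the question and assuming a 4.x neo4j Python driver (in 5.x the equivalent call is session.execute_write()). A managed transaction function is retried automatically on transient errors such as deadlocks, whereas a plain session.run() autocommit call is not.

def create_event(tx, session_id, insert_id, page, type_value):
    # tx is a managed transaction; if it fails with a transient error such as
    # a deadlock, the driver calls this function again.
    tx.run(EVENT_QUERY, session_id=session_id, insert_id=insert_id,
           page=page, type=type_value)

with driver.session() as session:
    session.write_transaction(create_event,
                              "session_id1", "insert_id1", "sample-url-1", "sp")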

0 Answers