
Updated question:

I'm new to Neo4j and I'm trying to create a movie rating database. I have a file with 100,000 records (2.3 MB) that represents when a user rated a movie; the file looks like this (this dataset is from MovieLens):

[screenshot: dataset file]

I'm using py2neo and I create the database with this code:

import pandas as pd
from datetime import datetime
# `graph` is an existing py2neo Graph connection

data = pd.read_csv('ratings_small.csv')

def create_bipartite_graph(data):
    # Index creation to speed up the MERGE lookups
    graph.run('''
        CREATE INDEX user_index IF NOT EXISTS FOR (u:User) ON (u.UserId)
    ''')
    graph.run('''
        CREATE INDEX movie_index IF NOT EXISTS FOR (m:Movie) ON (m.MovieId)
    ''')
    for index, row in data.iterrows():
        userID = int(row['userId'])
        movieID = int(row['movieId'])
        rating = row['rating']
        date = datetime.fromtimestamp(row['timestamp']).strftime("%m/%d/%Y, %H:%M:%S")

        # Create nodes and relationships (one transaction per row)
        graph.run('''
            MERGE (u:User {UserId: $uID})
            MERGE (m:Movie {MovieId: $mID})
            CREATE (u)-[:RATED_MOVIE {rating: $r, date: $d}]->(m)
        ''', parameters={'uID': userID, 'mID': movieID, 'r': rating, 'd': date})

The problem is that even with that small dataset, it takes more than 2 hours to create the graph. Any advice on how to considerably decrease the database creation time?

Using LOAD CSV:

This is the query that I execute in the Neo4j browser:

EXPLAIN LOAD CSV WITH HEADERS FROM 'file:///ratings_small.csv' AS line FIELDTERMINATOR ','
MERGE (m:Movie {MovieId: toInteger(line.movieId)})
MERGE (u:User {UserId: toInteger(line.userId)})
CREATE (u)-[:RATED_MOVIE {rating: toFloat(line.rating)}]->(m)

And this is the profile plan: [screenshot: profile plan]

suribe06

2 Answers


Every `graph.run` call represents not only a full round trip to the server, but also a completely separate transaction. Working like this, with one such call inside every iteration of the loop, is therefore incredibly inefficient.

I suggest instead taking a look at the bulk data operations API: https://py2neo.readthedocs.io/en/stable/bulk/index.html
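As a rough sketch of that approach (assuming py2neo 2021.x, an existing `graph` connection, and the DataFrame columns from the question — not a tested drop-in replacement), the bulk helpers let you send thousands of rows per server call instead of one:

```python
from itertools import islice

def batched(rows, size=10000):
    """Yield lists of at most `size` items from any iterable."""
    it = iter(rows)
    while chunk := list(islice(it, size)):
        yield chunk

def load_ratings(graph, data):
    """Bulk-load a MovieLens ratings DataFrame
    (columns: userId, movieId, rating, timestamp)."""
    # deferred import so batched() works even without py2neo installed
    from py2neo.bulk import merge_nodes, create_relationships

    # one server call per batch of unique users/movies instead of per row
    for chunk in batched(int(u) for u in data['userId'].unique()):
        merge_nodes(graph.auto(), [[u] for u in chunk],
                    ("User", "UserId"), keys=["UserId"])
    for chunk in batched(int(m) for m in data['movieId'].unique()):
        merge_nodes(graph.auto(), [[m] for m in chunk],
                    ("Movie", "MovieId"), keys=["MovieId"])

    # relationships as (start, properties, end) triples, matched by key
    triples = ((int(r.userId), {"rating": float(r.rating)}, int(r.movieId))
               for r in data.itertuples())
    for chunk in batched(triples):
        create_relationships(graph.auto(), chunk, "RATED_MOVIE",
                             start_node_key=("User", "UserId"),
                             end_node_key=("Movie", "MovieId"))
```

The key design point is that `merge_nodes` and `create_relationships` each take a whole batch, so the per-transaction overhead is amortized over thousands of rows.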

Nigel Small

You are starting with a CSV file, right? You could use LOAD CSV to bring it into Neo4j. Use your Python code to put the CSV in the Neo4j import directory, then run this query ...

LOAD CSV WITH HEADERS FROM 'file:///ratings_small.csv' AS line FIELDTERMINATOR ','
MERGE (m:Movie {MovieId: toInteger(line.movieId), rating: toFloat(line.rating)})

You can add this at the start if you want to commit every 5000 rows instead of in one huge transaction:

USING PERIODIC COMMIT 5000 ....

This should run much faster!
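Putting the pieces of this answer together, a full version of the query might look like the sketch below (column names taken from the question's CSV; note that USING PERIODIC COMMIT must be the very first clause, and the EXPLAIN prefix from the question is dropped when actually running it):

```cypher
USING PERIODIC COMMIT 5000
LOAD CSV WITH HEADERS FROM 'file:///ratings_small.csv' AS line FIELDTERMINATOR ','
MERGE (m:Movie {MovieId: toInteger(line.movieId)})
MERGE (u:User {UserId: toInteger(line.userId)})
CREATE (u)-[:RATED_MOVIE {rating: toFloat(line.rating)}]->(m)
```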

David A Stumpf
  • For that MERGE, you'll need an index on `:Movie(MovieId)` or insertions will keep slowing down – InverseFalcon Apr 12 '21 at 03:59
  • You might want to preface this by adding a dummy node, create an index so the merge runs quicker, and then delete the dummy node after the load. – David A Stumpf Apr 12 '21 at 03:59
  • @InverseFalcon In order to create an index, don't I need all the nodes to be created? Or can I create a dummy node as David suggests? – suribe06 Apr 12 '21 at 04:34
  • @DavidAStumpf I will try the `LOAD CSV` to see if it works, thanks – suribe06 Apr 12 '21 at 04:36
  • @DavidAStumpf `LOAD CSV` generates an error. I tried this [1] but it still doesn't work. [1]: https://stackoverflow.com/questions/56573653/neo-clienterror-statement-externalresourcefailed – suribe06 Apr 12 '21 at 05:24
  • @user14700833 No, you can create the index ahead of time, it doesn't require the nodes to exist first. It will update as data is added. – InverseFalcon Apr 12 '21 at 05:43
  • @InverseFalcon I created the indexes for :User(UserId) and :Movie(MovieId) and it still takes too long. A test of 10 records takes 1 second. I don't want to imagine how long it will take for the 100,000 records – suribe06 Apr 12 '21 at 05:56
  • That doesn't sound right. 10k should take a second or less, 100k shouldn't be a problem at all. This is still with a single LOAD CSV? Do an EXPLAIN of the query, and add the query plan to your question. – InverseFalcon Apr 12 '21 at 06:06
  • @InverseFalcon I already edited the code in my question to reflect what I am currently doing. Cypher uses the indexes in queries (MATCH and MERGE in my case) if they exist; there is no need to specify them in the query. I don't use `LOAD CSV` because I get an error (see my other comment) – suribe06 Apr 12 '21 at 06:57
  • @suribe06 I was referring to how we use [EXPLAIN to view a profile plan](https://neo4j.com/docs/cypher-manual/current/query-tuning/query-options/#how-do-i-profile-a-query) to verify that the indexes are being used, and that there isn't anything surprising happening in the query. Also, you should have a single query to address the entire input (or at least batch it) using UNWIND, since using a separate query per entry isn't going to be performant. – InverseFalcon Apr 12 '21 at 07:01
  • Can you show us your whole query so we can better coach you. Are you putting the file in the Neo4j import directory? Are you just using the file name and not the path in the query? LOAD CSV should run quickly. – David A Stumpf Apr 12 '21 at 07:15
  • @DavidAStumpf @InverseFalcon I updated the question in both cases (using `LOAD CSV` and without) – suribe06 Apr 12 '21 at 09:07
  • Thanks, the query plan looks fine. Two approaches you could use: a single run exactly with this query, or divide it into multiple runs, one to create all the nodes, the other to MATCH the already-created nodes and create the relationships. If you know the rels don't exist, and if you won't be creating duplicates from the CSV, then use CREATE instead of MERGE for the rels. – InverseFalcon Apr 12 '21 at 09:56
  • You might also separate this into 3 sequential queries. That may speed it up. That is, do a LOAD CSV for each node label and then one for the relationships. That's what I generally do. – David A Stumpf Apr 13 '21 at 18:03
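InverseFalcon's UNWIND suggestion from the comments above can be sketched as follows (a hypothetical illustration, assuming an existing py2neo `graph` connection and the column names from the question): batch the rows client-side and send one parameterized query per batch, so each transaction covers thousands of rows instead of one.

```python
# One UNWIND query handles a whole batch of rows in a single transaction
UNWIND_QUERY = '''
    UNWIND $rows AS row
    MERGE (u:User {UserId: row.uID})
    MERGE (m:Movie {MovieId: row.mID})
    CREATE (u)-[:RATED_MOVIE {rating: row.r, date: row.d}]->(m)
'''

def rows_to_params(rows):
    """Convert (userId, movieId, rating, date) tuples into the
    parameter dicts the UNWIND query expects."""
    return [{'uID': int(u), 'mID': int(m), 'r': float(r), 'd': d}
            for u, m, r, d in rows]

def load_in_batches(graph, rows, batch_size=10000):
    """Send one graph.run call per batch instead of one per row."""
    params = rows_to_params(rows)
    for i in range(0, len(params), batch_size):
        graph.run(UNWIND_QUERY,
                  parameters={'rows': params[i:i + batch_size]})
```

With 100,000 rows and a batch size of 10,000, this issues 10 round trips and 10 transactions rather than 100,000 of each, which is where the bulk of the original slowdown comes from.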