How can I efficiently create unique relationships in Neo4j?

Question

Following up on my question here, I would like to create a constraint on relationships. That is, I would like multiple nodes to be able to share the same "neighborhood" name, while each one points to the particular city in which it resides.

As encouraged in user2194039's answer, I am using the following index:

CREATE INDEX ON :Neighborhood(name)

Also, I have the following constraint:

CREATE CONSTRAINT ON (c:City) ASSERT c.name IS UNIQUE;

The following code fails to create unique relationships, and takes an excessively long period of time:

USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM "file://THEFILE" as line
WITH line
WHERE line.Neighborhood IS NOT NULL
WITH line
MATCH (c:City { name : line.City})
MERGE (c)<-[:IN]-(n:Neighborhood {name : toInt(line.Neighborhood)});

Note that there is a uniqueness constraint on City, but NOT on Neighborhood (because there should be multiple ones).

Profile with Limit 10,000:

+--------------+-------+--------+----------------------------------+------------------------+
|     Operator |  Rows | DbHits |                      Identifiers |                  Other |
+--------------+-------+--------+----------------------------------+------------------------+
|  EmptyResult |     0 |      0 |                                  |                        |
|  UpdateGraph |  9750 |   3360 | anon[307], b, neighborhood, line |           MergePattern |
|  SchemaIndex |  9750 |  19500 |                          b, line | line.City; :City(name) |
| ColumnFilter |  9750 |      0 |                             line |      keep columns line |
|       Filter |  9750 |      0 |                  anon[220], line |              anon[220] |
|      Extract | 10000 |      0 |                  anon[220], line |              anon[220] |
|        Slice | 10000 |      0 |                             line |           {  AUTOINT0} |
|      LoadCSV | 10000 |      0 |                             line |                        |
+--------------+-------+--------+----------------------------------+------------------------+

Total database accesses: 22860

Following Guilherme's recommendation below, I implemented the helper, yet it raises the error py2neo.error.Finished. I've searched the documentation and wasn't able to find a workaround. It looks like there's an open SO post about this exception.

def run_batch_query(queries, timeout=None):
    if timeout:
        http.socket_timeout = timeout
    try:
        graph = Graph()
        authenticate("localhost:7474", "account", "password")
        tx = graph.cypher.begin()
        for query in queries:
            statement, params = query
            tx.append(statement, params)
            results = tx.process()
            tx.commit()
    except http.SocketError as err:
        raise err
    except error.Finished as err:
        raise err
    collection = []
    for result in results:
        records = []
        for record in result:
            records.append(record)
        collection.append(records)
    return collection

main:

queries = []
template = ["MERGE (city:City {Name:{city}})", "Merge (city)<-[:IN]-(n:Neighborhood {Name : {neighborhood}})"]
statement = '\n'.join(template)
batch = 5000
c = 1
start = time.time()

# city_neighborhood_map is a defaultdict that maps city-> set of neighborhoods
for city, neighborhoods in city_neighborhood_map.iteritems():
    for neighborhood in neighborhoods:
        params = dict(city=city, neighborhood=neighborhood)
        queries.append((statement, params))
        c +=1
        if c % batch == 0:
            print "running batch"
            print c
            s = time.time()*1000
            r = run_batch_query(queries, 10)
            e = time.time()*1000
            print("\t{0}, {1:.00f}ms".format(c, e-s))
            del queries[:]

print c
if queries:
    s = time.time()*1000 
    r = run_batch_query(queries, 300)
    e = time.time()*1000
    print("\t{0} {1:.00f}ms".format(c, e-s))
end = time.time()
print("End. {0}s".format(end-start))

maybe with CREATE UNIQUE ? http://neo4j.com/docs/stable/query-create-unique.html — Christophe Willemsen, May 25 '15 at 22:03
How many records are you trying to import at a time from those csv files? — justinpawela, May 26 '15 at 17:39

Guilherme · Accepted Answer · 2015-06-04T04:27:49.677

If you want to create unique relationships you have 2 options:

1. Prevent the path from being duplicated, using MERGE, just like @user2194039 suggested. I think this is the simplest and best approach you can take.
2. Turn your relationship into a node, and create a unique constraint on it. But this is hardly necessary in most cases.

If you're having trouble with speed, try using the transactional endpoint. I tried importing your data (random cities and neighbourhoods) through LOAD CSV in 2.2.1, and it was slow as well, though I am not sure why. If you send your queries with parameters to the transactional endpoint in batches of 1000-5000, you can monitor the process, and probably gain a performance boost. I managed to import 1M rows in just under 11 minutes.

I used an INDEX for Neighbourhood(name) and a unique constraint for City(name). Give it a try and see if it works for you.

Edit:

The transactional endpoint is a RESTful endpoint that allows you to execute transactions in batches. You can read about it here. Basically, it allows you to send a bunch of queries to the server at once.
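To make the request shape concrete, here is a minimal sketch of the JSON body the transactional endpoint accepts when several parameterized statements go out in one request. The `build_payload` helper name is hypothetical. The `statements`, `statement` and `parameters` keys follow the Neo4j 2.x HTTP API, where such a body is POSTed to `/db/data/transaction/commit`. Nothing is sent here; the sketch only builds the body.

```python
import json


def build_payload(queries):
    """Build the request body for a list of (statement, params) pairs.

    Hypothetical helper: it only assembles the JSON structure and does not
    talk to a server.
    """
    return {
        "statements": [
            {"statement": statement, "parameters": params}
            for statement, params in queries
        ]
    }


statement = "\n".join([
    "MERGE (c:City {name: {city}})",
    "MERGE (c)<-[:IN]-(n:Neighborhood {name: {neighborhood}})",
])

# Two rows that share a neighborhood name but belong to different cities.
queries = [
    (statement, {"city": "Boston", "neighborhood": 3}),
    (statement, {"city": "Seattle", "neighborhood": 3}),
]

payload = build_payload(queries)
body = json.dumps(payload)  # this string would be the HTTP request body
```

Because the statement text is identical for every row and only the parameters change, the server can reuse the query plan, which is part of why parameterized batches tend to be faster than queries with inlined values.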

I don't know what programming language/stack you're using, but in Python, using a package like py2neo, it would be something like this:

import csv
import time

import neo4j  # assumed: the local module that holds the run_batch_query helper

with open("city.csv", "r") as fp:
    reader = csv.reader(fp)

    queries = []
    template = ["MERGE (c :`City` {name: {city}})",
                "MERGE (c)<-[:IN]-(n :`Neighborhood` {name: {neighborhood}})"]
    statement = '\n'.join(template)

    batch = 5000  # statements per transaction
    c = 1         # 1-based row counter
    start = time.time()

    for row in reader:
        city, neighborhood = row
        params = dict(city=city, neighborhood=neighborhood)
        queries.append((statement, params))

        # Send a full batch in one transaction and report how long it took.
        if c % batch == 0:
            s = time.time()*1000
            r = neo4j.run_batch_query(queries, 10)
            e = time.time()*1000
            print("\t{0}, {1:.00f}ms".format(c, e-s))
            del queries[:]

        c += 1

    # Send whatever is left over after the last full batch.
    if queries:
        s = time.time()*1000
        r = neo4j.run_batch_query(queries, 300)
        e = time.time()*1000
        print("\t{0} {1:.00f}ms".format(c, e-s))

    end = time.time()
    print("End. {0}s".format(end-start))

Helper functions:

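# Imports as laid out in py2neo 2.0.x
from py2neo import Graph
from py2neo.packages.httpstream import http

def run_batch_query(queries, timeout=None):
    # Optionally adjust the socket timeout for slow batches.
    if timeout:
        http.socket_timeout = timeout

    try:
        graph = Graph(uri) # "{protocol}://{host}:{port}/db/data/"
        tx = graph.cypher.begin()

        for query in queries:
            statement, params = query
            tx.append(statement, params)

        # Process and commit once, after every statement has been appended.
        results = tx.process()
        tx.commit()

    except http.SocketError as err:
        raise err

    # Copy the records of each result into plain lists.
    collection = []
    for result in results:
        records = []
        for record in result:
            records.append(record)
        collection.append(records)

    return collection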

This lets you monitor how long each transaction takes, and you can tweak the number of queries per transaction, as well as the timeout.
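If you want to experiment with different batch sizes, the counting logic can be pulled out of the import loop. The following is only a sketch: `chunked` and `run_in_batches` are hypothetical names, and `run_batch` stands in for whatever actually sends a batch (for example a wrapper around `run_batch_query` with a fixed timeout).

```python
import time


def chunked(items, size):
    """Yield successive lists holding at most `size` items each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        # Final, possibly shorter, batch.
        yield batch


def run_in_batches(queries, size, run_batch):
    """Call run_batch once per batch; return a list of (count, seconds)."""
    timings = []
    for batch in chunked(queries, size):
        started = time.time()
        run_batch(batch)
        timings.append((len(batch), time.time() - started))
    return timings


# Usage with a stand-in for the real call, so the sketch runs without a server.
sent = []
timings = run_in_batches(list(range(12)), 5, sent.append)
batch_sizes = [count for count, _ in timings]
```

Each batch goes out as its own transaction, so a failure only affects the batch in flight, and the per-batch timings show whether a larger or smaller size works better for your data.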

What's a transactional endpoint? Would you mind elaborating on this in your answer? — NumenorForLife, Jun 01 '15 at 19:29
I've spent a few hours playing around with this, but haven't gotten it to work out. See my updated post above for the issues that arose. Why did you choose to use backticks in the template? — NumenorForLife, Jun 03 '15 at 14:51
JSC, if you read the py2neo doc, that error means that you're trying to use an object that is no longer available, i.e. a transaction that has been closed. Check your code, and make sure you don't try to re-use a committed transaction, and if the time span between transactions is long, create a new one each time. The code executes well with py2neo 2.0.x and neo4j 2.2.x. I suggest you look at the neo4j doc, and that of the driver you intend to use, to learn how to interact with the transactional endpoint. — Guilherme, Jun 04 '15 at 03:05
thanks for all your help. Before I award you the bounty and select this as the correct answer, could you please un-indent the results = tx.process() and tx.commit() lines. That resolved the error.Finished mentioned above. Also, I think you should remove the "true" value passed to run_batch_query so as to only pass two arguments. — NumenorForLife, Jun 04 '15 at 04:04

justinpawela · Answer 2 · 2015-06-01T16:09:19.563

To be sure we're on the same page, this is how I understand your model: Each city is unique and should have some number of neighborhoods pointing to it. The neighborhoods are unique within the context of a city, but not globally. So if you have a neighborhood 3 [IN] city Boston, you could also have a neighborhood 3 [IN] city Seattle, and both of those neighborhoods are represented by different nodes, even though they have the same name property. Is that correct?
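The model described above can be sketched in plain Python, with no database involved: a neighborhood is identified by the pair (city, name), not by its name alone. The `group_by_city` helper and the sample rows are made up for illustration. The resulting mapping has the same shape as the `city_neighborhood_map` mentioned in the question.

```python
from collections import defaultdict


def group_by_city(rows):
    """Map each city to the set of neighborhood names that appear with it.

    Rows without a neighborhood are skipped, as in the question's
    `WHERE line.Neighborhood IS NOT NULL` filter.
    """
    city_map = defaultdict(set)
    for city, neighborhood in rows:
        if neighborhood is None:
            continue
        city_map[city].add(neighborhood)
    return city_map


# Illustrative rows: (city, neighborhood); one duplicate and one missing value.
rows = [
    ("Boston", 3),
    ("Seattle", 3),
    ("Boston", 3),
    ("Boston", 7),
    ("Seattle", None),
]

city_map = group_by_city(rows)

# One Neighborhood node per (city, name) pair, so "3" counts once per city.
neighborhood_nodes = sum(len(names) for names in city_map.values())
```

Here Boston ends up with two neighborhoods and Seattle with one, even though the name 3 occurs in both cities, which is the result the MERGE on the whole pattern is meant to produce.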

Before importing, I would recommend adding an index to your neighborhood nodes. You can add the index without enforcing uniqueness. I have found that this greatly increases speeds on even small databases.

CREATE INDEX ON :Neighborhood(name)

And for the import:

USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file://THEFILE" as line
MERGE (c:City {name: line.City})
MERGE (c)<-[:IN]-(n:Neighborhood {name: toInt(line.Neighborhood)})

If you are importing a large amount of data, it may be best to use the USING PERIODIC COMMIT command to commit periodically while importing. This will reduce the memory used in the process, and if your server is memory-constrained, I could see it helping performance. In your case, with almost a million records, this is recommended by Neo4j. You can even adjust how often the commit happens by doing USING PERIODIC COMMIT 10000 or such. The docs say 1000 is the default. Just understand that this will break the import into several transactions.

Best of luck!

edited Jun 01 '15 at 16:09

answered May 26 '15 at 21:25

justinpawela


By using a match instead of merge I get rid of an eager that appears – NumenorForLife Jun 01 '15 at 00:04
It's taking over 30 minutes to process the million rows with the code above. It took around 13 seconds to go through the first 10,000 items. I've added the profile of that command above. – NumenorForLife Jun 01 '15 at 00:49
Are you still having problems with duplications, or is it just a speed issue now? – justinpawela Jun 01 '15 at 04:56
I haven't faced a uniqueness issue. It's about speed, taking over 10 hours. It hasn't finished yet. – NumenorForLife Jun 01 '15 at 12:28
Sorry, I must have misunderstood part of your original question. What did you mean when you said your first version of the query "fails to create unique relationships"? – justinpawela Jun 01 '15 at 15:43
I poorly described the issue. I meant that it failed to create unique relationships because it never returned. So it's primarily a speed issue. – NumenorForLife Jun 01 '15 at 15:55
Well I would encourage you again to look at `USING PERIODIC COMMIT`. The more I read from Neo4j on this issue, the more it looks like it's important. See [Batch Your Transactions](http://www.neo4j.org/graphgist?d788e117129c3730a042#_batch_your_transactions) in this guide from Neo4j. The sub-header is "This is really important". – justinpawela Jun 01 '15 at 18:55
