Optimizing py2neo's cypher insertion

Question

I am using py2neo to import several hundred thousand nodes. I've created a defaultdict to map neighborhoods to cities. One motivation was to import these relationships more efficiently, after being unsuccessful with Neo4j's load tool.

Because the batch documentation advises against using it, I veered away from an implementation like the one by the OP of this post. Instead, the documentation suggests using Cypher. However, I like being able to create nodes from the defaultdict I have created. Plus, I found it too difficult to import this information the way the first link demonstrates.

To speed up the import, should I create a Cypher transaction (and submit every 10,00) instead of the following loop?

for city_name, neighborhood_names in city_neighborhood_map.iteritems():
    city_node = graph.find_one(label="City", property_key="Name", property_value=city_name)
    for neighborhood_name in neighborhood_names:
        neighborhood_node = Node("Neighborhood", Name=neighborhood_name)
        rel = Relationship(neighborhood_node, "IN", city_node)
        graph.create(rel)

I get a time-out, and it appears to be pretty slow when I do the following. Even when I break up the transaction so it commits every 1,000 neighborhoods, it still processes very slowly.

tx = graph.cypher.begin()
statement = "MERGE (city {Name:{City_Name}}) CREATE (neighborhood { Name : {Neighborhood_Name}}) CREATE (neighborhood)-[:IN]->(city)"
for city_name, neighborhood_names in city_neighborhood_map.iteritems():
    for neighborhood_name in neighborhood_names:
        tx.append(statement, {"City_Name": city_name, "Neighborhood_Name": neighborhood_name})
tx.commit()

It would be fantastic to save pointers to each city so I don't need to look it up each time with the merge.

score 2 · Accepted Answer · answered Jun 03 '15 at 07:26

It may be faster to do this in two runs, i.e. CREATE all nodes first with unique constraints (which should be very fast) and then CREATE the relationships in a second round.

Create the constraints first, using the labels City and Neighborhood; the index behind each constraint makes the later MATCH faster:

graph.schema.create_uniqueness_constraint('City', 'Name')
graph.schema.create_uniqueness_constraint('Neighborhood', 'Name')

Create all nodes:

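tx = graph.cypher.begin()

statement = "CREATE (:City {Name: {name}})"
for city_name in city_neighborhood_map.keys():
    tx.append(statement, {"name": city_name})

# Collect every neighborhood name across all cities. The set drops repeated
# names, which the uniqueness constraint above would otherwise reject.
all_neighborhood_names = set(
    name for names in city_neighborhood_map.values() for name in names
)

statement = "CREATE (:Neighborhood {Name: {name}})"
for neighborhood_name in all_neighborhood_names:
    tx.append(statement, {"name": neighborhood_name})

tx.commit()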

Relationships should be fast now (fast MATCH due to constraints/index):

tx = graph.cypher.begin()
statement = "MATCH (city:City {Name: {City_Name}}), (n:Neighborhood {Name: {Neighborhood_Name}}) CREATE (n)-[:IN]->(city)"
for city_name, neighborhood_names in city_neighborhood_map.iteritems():
    for neighborhood_name in neighborhood_names:
        tx.append(statement, {"City_Name": city_name, "Neighborhood_Name": neighborhood_name})

tx.commit()
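
The question mentions committing every 1,000 neighborhoods rather than in one large transaction. A minimal sketch of that batching pattern follows. The helper name run_in_batches, the FakeTx stand-in and the sample city names are made up for illustration; the only py2neo calls assumed are the ones already used above (graph.cypher.begin(), tx.append(), tx.commit()). It is written with .items() so it also runs on Python 3, where iteritems() does not exist.

```python
def run_in_batches(begin_tx, statement, param_dicts, batch_size=1000):
    """Append statements to a transaction, committing every batch_size statements.

    begin_tx is any zero-argument callable that returns an object with
    append(statement, parameters) and commit(), for example graph.cypher.begin.
    Returns the number of commits performed.
    """
    commits = 0
    tx = None
    pending = 0
    for params in param_dicts:
        if tx is None:
            tx = begin_tx()  # open a new transaction lazily
        tx.append(statement, params)
        pending += 1
        if pending == batch_size:
            tx.commit()
            commits += 1
            tx, pending = None, 0
    if tx is not None:  # commit the final, partially filled batch
        tx.commit()
        commits += 1
    return commits


class FakeTx(object):
    """Stand-in for a py2neo transaction so the sketch runs without a server."""
    log = []  # number of statements in each committed batch

    def __init__(self):
        self.statements = []

    def append(self, statement, parameters):
        self.statements.append((statement, parameters))

    def commit(self):
        FakeTx.log.append(len(self.statements))


# Hypothetical sample data standing in for the question's defaultdict.
city_neighborhood_map = {
    "Springfield": ["Downtown", "Eastside"],
    "Shelbyville": ["Downtown"],
}
statement = ("MATCH (city:City {Name: {City_Name}}), "
             "(n:Neighborhood {Name: {Neighborhood_Name}}) "
             "CREATE (n)-[:IN]->(city)")
pairs = [
    {"City_Name": city, "Neighborhood_Name": neighborhood}
    for city, neighborhoods in city_neighborhood_map.items()
    for neighborhood in neighborhoods
]
commits = run_in_batches(FakeTx, statement, pairs, batch_size=2)
```

Against a real database, passing graph.cypher.begin in place of FakeTx should give the same batching behaviour; whether 1,000 is a good batch size depends on the server and is worth measuring.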

Martin Preusse

As I mentioned in this link (http://stackoverflow.com/questions/30444845/how-can-i-efficiently-create-unique-relationships-in-neo4j), neighborhoods aren't unique, so I can't create a uniqueness constraint. Because of this, I can't create cities and neighborhoods separately, and need to merge on the relationship. – NumenorForLife Jun 03 '15 at 12:03
Why are they not unique? Just add them by name, if a neighborhood is in multiple cities, the city-neighborhood relationship contains that information. – Martin Preusse Jun 03 '15 at 12:58
Because a neighborhood named "Downtown" can appear in multiple cities. I think there should be a unique node to represent each neighborhood named "Downtown." Do you agree with this structure? – NumenorForLife Jun 03 '15 at 13:52
Depends on the queries. Why not just Downtown and a link between a city and Downtown? – Martin Preusse Jun 03 '15 at 14:07
Because there's unique information about each Downtown (like population size). I think it would be better to have a distinct node for each downtown. – NumenorForLife Jun 03 '15 at 14:30
create distinct nodes then.. creating nodes first and relationships later is still faster. – Martin Preusse Jun 03 '15 at 14:54
"Because there's unique information about each Downtown" : then you can put the information as property in the relationship. – bendaizer Jun 09 '15 at 12:13
Is it necessary to commit if I don't use transactions? When I newly create a database, everything works fine. But once I restart the neo4j community server, the results are wrong, even the number of relationships and nodes are wrong. What can be the issue? – Nitin Apr 06 '16 at 22:34
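
For the case raised in the comments, where every "Downtown" must be its own node, one possible variation is to keep the uniqueness constraint on City only and create each neighborhood together with its relationship, so only the city is looked up through the index. The sketch below is an assumption, not something from the thread: the statement text, the helper neighborhood_params and the sample data are illustrative, and the part that needs a live py2neo 2.x graph is left as a comment. It uses .items() so it also runs on Python 3.

```python
# Hypothetical sample data standing in for the question's defaultdict.
city_neighborhood_map = {
    "Springfield": ["Downtown", "Eastside"],
    "Shelbyville": ["Downtown"],
}

# Assumed statement: find the city by label and indexed Name, then create a
# fresh neighborhood node and its IN relationship in the same step.
# This requires that no uniqueness constraint exists on Neighborhood.Name.
STATEMENT = (
    "MATCH (city:City {Name: {City_Name}}) "
    "CREATE (:Neighborhood {Name: {Neighborhood_Name}})-[:IN]->(city)"
)


def neighborhood_params(mapping):
    """Yield one parameter dict per (city, neighborhood) pair."""
    for city_name, neighborhood_names in mapping.items():
        for neighborhood_name in neighborhood_names:
            yield {"City_Name": city_name,
                   "Neighborhood_Name": neighborhood_name}


params = list(neighborhood_params(city_neighborhood_map))

# With a live graph (not executed here), the statements would be queued
# the same way as in the answer:
#   tx = graph.cypher.begin()
#   for p in params:
#       tx.append(STATEMENT, p)
#   tx.commit()
```

Each pair produces its own CREATE, so two cities that both have a "Downtown" end up with two separate Neighborhood nodes.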
