
This is more of a best-practices question than anything. I am implementing a search back-end for highly structured data that, in essence, consists of ontologies, terms, and a complex set of mappings between them. Neo4j seemed like a natural fit, and after some prototyping I've decided to go with py2neo as the way to communicate with Neo4j, mostly because of its nice support for batch operations.

What I'm getting frustrated with is that I'm having trouble introducing the kind of higher-level abstraction I would like in my code. I can either use the objects directly as a mini-ORM, but then I'm making lots and lots of atomic REST calls, which kills performance (I have a fairly large data set).

Or I can do what I've been doing: getting my query results and using get_properties on them to batch-hydrate my objects, which performs great and is why I went down this route in the first place. But this makes me pass tuples of (node, properties) around in my code, which gets the job done but isn't pretty. At all.
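
Roughly, the batch-hydration pattern looks like this (just a sketch against the py2neo 1.6 API; `Term` stands in for one of my domain classes, and the Cypher query is only an example):

from py2neo import neo4j

gdb = neo4j.GraphDatabaseService("http://localhost:7474/db/data/")

class Term(object):
    """Stand-in for one of my domain classes."""
    def __init__(self, node, properties):
        self.node = node
        self.name = properties.get("name")

rows = neo4j.CypherQuery(gdb, "MATCH (t:Term) RETURN t LIMIT 100").execute()
nodes = [row.values[0] for row in rows]

# One batched call fetches all the property dicts at once, instead of one
# lazy REST call per node (which is what kills performance).
terms = [Term(n, props) for n, props in zip(nodes, gdb.get_properties(*nodes))]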

So I guess what I'm asking is: is there a best practice somewhere for working with a fairly rich object graph in py2neo, getting the niceties of an ORM-like layer while retaining performance (which in my case means doing as much as possible in batch queries)?

thefurman

2 Answers


I am not sure whether I understand exactly what you want, but I had a similar issue: I wanted to make a lot of calls and create a lot of nodes, indexes and relationships (around 1.2 million). Here is an example of adding nodes, relationships, indexes and labels in batches using py2neo:

from py2neo import neo4j, node, rel

gdb = neo4j.GraphDatabaseService("<url_of_db>")
batch = neo4j.WriteBatch(gdb)

a = batch.create(node(name="Alice"))
b = batch.create(node(name="Bob"))

batch.set_labels(a, "Female")
batch.set_labels(b, "Male")

# This will create the index "Name" if it does not already exist.
batch.add_indexed_node("Name", "first_name", "alice", a)
batch.add_indexed_node("Name", "first_name", "bob", b)

batch.create(rel(a, "KNOWS", b))  # adding a relationship in the batch

# Submit the batch to the server. Ideally send around 2k-5k records
# per submission.
batch.submit()
gaurav.singharoy
  • My comment seems a bit off-topic, but here goes anyhow: if you run this once and then check `gdb.order`, it will show 2 (for Alice and Bob). If you run it a second time you'll have order 4, since the nodes are created without checking for uniqueness (e.g. by using `batch.get_or_create_indexed_node(...)`). So this code only works if you know a priori that your nodes are unique. The problem with `get_or_create_indexed_node` is that the return value [doesn't seem to work](http://stackoverflow.com/q/20010509/1174169) with `set_labels` or `set_properties` (py2neo 1.6.1, Neo4j 2.0.0-RC1). – cod3monk3y Dec 12 '13 at 04:24
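
One possible workaround for the uniqueness issue (a sketch against the py2neo 1.6 API, not tested against that exact version combination): handle uniqueness-sensitive nodes outside the batch with the non-batch get_or_create_indexed_node, which returns a concrete Node that labels can be added to directly:

from py2neo import neo4j

gdb = neo4j.GraphDatabaseService("http://localhost:7474/db/data/")

# Returns the existing node if one is already indexed under this key/value,
# otherwise creates it with the given properties.
alice = gdb.get_or_create_indexed_node("Name", "first_name", "alice",
                                       {"name": "Alice"})
alice.add_labels("Female")  # works on a concrete Node, outside the batch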

Since you're asking for best practices, here is an issue I ran into:

When adding a lot of nodes (~1M) with py2neo in a single batch, my program often got slow or crashed when the Neo4j server ran out of memory. As a workaround, I split the submit into multiple batches:

from py2neo import neo4j

def chunker(seq, size):
    """
    Yield slices of the input list with the given size.
    """
    for pos in xrange(0, len(seq), size):  # Python 2; use range() on Python 3
        yield seq[pos:pos + size]


def submit(graph_db, list_of_elements, size):
    """
    Batch submit lots of nodes, one WriteBatch per chunk.
    """
    for chunk in chunker(list_of_elements, size):

        batch = neo4j.WriteBatch(graph_db)

        for element in chunk:
            n = batch.create(element)
            batch.add_labels(n, 'Label')

        # submit the batch for this chunk
        batch.submit()
        batch.clear()
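
Hypothetical usage, assuming the elements are abstract nodes built with py2neo's node() helper:

from py2neo import neo4j, node

gdb = neo4j.GraphDatabaseService("http://localhost:7474/db/data/")

# Build 10,000 abstract nodes and submit them in chunks of 1,000.
elements = [node(name="term_%d" % i) for i in xrange(10000)]
submit(gdb, elements, 1000)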

I tried this with different chunk sizes. For me, it was fastest with ~1000 nodes per batch. But I guess this depends on the RAM/CPU of your Neo4j server.

Martin Preusse