I am building a data-intensive Python application based on Neo4j, and for performance reasons I need to create/retrieve several nodes and relationships during each transaction. Is there an equivalent of SQLAlchemy's session.commit() in Bulbs?

Edit:

For those interested, an interface to Bulbs has been developed that implements this function natively and otherwise works pretty much like SQLAlchemy: https://github.com/chefjerome/graphalchemy

chiffa
  • You can also look into the Neo4j REST-APIs batch-rest-operation mode which executes multiple commands in a transaction or the new transactional http endpoint which allows to use transactions across multiple http requests, see http://jexp.de/blog/2013/05/on-importing-data-in-neo4j-blog-series/ – Michael Hunger May 27 '13 at 06:50
  • Thanks for the response, but it won't work for me. I had implemented a pretty decent transactional batch insert with the native neo4j python api (you can have a look at it [here](https://github.com/chiffa/var/blob/master/Neo4j_transactional_batch_insert.py)). The problem is that the database is very likely to grow way over 32 × 10^9 nodes (bioinformatics) and I was looking to build something that could be ported to Titan GraphDB later. – chiffa May 27 '13 at 15:34

1 Answer

The most performant way to execute a multi-part transaction is to encapsulate the transaction in a Gremlin script and execute it as a single request.
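To make the point concrete, here is a minimal sketch of the single-request pattern. The client class below is a stand-in for Bulbs' Gremlin client (not the real API); the point is that every graph operation in the script travels to the server in one request, so the server can wrap them all in one transaction:

```python
class StubGremlinClient:
    """Stand-in for a Gremlin client: sends one script per server call."""
    def __init__(self):
        self.requests_made = 0

    def gremlin(self, script, params):
        # One server round-trip, no matter how many graph
        # operations the script performs internally.
        self.requests_made += 1
        return {"script": script, "params": params}

client = StubGremlinClient()

# All the creates/updates are bundled into the script's params,
# so a multi-part transaction costs a single request.
result = client.gremlin("save_blog_entry", {"author_id": 42})
print(client.requests_made)  # -> 1
```

Compare this with issuing each create/update as its own call: you would pay a round-trip per operation and lose the ability to make them atomic.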

Here's an example of how to do it -- it's from an example app I worked up last year for the Neo4j Heroku Challenge.

The project is called Lightbulb: https://github.com/espeed/lightbulb

The README describes what it does...

What is Lightbulb?

Lightbulb is a Git-powered, Neo4j-backed blog engine for Heroku written in Python.

You get to write blog entries in Emacs (or your favorite text editor) and use Git for version control, without giving up the features of a dynamic app.

Write blog entries in ReStructuredText, and style them using your website's templating system.

When you push to Heroku, the entry metadata will be automatically saved to Neo4j, and the HTML fragment generated from the ReStructuredText source file will be served off disk.

However, Neo4j quit offering Gremlin on their free/test Heroku Add On so Lightbulb won't work for new Neo4j/Heroku users.

Within the next year -- before the TinkerPop book comes out -- TinkerPop will release a Rexster Heroku Add On with full Gremlin support so people can run their projects on Heroku as they work their way through the book.

But for right now, you don't need to concern yourself with running the app -- all the relevant code is contained within these two files -- the Lightbulb app's model file and its Gremlin script file:

https://github.com/espeed/lightbulb/blob/master/lightbulb/model.py
https://github.com/espeed/lightbulb/blob/master/lightbulb/gremlin.groovy

model.py provides an example for building custom Bulbs models and a custom Bulbs Graph class.

gremlin.groovy contains a custom Gremlin script that the custom Entry model executes -- this Gremlin script encapsulates the entire multi-part transaction so that it can be executed as a single request.

Notice in the model.py file above that I customize EntryProxy by overriding the create() and update() methods, instead defining a single save() method that handles both creates and updates.

To hook the custom EntryProxy into the Entry model, I simply override the Entry model's get_proxy_class method so that it returns the EntryProxy class instead of the default NodeProxy class.
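The override pattern looks roughly like this. The base classes here are simplified stand-ins for illustration, not Bulbs' real Node/NodeProxy classes:

```python
class NodeProxy:
    """Simplified stand-in for the default proxy class."""
    pass

class EntryProxy(NodeProxy):
    """Custom proxy: a single save() replaces create()/update()."""
    def save(self, data):
        return "saved %r" % (data,)

class Node:
    @classmethod
    def get_proxy_class(cls):
        # Default models return the generic proxy.
        return NodeProxy

class Entry(Node):
    @classmethod
    def get_proxy_class(cls):
        # Hook the custom proxy into the model.
        return EntryProxy

proxy = Entry.get_proxy_class()()
print(proxy.save({"title": "Test"}))
```

The model never needs to know it is using a custom proxy; the override is the only hook required.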

Everything else in the Entry model is designed around building up the data for the save_blog_entry Gremlin script (defined in the gremlin.groovy file above).

Notice in gremlin.groovy that the save_blog_entry() method is long and contains several closures. You could define each closure as an independent method and execute them with multiple Python calls, but then you'd have the overhead of making multiple server requests, and since the requests would be separate, there would be no way to wrap them all in a transaction.

By using a single Gremlin script, you combine everything into a single transactional request. This is much faster, and it's transactional.

You can see how the entire script is executed in the final line of the Gremlin method:

return transaction(save_blog_entry);

Here I'm simply wrapping a transaction closure around all the commands in the internal save_blog_entry closure. Making a separate transaction closure keeps the code isolated and is much cleaner than embedding the transaction logic in the other closures.

Then if you look at the code in the internal save_blog_entry closure, it's just calling the other closures I defined above, using the params I passed in from Python when I called the script in the Entry model:

def _save(self, _data, kwds):
    script = self._client.scripts.get('save_blog_entry')
    params = self._get_params(_data, kwds)
    result = self._client.gremlin(script, params).one() 

The params I pass in are built up in the model's custom _get_params() method:

def _get_params(self, _data, kwds):
    params = dict()

    # Get the property data, regardless of how it was entered
    data = build_data(_data, kwds)

    # Author
    author = data.pop('author')
    params['author_id'] = cache.get("username:%s" % author)

    # Topic Tags
    tags = (tag.strip() for tag in data.pop('tags').split(','))
    topic_bundles = []
    for topic_name in tags:
        #slug = slugify(topic_name)
        bundle = Topic(self._client).get_bundle(name=topic_name)
        topic_bundles.append(bundle)
    params['topic_bundles'] = topic_bundles


    # Entry
    # clean off any extra kwds that aren't defined as an Entry Property
    desired_keys = self.get_property_keys()
    data = extract(desired_keys, data)
    params['entry_bundle'] = self.get_bundle(data)

    return params

Here's what _get_params() is doing...

build_data(_data, kwds) is a function defined in bulbs.element: https://github.com/espeed/bulbs/blob/master/bulbs/element.py#L959

It simply merges the args in case the user entered some as positional args and some as keyword args.
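In effect it behaves like the simplified re-implementation below (for illustration only; the real function lives in bulbs.element, and I'm assuming here that keyword args win on key collisions):

```python
def build_data(_data, kwds):
    """Merge dict-style and keyword-style property data into one dict."""
    data = dict(_data or {})
    data.update(kwds)  # assumed precedence: keyword args override _data
    return data

print(build_data({"title": "Test"}, {"docid": "42"}))
# -> {'title': 'Test', 'docid': '42'}
```

Either calling style (a single dict, or name/value pairs) ends up as one flat dict of property data.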

The first param I pass into _get_params() is author, the author's username, but I don't pass the username to the Gremlin script; I pass the author_id. The author_id is cached, so I use the username to look up the author_id and set it as a param, which I later pass to the Gremlin save_blog_entry script.

Then I create Topic Model objects for each blog tag that was set, and I call get_bundle() on each and save them as a list of topic_bundles in params.

The get_bundle() method is defined in bulbs.model: https://github.com/espeed/bulbs/blob/master/bulbs/model.py#L363

It simply returns a tuple containing the data, index_name, and index keys for the model instance:

def get_bundle(self, _data=None, **kwds):
    """
    Returns a tuple containing the property data, index name, and index keys.

    :param _data: Data that was passed in via a dict.
    :type _data: dict

    :param kwds: Data that was passed in via name/value pairs.
    :type kwds: dict

    :rtype: tuple

    """
    self._set_property_defaults()   
    self._set_keyword_attributes(_data, kwds)
    data = self._get_property_data()
    index_name = self.get_index_name(self._client.config)
    keys = self.get_index_keys()
    return data, index_name, keys

I added the get_bundle() method to Bulbs to provide a nice and tidy way of bundling params together so your Gremlin script doesn't get overrun with a ton of args in its signature.

Finally, for Entry, I simply create an entry_bundle and store it as the param.

Notice that _get_params() returns a dict of three params: author_id, topic_bundles, and entry_bundle.

This params dict is passed directly to the Gremlin script:

def _save(self, _data, kwds):
    script = self._client.scripts.get('save_blog_entry')
    params = self._get_params(_data, kwds)
    result = self._client.gremlin(script, params).one()        
    self._initialize(result)

And the Gremlin script has the same arg names as those passed in by params:

def save_blog_entry(entry_bundle, author_id, topic_bundles) {

   // Gremlin code omitted for brevity 

}

The params are then simply used in the Gremlin script as needed -- nothing special going on.

So now that I've created my custom model and Gremlin script, I build a custom Graph object that encapsulates all the proxies and the respective models:

class Graph(Neo4jGraph):

    def __init__(self, config=None):
        super(Graph, self).__init__(config)

        # Node Proxies
        self.people = self.build_proxy(Person)
        self.entries = self.build_proxy(Entry)
        self.topics = self.build_proxy(Topic)

        # Relationship Proxies
        self.tagged = self.build_proxy(Tagged)
        self.author = self.build_proxy(Author)

        # Add our custom Gremlin-Groovy scripts
        scripts_file = get_file_path(__file__, "gremlin.groovy")
        self.scripts.update(scripts_file)

You can now import Graph directly from your app's model.py and instantiate the Graph object like normal.

>>> from lightbulb.model import Graph
>>> g = Graph()
>>> data = dict(username='espeed', tags=['gremlin','bulbs'], docid='42', title="Test")
>>> g.entries.save(data)         # execute transaction via Gremlin script

Does that help?

espeed
  • Thanks for the reply, it is actually pretty cool, even if I thought I could avoid Gremlin for a while. I'll try to implement a similar approach to yours; it might actually be the simplest thing to do for me. – chiffa May 27 '13 at 15:53
  • Gremlin is pretty cool, and it doesn't take that long to get the hang of it. The Gremlin examples above are in Gremlin-Groovy (the original), and that's what most people use. And Gremlin is what Titan uses, so if you are eventually going to migrate to it like you indicated above, then starting with Gremlin would make things simple. – espeed May 27 '13 at 19:01
  • James, would you mind explaining a little bit more what you are doing in the Gremlin script? I have particular trouble seeing why you need the "index_key" argument in "create_or_update_vertex" and "get_and_create_vertex" – chiffa Jun 22 '13 at 01:32
  • Hi Andrei - in Neo4j property values aren't guaranteed to be unique across all vertices. Bulbs creates a separate index for each model, and all properties are indexed by default. The `create_or_update_vertex` and `get_and_create_vertex` closures are looking up the property name in the model's index to see if it exists. The property name (index_key) is being used like a unique primary key for the model's index. These two closures update or return the vertex if a vertex exists for the index_key/value pair, otherwise they create the vertex. Note I wrapped all of this in the transaction closure. – espeed Jul 16 '13 at 07:44