
I'm trying to export a database to a file and import it again without copying the actual database files or stopping the database. I realize that there are a number of excellent (and performant) neo4j-shell-tools; however, the Neo4j database is remote, and the export-* and import-* commands require the files to reside on the remote server, whereas in my case they reside locally.

The following post explains alternative methods for exporting/importing data; however, the import isn't very performant.

The following examples use a subset of our data store comprising 10,000 nodes with various labels/properties for testing purposes. First, the database was exported via,

> time cypher-shell 'CALL apoc.export.cypher.all("/tmp/graph.db.cypher", {batchSize: 1000, format: "cypher-shell", separateFiles: true})'  
real    0m1.703s

and then wiped,

neo4j stop
rm -rf /var/log/neo4j/data/databases/graph.db
neo4j start

before re-importing,

time cypher-shell < /tmp/graph.db.nodes.cypher
real    0m39.105s

which doesn't seem overly performant. I also tried the Python route, by exporting the Cypher in plain format:

CALL apoc.export.cypher.all("/tmp/graph.db.cypher", {format: "plain", separateFiles: true})

The following snippet ran in ~ 30s (using a batch size of 1,000),

from itertools import izip_longest
from neo4j.v1 import GraphDatabase


with GraphDatabase.driver('bolt://localhost:7687') as driver:
    with driver.session() as session:
        with open('/tmp/graph.db.nodes.cypher') as file:
            # izip_longest pads the final chunk with None, hence the `if line` check
            for chunk in izip_longest(*[file] * 1000):
                # one transaction per chunk of 1,000 statements
                with session.begin_transaction() as tx:
                    for line in chunk:
                        if line:
                            tx.run(line)

I realize that parameterized Cypher queries are more efficient, so I used the following somewhat kludgy logic to extract the labels and properties from the Cypher code (note that the string replace doesn't suffice for all cases). This executed in ~ 8s:

from itertools import izip_longest
import json
from neo4j.v1 import GraphDatabase
import re


def decode(statement):
    # raw string so that \( and \s reach the regex engine unchanged
    m = re.match(r'CREATE \((.*?)\s(.*?)\);', statement)
    labels = m.group(1).replace('`', '').split(':')[1:]
    properties = json.loads(m.group(2).replace('`', '"'))  # kludgy
    return labels, properties


with GraphDatabase.driver('bolt://localhost:7687') as driver:
    with driver.session() as session: 
        with open('/tmp/graph.db.nodes.cypher') as file:
            for chunk in izip_longest(*[file] * 1000):
                with session.begin_transaction() as tx:
                    for line in chunk:
                        if line:
                            labels, properties = decode(line)
                            # run inside the `if` so the None padding is skipped
                            tx.run(
                                'CALL apoc.create.node({labels}, {properties})',
                                labels=labels,
                                properties=properties,
                            )

Using UNWIND to send each batch as a single parameterized statement, rather than running one statement per node inside a transaction, further improves performance to ~ 5s:

with GraphDatabase.driver('bolt://localhost:7687') as driver:
    with driver.session() as session:
        with open('/tmp/graph.db.nodes.cypher') as file:
            for chunk in izip_longest(*[file] * 1000):
                # collect one parameter row per statement, skipping the None padding
                rows = []

                for line in chunk:
                    if line:
                        labels, properties = decode(line)
                        rows.append({'labels': labels, 'properties': properties})

                # a single statement creates the whole batch
                session.run(
                    """
                    UNWIND {rows} AS row
                    WITH row.labels AS labels, row.properties AS properties
                    CALL apoc.create.node(labels, properties) YIELD node
                    RETURN true
                    """,
                    rows=rows,
                )
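
The batching in the snippets above relies on izip_longest, which pads the final chunk with None (hence the `if line` checks) and is only available under that name on Python 2. As a minimal sketch of an alternative, the chunking could be done with itertools.islice instead; the helper name `batches` is hypothetical and not part of any library:

```python
from itertools import islice


def batches(iterable, size):
    """Yield lists of up to `size` items; the last list may be shorter."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            # the iterator is exhausted
            return
        yield chunk
```

With this, `for chunk in batches(file, 1000):` would yield only real lines, so the `if line` checks would no longer be needed.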

Is this the right approach for speeding up a Cypher import? Ideally I would love not to have to do this level of manipulation in Python, in part because it's possibly error-prone and I'll have to do something similar for relationships.

Also, does anyone know the correct approach to decode Cypher to extract the properties? This method fails if there's a back-tick (`) in a property. Note that I don't want to go down the GraphML route, as I also need the schema, which gets exported via the Cypher format. Still, it does feel strange to unpack the Cypher in this way.
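
One possible way to make the decoding less fragile is sketched below. It converts back-tick-quoted identifiers to JSON keys while leaving double-quoted string values untouched, so a back-tick inside a value survives. It assumes each exported line has the shape CREATE (:`Label` {`key`:"value"}); and that a back-tick inside an identifier is escaped by doubling it, and it will still fail for values that are not valid JSON. The names TOKEN, IDENT, STATEMENT and decode_statement are hypothetical:

```python
import json
import re

# Either a double-quoted string literal (group 1 is None) or a
# back-tick-quoted identifier (group 1 holds the identifier body).
TOKEN = re.compile(r'"(?:\\.|[^"\\])*"|`((?:[^`]|``)*)`')
IDENT = re.compile(r'`((?:[^`]|``)*)`')
# Assumed statement shape: CREATE (:`A`:`B` {...});
STATEMENT = re.compile(
    r'CREATE \(((?::`(?:[^`]|``)*`)*)\s*(\{.*\})\);\s*$', re.DOTALL)


def _quote_key(match):
    identifier = match.group(1)
    if identifier is None:
        # a string literal: return it unchanged
        return match.group(0)
    # an identifier: re-quote it as a JSON string
    return json.dumps(identifier.replace('``', '`'))


def decode_statement(statement):
    match = STATEMENT.match(statement)
    if match is None:
        raise ValueError('unrecognised statement: %r' % statement)
    labels = [label.replace('``', '`')
              for label in IDENT.findall(match.group(1))]
    properties = json.loads(TOKEN.sub(_quote_key, match.group(2)))
    return labels, properties
```

This is still a regular-expression workaround rather than a real Cypher parser, so it only narrows the set of failing cases.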

Finally, for reference, the import-binary shell command takes ~ 3s to perform the same import:

> neo4j-shell -c "import-binary -b 1000 -i /tmp/graph.db.bin"
...
finish after 10000 row(s)  10. 100%: nodes = 10000 rels = 0 properties = 106289 time 3 ms total 3221 ms
John