I'm trying to export a database to a file and import it again without copying the actual database files or stopping the database. I realize that there are a number of excellent (and performant) neo4j-shell-tools; however, the Neo4j database is remote, and the export-* and import-* commands require the files to reside on the remote host, whereas in my case they reside locally.
The following post explains alternative methods for exporting/importing data; however, the import isn't particularly performant.
The following examples use a subset of our data store, comprising 10,000 nodes with various labels/properties, for testing purposes. First, the database was exported via,
> time cypher-shell 'CALL apoc.export.cypher.all("/tmp/graph.db.cypher", {batchSize: 1000, format: "cypher-shell", separateFiles: true})'
real 0m1.703s
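Since separateFiles: true is specified, APOC splits the export into separate per-section files alongside the given path, with the nodes, relationships, schema, and cleanup statements each in their own file (hence the /tmp/graph.db.nodes.cypher imported below), something like,

> ls /tmp/graph.db.*.cypher
/tmp/graph.db.cleanup.cypher
/tmp/graph.db.nodes.cypher
/tmp/graph.db.relationships.cypher
/tmp/graph.db.schema.cypher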
The database was then wiped,
neo4j stop
rm -rf /var/log/neo4j/data/databases/graph.db
neo4j start
before re-importing,
time cypher-shell < /tmp/graph.db.nodes.cypher
real 0m39.105s
which doesn't seem particularly performant. I also tried the Python route, exporting the Cypher in plain format:
CALL apoc.export.cypher.all("/tmp/graph.db.cypher", {format: "plain", separateFiles: true})
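Each line of the resulting nodes file is a standalone CREATE statement with back-ticked labels and property keys, along the lines of (a made-up example rather than a line from the actual data),

CREATE (:`Person`:`Actor` {`name`:"Tom Hanks", `born`:1956});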
The following snippet ran in ~ 30s (using a batch size of 1,000),
from itertools import izip_longest

from neo4j.v1 import GraphDatabase

with GraphDatabase.driver('bolt://localhost:7687') as driver:
    with driver.session() as session:
        with open('/tmp/graph.db.nodes.cypher') as file:
            # Run the statements in batches of 1,000 lines per transaction.
            for chunk in izip_longest(*[file] * 1000):
                with session.begin_transaction() as tx:
                    for line in chunk:
                        if line:
                            tx.run(line)
I realize that parameterized Cypher queries are more optimal, so I used the following somewhat kludgy logic (note the string replace isn't sufficient for all cases) to extract the labels and properties from the Cypher code, which executed in ~8s:
from itertools import izip_longest
import json
import re

from neo4j.v1 import GraphDatabase

def decode(statement):
    # Split an exported CREATE statement into its labels and properties.
    m = re.match(r'CREATE \((.*?)\s(.*?)\);', statement)
    labels = m.group(1).replace('`', '').split(':')[1:]
    properties = json.loads(m.group(2).replace('`', '"'))  # kludgy
    return labels, properties

with GraphDatabase.driver('bolt://localhost:7687') as driver:
    with driver.session() as session:
        with open('/tmp/graph.db.nodes.cypher') as file:
            for chunk in izip_longest(*[file] * 1000):
                with session.begin_transaction() as tx:
                    for line in chunk:
                        if line:
                            labels, properties = decode(line)
                            tx.run(
                                'CALL apoc.create.node({labels}, {properties})',
                                labels=labels,
                                properties=properties,
                            )
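For concreteness, this is what decode yields for the made-up sample statement from earlier,

line = 'CREATE (:`Person`:`Actor` {`name`:"Tom Hanks", `born`:1956});'
labels, properties = decode(line)
# labels     -> ['Person', 'Actor']
# properties -> {'name': 'Tom Hanks', 'born': 1956}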
Using UNWIND to batch each chunk into a single statement, rather than running one statement per line inside a transaction, further improves performance to ~5s:
with GraphDatabase.driver('bolt://localhost:7687') as driver:
    with driver.session() as session:
        with open('/tmp/graph.db.nodes.cypher') as file:
            for chunk in izip_longest(*[file] * 1000):
                rows = []
                for line in chunk:
                    if line:
                        labels, properties = decode(line)
                        rows.append({'labels': labels, 'properties': properties})
                session.run(
                    """
                    UNWIND {rows} AS row
                    WITH row.labels AS labels, row.properties AS properties
                    CALL apoc.create.node(labels, properties) YIELD node
                    RETURN true
                    """,
                    rows=rows,
                )
Is this the right approach for speeding up a Cypher import? Ideally I would love not to have to do this level of manipulation in Python, in part because it's possibly error-prone and I'll have to do something similar for relationships.
Also, does anyone know the correct approach for decoding Cypher to extract the properties? The method above fails if there's a back-tick (`) in a property. Note that I don't want to go down the GraphML route, as I also need the schema, which gets exported via the Cypher format. Though it does feel strange to unpack the Cypher in this way.
Finally, for reference, the import-binary shell command takes ~3s to perform the same import:
> neo4j-shell -c "import-binary -b 1000 -i /tmp/graph.db.bin"
...
finish after 10000 row(s) 10. 100%: nodes = 10000 rels = 0 properties = 106289 time 3 ms total 3221 ms