
What is the fastest way to get all unordered nodes and relationships from a running Neo4j 2.x server into a program?

Cypher MATCH n RETURN n is too slow for my use case (say we have >10M nodes to extract).

The shell command dump seems interesting, but it requires a hack to call from source code. Are there any benchmarks of dump available?

Any advice appreciated!

--EDIT--

I execute the query through the REST endpoint of a local Neo4j server (thus no network effect) with a query like MATCH n RETURN n SKIP 0 LIMIT 50000. My database is Neo4j 2.0.3, populated with 100k nodes with 1 integer property each and no relationships. Computer: SSD with read speed 1.3+ MB/s, CPU i7 1.6 GHz, JVM -Xmx2g. It takes ~4 s to retrieve 50k nodes:

curl -s -w %{time_total} -d"query=match n return n limit 50000" -D- -onul: http://localhost:7474/db/data/cypher

HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8
Access-Control-Allow-Origin: *
Content-Length: 63394503
Server: Jetty(9.0.z-SNAPSHOT)

4,047
Seb
  • How do you execute `match (n) return n`? The tx endpoint should be fast enough; it is limited rather by the disk speed of loading the properties and probably by the network. If you only need the structure, you can use `match (n) return id(n) as ID` – Michael Hunger Jun 24 '14 at 00:53
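To put the figures from the question in perspective, here is a small back-of-the-envelope sketch. It only reuses the numbers quoted above (the Content-Length header, the LIMIT of 50000 and the curl time of 4.047 s); the class and method names are made up for illustration.

```java
import java.util.Locale;

public class ResponseSizeEstimate {

    // Average number of response bytes spent on each returned node.
    static double bytesPerNode(long contentLength, int nodeCount) {
        return (double) contentLength / nodeCount;
    }

    // Effective end-to-end throughput in bytes per second.
    static double bytesPerSecond(long contentLength, double seconds) {
        return contentLength / seconds;
    }

    public static void main(String[] args) {
        long contentLength = 63_394_503L; // Content-Length header from the question
        int nodeCount = 50_000;           // LIMIT used in the query
        double seconds = 4.047;           // curl time_total from the question

        System.out.println(String.format(Locale.ROOT, "bytes per node: %.0f",
                bytesPerNode(contentLength, nodeCount)));
        System.out.println(String.format(Locale.ROOT, "throughput: %.1f MB/s",
                bytesPerSecond(contentLength, seconds) / 1e6));
    }
}
```

That works out to roughly 1.3 KB of JSON per node and about 15.7 MB/s overall, which suggests that most of the payload is node metadata rather than the single integer property. Returning only id(n), as suggested in the comment above, should shrink the response considerably.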

2 Answers


The fastest way to get all nodes is to run Neo4j embedded. The performance degradation you see using the REST API via Cypher is largely due to the data transfer limitations over the network.

Using the method getAllNodes you can retrieve all the nodes in your graph without transferring the data over the network.

http://api.neo4j.org/current/org/neo4j/tooling/GlobalGraphOperations.html

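// Reads in Neo4j 2.x must happen inside a transaction.
try ( Transaction tx = db.beginTx() ) {
    Iterable<Node> allNodes = db.getAllNodes();
    // Iterate over allNodes here, before the transaction closes.
    tx.success();
}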

Note that this method is now deprecated as of 2.1.2.

To learn more about Neo4j embedded, take a look at the documentation.

http://docs.neo4j.org/chunked/stable/tutorials-java-embedded.html

Kenny Bastani
  • I agree this is the correct answer - but to clarify, the API docs you link do not state that this is deprecated. Also, you might want to revise the code; in examples, db usually refers to a GraphDatabaseService object, when to get all nodes, you have to call a static method on GlobalGraphOperations, not "db.getAllNodes()" as you have listed here. – FrobberOfBits Jun 23 '14 at 19:56
  • Thank you for your answer, but I can't run Neo4j embedded. I have a running Neo4j server to deal with. This server might be hosted on the same computer as my program, but not necessarily. Is it possible to access a Neo4j db running on a server (i.e. the disk files) in read only mode using the Java API? That would provide a partial solution to my problem. – Seb Jun 24 '14 at 07:25
  • The data transfer speeds are your only limitation. The data itself is likely being returned in less than 4 seconds, but transferring the data over an HTTP stream will take ~3 seconds. – Kenny Bastani Jun 24 '14 at 17:59

What you want is to enable HTTP chunked encoding (aka streaming), which allows Neo4j to start sending you results without holding them all in memory. You do this by adding the Accept: application/json;stream=true HTTP request header.

This request does the trick:

curl -i -o streamed.txt -XPOST \
  -d'{ "query":"MATCH n RETURN n" }' \
  -H 'accept:application/json;stream=true' \
  -H 'content-type:application/json' \
  'http://localhost:7474/db/data/cypher'
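If you need to issue the same request from a program rather than from curl, a rough Java equivalent could look like the sketch below. It uses only the JDK's HttpURLConnection; the class name and the helper methods copy and payload are illustrative, and the URL, headers and query are simply copied from the curl command. Unlike curl -i, it writes only the response body to the file, not the headers.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class StreamingCypherClient {

    // Copies the stream in fixed-size chunks so the whole response is never held in memory.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
            total += read;
        }
        return total;
    }

    // Builds the JSON request body; only backslashes and double quotes in the query are escaped.
    static String payload(String cypher) {
        String escaped = cypher.replace("\\", "\\\\").replace("\"", "\\\"");
        return "{ \"query\":\"" + escaped + "\" }";
    }

    public static void main(String[] args) throws IOException {
        // Same endpoint and headers as the curl command.
        URL url = new URL("http://localhost:7474/db/data/cypher");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Accept", "application/json;stream=true");
        conn.setRequestProperty("Content-Type", "application/json");

        try (OutputStream body = conn.getOutputStream()) {
            body.write(payload("MATCH n RETURN n").getBytes(StandardCharsets.UTF_8));
        }

        // Write the response to disk as it arrives.
        try (InputStream in = conn.getInputStream();
             OutputStream out = Files.newOutputStream(Paths.get("streamed.txt"))) {
            long bytes = copy(in, out);
            System.out.println("Received " + bytes + " bytes");
        }
    }
}
```

Replacing the file output stream with your own consumer lets you process the data while it is still being received.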

On a side note, if you want to start parsing the response on your side before having received the whole content (to avoid filling up your memory / hard drive), you may want to look into JSON stream parsing.
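To give an idea of what processing the stream incrementally means, here is a deliberately crude sketch that counts result rows while reading one character at a time. It is not a real JSON parser: it only tracks nesting depth and string boundaries. It also assumes the response has the shape { "columns": [...], "data": [ [row], [row], ... ] }, so that each row is an array opening at depth 3; the sample string is hand-written, not captured from a server. For real use you would more likely reach for a streaming parser library.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class DepthScanner {

    // Counts the objects and arrays that open at exactly targetDepth (the root value is depth 1),
    // reading one character at a time without buffering the document.
    static long countContainersAtDepth(Reader json, int targetDepth) throws IOException {
        long count = 0;
        int depth = 0;
        boolean inString = false;
        boolean escaped = false;
        int c;
        while ((c = json.read()) != -1) {
            if (inString) {
                // Brackets inside string values must not affect the depth.
                if (escaped) {
                    escaped = false;
                } else if (c == '\\') {
                    escaped = true;
                } else if (c == '"') {
                    inString = false;
                }
            } else if (c == '"') {
                inString = true;
            } else if (c == '{' || c == '[') {
                depth++;
                if (depth == targetDepth) {
                    count++;
                }
            } else if (c == '}' || c == ']') {
                depth--;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample with two rows; the string values contain brackets on purpose.
        String sample = "{\"columns\":[\"n\"],\"data\":[[{\"a\":\"x]\"}],[{\"a\":\"\\\"[\"}]]}";
        System.out.println(countContainersAtDepth(new StringReader(sample), 3));
    }
}
```

Against a live server you would wrap the HTTP response stream in a BufferedReader over an InputStreamReader with UTF-8 instead of using a StringReader.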

david_p