I'm using the batch inserter to create a database with about 1 billion nodes and 10 billion relationships. I've read in multiple places that it is preferable to sort the relationships by min(from, to) (which I didn't do), but I haven't grasped why this practice is optimal. I originally thought it only helped insertion speed, but when I brought the database up, traversal was very slow. I realize there can be many reasons for that, especially with a database this size, but I want to be able to rule out the way I'm storing relationships.
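For reference, here is a minimal sketch of what I understand the sorted insert to look like against the 1.9 BatchInserter API. The `Rel` record type and `loadRelationships()` are hypothetical stand-ins for however the edge list is read, and with 10 billion edges the sort would realistically have to be external (e.g. sorting the raw edge file on disk first) rather than in-memory as shown:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.unsafe.batchinsert.BatchInserter;
import org.neo4j.unsafe.batchinsert.BatchInserters;

public class SortedBatchInsert {

    // Hypothetical edge record; real input would come from my source files.
    static class Rel {
        final long from, to;
        Rel(long from, long to) { this.from = from; this.to = to; }
    }

    public static void main(String[] args) {
        List<Rel> rels = loadRelationships(); // stand-in for reading the edge list

        // Sort by min(from, to): relationships sharing a low endpoint become
        // neighbors in the relationship store, so a node's relationship chain
        // lands on few disk pages instead of being scattered across the file.
        Collections.sort(rels, new Comparator<Rel>() {
            public int compare(Rel a, Rel b) {
                long ka = Math.min(a.from, a.to);
                long kb = Math.min(b.from, b.to);
                return ka < kb ? -1 : (ka > kb ? 1 : 0);
            }
        });

        BatchInserter inserter = BatchInserters.inserter("/path/to/graph.db");
        RelationshipType KNOWS = DynamicRelationshipType.withName("KNOWS");
        try {
            for (Rel r : rels) {
                // node ids are assumed to already exist in the store
                inserter.createRelationship(r.from, r.to, KNOWS, null);
            }
        } finally {
            inserter.shutdown(); // flushes the store files
        }
    }

    static List<Rel> loadRelationships() {
        return new ArrayList<Rel>(); // placeholder
    }
}
```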
Main question: does it kill traversal speed to insert relationships in a very "random" order because of where they end up on disk? My thinking is that relationship records are laid out in creation order, so when relationships are inserted randomly, a node's relationship chain ends up fragmented across the store file and every hop of a traversal touches a different disk page. I hope someone can enlighten me about whether this would be the case.
UPDATES:
The use case is pretty much the basic Neo4j friends-of-friends example, queried with Cypher via the REST API.
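Concretely, the query is the standard two-hop pattern. A minimal sketch of how I call it over REST is below, assuming the default local Cypher endpoint; the start node id (42) is just a placeholder for whatever node the test picks:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class FriendsOfFriends {
    public static void main(String[] args) throws Exception {
        // Default local Cypher endpoint in 1.9.
        URL url = new URL("http://localhost:7474/db/data/cypher");

        // 1.9-era Cypher: START binds the root node by internal id.
        String query = "START me=node({id}) "
                + "MATCH me-[:KNOWS]->friend-[:KNOWS]->fof "
                + "RETURN DISTINCT fof LIMIT 4";
        String body = "{\"query\": \"" + query + "\", \"params\": {\"id\": 42}}";

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setRequestProperty("Accept", "application/json");
        conn.setDoOutput(true);

        OutputStream out = conn.getOutputStream();
        out.write(body.getBytes("UTF-8"));
        out.close();

        BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), "UTF-8"));
        for (String line; (line = in.readLine()) != null; ) {
            System.out.println(line); // raw JSON result
        }
        in.close();
    }
}
```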
Each node (a person) is unique and has a number of "knows" relationships for the people they know. Although there are a billion nodes, all 10 billion relationships come from about 30 million of them, so any starting node I use in a query has on average about 330 outgoing relationships (10 billion / 30 million ≈ 333).
In my initial tests, even getting 4 unordered friends-of-friends results was incredibly slow (100+ seconds on average). Once the caches were warm for a given query it was fairly quick, but the graph is pretty random and I can't fit the whole relationship store in memory.
Some of my system details, if that's needed:

- Neo4j 1.9.RC1
- Linux server, 128 GB RAM, 8 cores, non-SSD HD