
I'm using a batch inserter to create a database with about 1 billion nodes and 10 billion relationships. I've read in multiple places that it is preferable to sort the relationships in order of min(from, to) (which I didn't do), but I haven't grasped why this practice is optimal. I originally thought it only aided insertion speed, but when I turned the database on, traversal was very slow. I realize there can be many reasons for that, especially with a database this size, but I want to be able to rule out the way I'm storing relationships.

Main question: does it kill traversal speed to insert relationships in a very "random" order because of where they will be stored on disk? I'm thinking that maybe when it tries to traverse nodes, the relationships are too fragmented. I hope someone can enlighten me about whether this would be the case.
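For context, below is a minimal sketch of what I understand the "sorted" insertion to look like with the 1.9 BatchInserter API. This is not my actual code: the KNOWS type, the tiny rels array, the store path, and the mapped-memory values are all placeholders, and with 10 billion relationships the sort would obviously have to be done externally.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.unsafe.batchinsert.BatchInserter;
import org.neo4j.unsafe.batchinsert.BatchInserters;

public class SortedRelationshipInsert {

    private static final RelationshipType KNOWS = DynamicRelationshipType.withName("KNOWS");

    public static void main(String[] args) {
        // Each entry is {fromNodeId, toNodeId}. The node ids are assumed to have been
        // created earlier with inserter.createNode().
        long[][] rels = { {5, 2}, {1, 7}, {3, 1} };

        // The recommended ordering: sort by min(from, to) so relationships touching the
        // same node end up next to each other in the relationship store file.
        Arrays.sort(rels, new Comparator<long[]>() {
            public int compare(long[] a, long[] b) {
                return Long.compare(Math.min(a[0], a[1]), Math.min(b[0], b[1]));
            }
        });

        // Placeholder memory-mapped I/O settings for the insert.
        Map<String, String> config = new HashMap<String, String>();
        config.put("neostore.nodestore.db.mapped_memory", "4G");
        config.put("neostore.relationshipstore.db.mapped_memory", "16G");

        BatchInserter inserter = BatchInserters.inserter("/path/to/graph.db", config);
        try {
            for (long[] rel : rels) {
                inserter.createRelationship(rel[0], rel[1], KNOWS, null);
            }
        } finally {
            inserter.shutdown();
        }
    }
}
```

In my actual run I skipped the sort step and inserted the relationships in essentially the order they arrived.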

UPDATES:

  • Use-case is pretty much the basic Neo4j friends-of-friends example, using Cypher via the REST API for querying (a rough sketch of the call is shown after this list).

  • Each node (person) is unique and has a bunch of "knows" relationships for the people they know. Although I have a billion nodes, all of the 10 billion relationships come from about 30 million of the nodes. So any starting node I use in my query has an average of about 330 relationships coming from it.

  • In my initial tests, even getting 4 non-ordered friends-of-friends results was incredibly slow (100+ seconds on average). Of course, once the cache was warmed up for a given query it was fairly quick, but the graph is pretty random and I can't fit the whole relationship store in memory.
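To be concrete about the use-case, the call looks roughly like the sketch below. This is a simplified, hypothetical version, not my exact code: the host, node id, and LIMIT are placeholders.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class FriendsOfFriendsQuery {

    public static void main(String[] args) throws Exception {
        // Legacy Cypher endpoint of a Neo4j 1.9 server; host/port are placeholders.
        URL url = new URL("http://localhost:7474/db/data/cypher");

        // 1.9-era Cypher: START from a node id, expand two KNOWS hops.
        // The node id and the LIMIT are placeholders.
        String body = "{ \"query\": \"START p=node({id}) "
                + "MATCH p-[:KNOWS]->f-[:KNOWS]->fof "
                + "RETURN DISTINCT fof LIMIT 4\", "
                + "\"params\": { \"id\": 12345 } }";

        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setRequestProperty("Accept", "application/json");
        conn.setDoOutput(true);

        OutputStream out = conn.getOutputStream();
        out.write(body.getBytes("UTF-8"));
        out.close();

        // Print the raw JSON response.
        BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream(), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);
        }
        in.close();
    }
}
```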

Some of my system details, if that's needed:

  • Neo4j 1.9.RC1
  • Linux server, 128 GB RAM, 8 cores, non-SSD HD

David Fox

1 Answer


I have not worked with Neo4j at such a large scale, but as far as I know this won't make much difference to speed. Could you provide any links which state that the order of insertion matters?

What matters in this case is whether the relationships are cached or not. Until the cache is fairly well populated, performance will be on the slower side. You should also set an appropriate cache size as soon as the index is created.
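For example, something along these lines in conf/neo4j.properties. The setting names are the 1.9 cache and memory-mapped I/O settings, but the sizes below are placeholders you would have to tune to your store file sizes and available RAM:

```
# object cache for node/relationship objects on the heap
cache_type=soft

# memory-mapped I/O for the store files (off-heap)
neostore.nodestore.db.mapped_memory=2G
neostore.relationshipstore.db.mapped_memory=40G
neostore.propertystore.db.mapped_memory=4G
neostore.propertystore.db.strings.mapped_memory=2G
```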

You should read this link regarding Neo4j performance.

Read the Neo4j documentation on batch insertion and these SO questions on bulk insertion, if you haven't already.

Aditya
  • Thanks. My main source is just from reading group postings around the web. Michael Hunger seems to be one of the biggest Neo4j experts, and I've seen him suggest it in numerous places, such as this one: http://grokbase.com/t/gg/neo4j/12akrnzpzx/performance-problem-with-batchinsert (see Hunger's post). As for caching the relationships: I realize that's very important, but no matter what I do, all the relationships will never fit in memory. That's why disk reading is very important for me, and why I want to know whether relationship creation order comes into play on disk reads. – David Fox May 05 '13 at 07:11
  • Michael Hunger **is** an expert on Neo4j, so if he's said it, it's definitely worth a try. But the post talks about batch inserts only, nothing about reads. In any case, he says increasing memory-mapped I/O (mmio) helped, so maybe you should try that. – Aditya May 06 '13 at 01:50
  • You're correct, he doesn't say anything specifically about reading. The only thing I was able to find in reference to reading and fragmentation is this: https://groups.google.com/forum/?fromgroups=#!topic/neo4j/6s2KKmkAAX8 so it seems it's an actual problem; I'm just not sure whether my case is where it comes into play. As for giving more memory to the relationship store, I don't think I can, as I'm pretty much maxing out all my memory for the caches already. – David Fox May 06 '13 at 02:21
  • 1
    I don't think this applies to you, with 1 billion nodes, and 10 billion relationships, you will have around 20 relationships per node (10 incoming and 10 outgoing) on average. Most of that discussion is related to densely populated nodes. Fetching and iterating linearly over approx 20 nodes for friends, or 100 (20 * 20) for friends of friends should be a fairly quick operation always.. – Aditya May 06 '13 at 02:42
  • Hmm, interesting. My node/rel count is a little misleading. I updated the question with some more information on the structure. Summary: even though I have a billion nodes/10 billion rels, all the rels come from 30 million of the nodes. So the average is much closer to 330 relationships per starting node. – David Fox May 06 '13 at 02:55
  • @DavidFox - could you get any answer to your problem? – Aditya Jun 04 '13 at 12:08