
I'm using Neo4j for the first time, with the Neography gem for Ruby. My data is in CSV files. I can successfully populate the database from my main file, i.e. create all the nodes. For each CSV file (here, user.csv) I'm doing this -

require 'csv'
require 'neography'

$persons = {}

def create_person(name, id)
  Neography::Node.create("name" => name, "id" => id)
end

CSV.foreach('user.csv', :headers => true) do |row|
  id   = row[0].to_i
  name = row[1]
  $persons[id] = create_person(name, id)
end

Likewise for the other files. There are two issues. First, if my files are very small it works fine, but when the files are slightly bigger (I'm dealing with four 1 MB files) I get -

SocketError: Too many open files (http://localhost:7474)

The second issue is that I don't want to repopulate the database every time I run this Ruby file. I want to populate the data once, then leave the database alone and only run queries against it. Can anyone tell me how to populate it and save it, and then how to load it whenever I want to use it? Thank you.

theharshest

4 Answers


It sounds as if you are running these requests in parallel or not reusing HTTP connections.

Did you try @neo = Neography::Rest.new and @neo.create_node({...})? I think that one reuses the HTTP connection.
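
For instance, a minimal sketch of the question's loop rewritten to go through a single Neography::Rest client (the user.csv layout and the $persons hash come from the question):

    require 'csv'
    require 'neography'

    @neo = Neography::Rest.new   # one shared client, reused for every request
    $persons = {}

    CSV.foreach('user.csv', :headers => true) do |row|
      # each create_node call goes through the shared client instead of
      # opening a fresh socket per node
      $persons[row[0].to_i] = @neo.create_node("name" => row[1], "id" => row[0].to_i)
    end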

Michael Hunger

Create a @neo client:

  @neo = Neography::Rest.new

Create a queue:

  @queue = []

Make use of the BATCH api for data loading.

def create_person(name, id)
  @queue << [:create_node, {"name" => name, "id" => id}]
  if @queue.size >= 500
    # flush every 500 queued operations as one batch request
    batch_results = @neo.batch *@queue
    @queue = []
    batch_results.each do |result|
      id = result["body"]["self"].split('/').last
      $persons[id] = result
    end
  end
end

Run through your CSV file:

CSV.foreach('user.csv', :headers => true) do |row|
  create_person(row[1], row[0].to_i)
end

Get the leftovers:

    unless @queue.empty?
      batch_results = @neo.batch *@queue
      batch_results.each do |result|
        id = result["body"]["self"].split('/').last
        $persons[id] = result
      end
    end

An example of data loading via the REST API can be seen here => https://github.com/maxdemarzi/neo_crunch/blob/master/neo_crunch.rb

An example of using a queue for writes can be seen here => http://maxdemarzi.com/2013/09/05/scaling-writes/

Max De Marzi

Are you running the whole import in one big transaction? Try splitting it up into transactions of, say, 10k nodes each. Even so, you shouldn't be running into "too many open files". If you run lsof (a terminal command) while the import is going, can you see which files are open?

Data that has been committed stays persisted in a Neo4j database. I suspect the import fails with this error and nothing ends up imported because the whole import runs in one big transaction.
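
As a rough sketch of that idea with Neography, you could commit the CSV in chunks of 10,000 rows, each chunk as one batch request (the file name and column layout are taken from the question; the chunk size is just the suggestion above):

    require 'csv'
    require 'neography'

    @neo = Neography::Rest.new

    rows = CSV.read('user.csv', :headers => true)
    rows.each_slice(10_000) do |chunk|
      ops = chunk.map { |row| [:create_node, {"name" => row[1], "id" => row[0].to_i}] }
      # each batch call is a single request, so no one transaction grows unbounded
      @neo.batch *ops
    end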

Mattias Finné

Remember that you can back up your Neo4j database once everything has been written. This is handy when it takes a long time to populate the database and you're doing testing. Just make a copy of the /data/graph.db folder.
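
A minimal sketch of that copy step in Ruby (the Neo4j install path below is an assumption; point it at wherever your server actually lives, and stop the server first so the store files are consistent):

    require 'fileutils'

    neo4j_home = '/usr/local/neo4j'   # assumed install location, adjust as needed
    src  = File.join(neo4j_home, 'data', 'graph.db')
    dest = File.join(neo4j_home, 'data', "graph.db.backup-#{Time.now.strftime('%Y%m%d')}")
    FileUtils.cp_r(src, dest)         # copy the whole store folder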

firefly2442