
The Problem

I need to read and write a large number of records (about 1000). The example below takes as long as 20 minutes to write 1000 records, and as long as 12 seconds to read them back (for my "read" tests, I comment out the `do create_notes()` line).

The Source

This is a complete example (that builds and runs). It only prints output to the console (not to the browser).

type User.t =
  { id : int
  ; notes : list(int) // a list of note ids
  }

type Note.t =
  { id : int
  ; uid : int // id of the user this note belongs to
  ; content : string
  }

db /user : intmap(User.t)
db /note : intmap(Note.t)

get_notes(uid:int) : list(Note.t) =
  noteids = /user[uid]/notes
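  // fold over the stored note ids, keeping only the notes that actually exist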
  List.fold(
    (h,acc -> 
      match ?/note[h] with
      | {none} -> acc
      | {some = note} -> [note|acc]
    ), noteids, [])

create_user() =
  match ?/user[0] with
  | {none} -> /user[0] <- {id=0 notes=[]}
  | _ -> void

create_note() =
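  // allocate a fresh note id, store the note, then prepend the id to user 0's note list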
  key = Db.fresh_key(@/note)
  do /note[key] <- {id = key uid = 0 content = "note"}
  noteids = /user[0]/notes
  /user[0]/notes <- [key|noteids]

create_notes() =
  repeat(1000, create_note)

page() =
  do create_user()
  do create_notes()
  do Debug.alert("{get_notes(0)}")
  <>Notes</>

server = one_page_server("Notes", page)

One More Thing

I also tried getting notes via a transaction (shown below). It looks like `Db.transaction` might be the right tool, but I haven't found a way to employ it successfully. I've found this `get_notes_via_transaction` method to be exactly as slow as `get_notes`.

get_notes_via_transaction(uid:int) : list(Note.t) =
  result = Db.transaction( ->
    noteids = /user[uid]/notes
    List.fold(
      (h,acc -> 
        match ?/note[h] with
        | {none} -> acc
        | {some = note} -> [note|acc]
      ), noteids, [])
  )
  match result with
  | {none} -> []
  | ~{some} -> some

Thanks for your help.

Edit: More Details

A little extra info that might be useful:

After more testing I've noticed that writing the first 100 records takes only 5 seconds, but each record takes longer to write than the previous one. By the 500th record, each write takes about 5 seconds.

If I interrupt the program (when it starts feeling slow) and start it again (without clearing the database), it resumes writing records at the same (slow) pace it was at when I interrupted it.

Does that get us closer to a solution?

nrw
  • I think you forgot to include definitions for `User.t` and `Note.t`, so the example does not compile. Well, efficiency is not very good with the built-in solution Opa uses at the moment (although the numbers you cite do look scary). We're hard at work now on properly integrating MongoDB with Opa, which should give you a state-of-the-art DB to work with. We also have a [CouchDB library](http://doc.opalang.org/api/#couchdb.opa.html/!/value_stdlib.apis.couchdb.CouchDb) to work with. – akoprowski Oct 24 '11 at 07:09
  • The type definitions are in there at the top. Stack Overflow limits `code` blocks to a certain height and makes them scroll. Perhaps you need to scroll up in the code block? – nrw Oct 24 '11 at 11:51
  • Also: Is my use of `Db.transaction` correct? And regarding the future use of MongoDB in my Opa apps: Will I need to rewrite my apps to use an external database, and run a MongoDB server? – nrw Oct 24 '11 at 11:59

1 Answer


Nic, this is probably not the answer you were hoping for, but here it is:

  1. For this kind of performance experiment I'd suggest changing the framework; for instance, not using the client at all. I'd replace the code of the create_note function with this:

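    // server-side mutable counter, used only to log progress every 100 inserts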
    counter = Reference.create(0)
    create_note() =
      key = Db.fresh_key(@/note)
      do /note[key] <- {id = key uid = 0 content = "note"}
      noteids = /user[0]/notes
      do Reference.update(counter, _ + 1)
      do /user[0]/notes <- [key|noteids]
      cntr = Reference.get(counter)
      do if mod(cntr, 100) == 0 then
           Log.info("notes", "{cntr} notes created")
         else
           void
      void
    
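    // Server_profiler comes from stdlib.profiler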
    import stdlib.profiler
    
    create_notes() =
      repeat(1000, -> P.execute(create_note, "create_note"))
    
    P = Server_profiler
    
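    // top-level block, run once at server start-up; P.summarize() reports the collected timings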
    _ =
      do P.init()
      do create_user()
      do create_notes()
      do P.execute(-> get_notes(0), "get_notes(0)")
      P.summarize()
    
  2. With intermediate timings being printed for every 100 inserts, you'll quickly see that insert times grow quadratically with the number of inserted items, not linearly. This is because of the list update /user[0]/notes <- [key|noteids], which apparently causes the whole list to be written again (a rough count of this cost is sketched after this list). AFAIK we had optimizations to avoid that, but either I'm wrong or for some reason they do not work here -- I'll try to look into that and will let you know once I know more.

  3. Previously mentioned optimization aside, a better approach to model this data in Opa would be using sets as in the following program:

    type Note.t =
    { id : int
    ; uid : int // id of the user this note belongs to
    ; content : string
    }
    
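    // one record per (user, note) pair; a partial-key read like /user_notes[{user_id=uid}] returns a dbset (see get_notes below)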
    db /user_notes[{user_id; note_id}] : { user_id : int; note_id : int }
    db /note : intmap(Note.t)
    
    get_notes(uid:int) : list(Note.t) =
      add_note(acc : list(Note.t), user_note) =
        note = /note[user_note.note_id]
        [note | acc]
      noteids = /user_notes[{user_id=uid}] : dbset({user_id:int; note_id:int})
      DbSet.fold(noteids, [], add_note)
    
    counter = Reference.create(0)
    
    create_note() =
      key = Db.fresh_key(@/note)
      do /note[key] <- {id = key uid = 0 content = "note"}
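      // experimental write through the virtual /user_notes path -- as noted below, this feature is undocumented and currently blows up on this example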
      do DbVirtual.write(@/user_notes[{user_id=0}], {note_id = key})
      do Reference.update(counter, _ + 1)
      cntr = Reference.get(counter)
      do if mod(cntr, 100) == 0 then
           Log.info("notes", "{cntr} notes created")
         else
           void
      void
    
    import stdlib.profiler
    
    create_notes() =
      repeat(1000, -> Server_profiler.execute(create_note, "create_note"))
    
    _ =
      do Server_profiler.init()
      do create_notes()
      do Server_profiler.execute(-> get_notes(0), "get_notes(0)")
      Server_profiler.summarize()
    

    where you'll see that filling the database takes ~2 seconds. Unfortunately this feature is heavily experimental and hence undocumented and, as you'll see, it indeed explodes on this example.

  4. I'm afraid we don't really plan to improve on (2) and (3), as we realized that providing an in-house DB solution that is up to industrial standards is not very realistic. Therefore, at the moment we're concentrating all our efforts on tight integration of Opa with existing No-SQL databases. We hope to have some good news about that in the coming weeks.
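
A rough back-of-the-envelope count of the cost described in point 2 (my arithmetic, assuming each insert really does rewrite the whole notes list): the k-th insert then touches about k list elements, so n inserts touch roughly

    1 + 2 + ... + n = n(n+1)/2

elements in total. For n = 1000 that is about 500,500 element writes instead of the 1,000 a constant-time append would need, which matches both the roughly quadratic total time and the observation that each record takes longer to write than the previous one.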

I'll try to learn more about this issue from our team and will make corrections if I learn that I missed or got something wrong.

akoprowski
  • Wow. Thank you for such a complete answer. :D I'll start trying to grok your examples right away. – nrw Oct 26 '11 at 12:14
  • The wonders of a hanging bounty :D. But seriously: I'm fully aware that DB is the Achilles' heel of Opa at the moment and I'd not want people to think that we're trying to sweep it under the carpet ;) – akoprowski Oct 26 '11 at 12:20
  • I've spent a lot of time with these examples. Your first suggestion takes just as much time as my original code. Was this running faster than the original code for you? And the second example, as you mentioned, throws an exception. I'm going to try using CouchDB for my storage needs, though I'm taking a hit on usability. I would prefer using a built-in option. Is tight integration with "existing No-SQL databases" a high priority for mlstate? Do you have a timeline? – nrw Oct 29 '11 at 20:43
  • (1) was just a suggestion for a better performance testing framework, not for obtaining better performance. (Tight) integration with existing No-SQL databases is a top priority for us now, but it's a complex topic, so we're somewhat wary of providing timelines (and then missing them). I'll check internally whether we can say anything more definite than that. – akoprowski Oct 30 '11 at 10:58
  • You did make it clear that you were suggesting a better testing framework. My mistake. I understand the reluctance to advertise a timeline. As of right now, is accessing a CouchDB database via `stdlib.apis.couchdb` the most performant way to manage a large number of records from an Opa app? – nrw Oct 30 '11 at 11:27
  • Here's where I'm stuck: I'm inclined to build an app with Opa because of its "instant scalability" properties. However, Opa *seems* to only be "instantly scalable" if you can guarantee your dataset won't need to scale with the rest of your app. If your dataset needs to scale (a common necessity), Opa is scalable like any other tool: scaling out requires setup and maintenance of a broader toolset on more machines. So, in my case, I'm looking at managing Opa plus a set of CouchDB servers. Is Opa intended for use in apps that need scalable data persistence? If so, this is quite a road block. :-/ – nrw Oct 30 '11 at 15:03
  • As of now, yes, I believe interfacing CouchDB is your best bet, and yes, that will mean maintaining a set of servers with Opa *and* with CouchDB. In the future we hope to integrate some external distributed databases with Opa in such a way that the symbiosis works in pretty much the same way as it does now with Opa & its internal DB. Btw. we decided it's time to say more about it publicly: http://blog.opalang.org/2011/11/opas-database-and-where-its-heading.html – akoprowski Nov 01 '11 at 16:40