
I have studied how durability is achieved in databases, and if I understand correctly, it works like this (simplified):

Client's point of view:

  1. start transaction.
  2. insert into table values...
  3. commit transaction

DB engine's point of view (a code sketch follows the list):

  1. write transaction start indicator to log file
  2. write changes done by client to log file
  3. write transaction commit indicator to log file
  4. flush log file to HDD (this ensures durability of data)
  5. return 'OK' to client
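
A minimal sketch of steps 1-5 in code; the function names here are hypothetical, not any real engine's API:

    /* Minimal sketch of steps 1-5 above. The function names are
       hypothetical, not any real engine's API. */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    static int log_fd;

    /* Append one record to the log file (buffered by the OS until fsync). */
    static int log_append(const char *rec)
    {
        size_t len = strlen(rec);
        return write(log_fd, rec, len) == (ssize_t)len ? 0 : -1;
    }

    static int commit_transaction(void)
    {
        if (log_append("BEGIN\n") != 0)       /* 1. transaction start indicator  */
            return -1;
        if (log_append("INSERT ...\n") != 0)  /* 2. changes done by the client   */
            return -1;
        if (log_append("COMMIT\n") != 0)      /* 3. transaction commit indicator */
            return -1;
        if (fsync(log_fd) != 0)               /* 4. flush the log file to HDD    */
            return -1;
        return 0;                             /* 5. only now return 'OK'         */
    }

    int main(void)
    {
        log_fd = open("txlog", O_WRONLY | O_CREAT | O_APPEND, 0644);
        if (log_fd < 0)
            return 1;
        return commit_transaction() == 0 ? 0 : 1;
    }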

What I observed:

The client application is a single-threaded application (one DB connection). I'm able to perform 400 transactions/sec, while a simple test that writes something to a file and then fsyncs that file to the HDD achieves only 150 syncs/sec. If the client were multithreaded/multi-connection, I would imagine that the DB engine groups transactions and does one fsync per several transactions, but this is not the case here.
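
For reference, the kind of write+fsync test I mean looks roughly like this (a minimal sketch; the file name and iteration count are arbitrary):

    /* Minimal sketch of the write+fsync test mentioned above: how many
       synchronous log-style writes per second can this disk sustain? */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("synctest", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        const char rec[] = "commit record\n";
        const int n = 1000;                 /* arbitrary iteration count */

        time_t start = time(NULL);
        for (int i = 0; i < n; i++) {
            if (write(fd, rec, sizeof rec - 1) < 0) { perror("write"); return 1; }
            if (fsync(fd) != 0) { perror("fsync"); return 1; }  /* one sync per 'commit' */
        }
        time_t elapsed = time(NULL) - start;

        printf("%d syncs in %ld s (~%ld syncs/sec)\n",
               n, (long)elapsed, elapsed > 0 ? n / (long)elapsed : (long)n);
        close(fd);
        return 0;
    }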

My question is whether, for example, MsSQL really synchronizes the log file (fsync, FlushFileBuffers, etc.) on every transaction commit, or whether there is some other kind of magic behind it.

Kamil_H
  • Did you observe this with MySQL, MyISAM or InnoDB databases? – osgx Mar 26 '11 at 11:59
  • No, only MsSQL and Oracle. I mean, I haven't tried others. I also tried SQLite, but it behaves as expected (commits cause a real flush of the log file to the HDD, and performance is low, as expected) – Kamil_H Mar 26 '11 at 12:17

1 Answer


The short answer is that, for a transaction to be durable, the log file has to be written to stable storage before changes to the database are written to disk.
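
Here's a minimal sketch of that ordering rule, with hypothetical names (a real engine is considerably more involved):

    /* Sketch of the write-ahead rule described above. All names are
       hypothetical, not any real engine's API. */
    #include <stdio.h>

    typedef unsigned long lsn_t;       /* log sequence number */

    struct page {
        lsn_t last_change_lsn;         /* newest log record touching this page */
        /* ... page contents ... */
    };

    static lsn_t flushed_lsn = 0;      /* log is known durable up to here */

    /* Assumption: forces the log to stable storage, e.g. via fsync. */
    static void log_flush_upto(lsn_t lsn)
    {
        if (lsn > flushed_lsn) {
            /* write out the log buffer, then fsync the log file ... */
            flushed_lsn = lsn;
        }
    }

    static void write_data_page(struct page *p)
    {
        /* The invariant: the log covering this page's changes must reach
           stable storage before the page itself is written. */
        log_flush_upto(p->last_change_lsn);
        /* ... now it is safe to write the page to the data file ... */
    }

    int main(void)
    {
        struct page p = { .last_change_lsn = 42 };
        write_data_page(&p);
        printf("log flushed up to LSN %lu before the page write\n", flushed_lsn);
        return 0;
    }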

Stable storage is more complicated than you might think. Disks, for example, are not usually considered to be stable storage. (Not by people who write code for transactional database engines, anyway.)

To see how a particular open source DBMS writes to stable storage, you'll need to read the source code. PostgreSQL's source code is online (the relevant file is xlog.c). I don't know about the MySQL source.

Mike Sherrill 'Cat Recall'
  • OK, I've simplified the problem. I understand that simply fsync-ing a file is not writing to 'stable storage' in terms of atomicity of the operation, and a more sophisticated algorithm has to be introduced. But this algorithm is one level above fsync (it writes multiple HDD sectors and so on). By writing 'fsync' in my example I mean "the log file is permanently stored on the HDD and is consistent". – Kamil_H Mar 26 '11 at 12:59
  • After a rough scan of xlog.c I see that there are a few modes of flushing the log file to the HDD, but the default is fsync. So I think PostgreSQL's transactions/sec would be limited to roughly the fsync/sec rate. But I observe something else on MsSQL and Oracle, so that's why I'm asking how it is done there. – Kamil_H Mar 26 '11 at 13:06