
I have a large (long) table of data stored in SQLite, potentially 5m+ rows. I am using the System.Data.SQLite package to execute my query and read the data into a bespoke in-memory collection structure in the regular ADO.NET way.

CODE (F#)

open System.Data.SQLite

// The Data record is assumed; the original post does not show its definition.
type Data = { X : int; Y : int; Z : int; AAA : float }

use cnxn = new SQLiteConnection(@"Data Source=C:\Temp\test.db;Version=3;Read Only=True;")
cnxn.Open()

let data = ResizeArray<Data>()

// 'use' (rather than 'let') ensures the command and reader are disposed.
use cmd = new SQLiteCommand(@"SELECT X, Y, Z, AAA FROM Data", cnxn)
use reader = cmd.ExecuteReader()

// Materialise every row via the typed ordinal getters.
while reader.Read() do
    let d = { X = reader.GetInt32(0); Y = reader.GetInt32(1)
              Z = reader.GetInt32(2); AAA = reader.GetDouble(3) }
    data.Add(d)
cnxn.Close()

Questions

  1. Is System.Data.SQLite the most performant library to be using for the job here? I am only using it because it appears to be the standard

  2. Is there a better way to code this up?

  3. Are there any settings/configurations on the database itself that would help this scenario?
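
For question 3, pragmas are the sort of setting I have in mind. As an untested sketch (the values are guesses, applied to the `cnxn` from the code above):

// Untested: connection-level pragmas that may affect bulk-read speed.
// mmap_size enables memory-mapped I/O; a negative cache_size is in KiB.
use pragmas = new SQLiteCommand("PRAGMA mmap_size = 268435456; PRAGMA cache_size = -64000;", cnxn)
pragmas.ExecuteNonQuery() |> ignore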

Why do I think this should be able to go faster?

My computer has a theoretical read speed of 725 MB/s (SSD). Reading the SQLite table above, I read 40 MB in 1 s, an effective actual speed of only 40 MB/s.

Profiling also shows that about 35% of the time is spent in `reader.Read()` [not surprising] and the remainder in `GetInt32()` and `GetDouble()` [very surprising].
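
Given that, one variant I could benchmark is a single GetValues call per row instead of four typed getter calls (untested; GetValues boxes every value, and System.Data.SQLite hands INTEGER columns back as Int64, hence the Convert calls):

// Untested variant: one boxed GetValues call per row instead of four getters.
// SQLite INTEGER columns arrive boxed as Int64, so Convert.To* is used.
let buffer = Array.zeroCreate<obj> 4
while reader.Read() do
    reader.GetValues(buffer) |> ignore
    let d = { X = System.Convert.ToInt32(buffer.[0]); Y = System.Convert.ToInt32(buffer.[1])
              Z = System.Convert.ToInt32(buffer.[2]); AAA = System.Convert.ToDouble(buffer.[3]) }
    data.Add(d)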

  • Why would you want to bring 5 million rows to memory all at once? Nothing can speed that up beyond a certain point. Might want to rethink that design. – Hanky Panky Nov 05 '15 at 08:24
  • @Hanky웃Panky Thanks, but I really do want to bring them ALL into memory and keep them there for an indefinite period, with full knowledge of why that may not appear to be good design. – Sam Nov 05 '15 at 08:28
  • Measure your loop: check whether the `Read()` or the `Add()` takes the most time. – CL. Nov 05 '15 at 08:31
  • @CL. From profiling: most time is taken in `Read()` then in `GetInt32()` and then `GetDouble()` – Sam Nov 05 '15 at 08:43
  • There's nothing you can do to speed up the `Read()` (except maybe [VACUUM](http://www.sqlite.org/lang_vacuum.html)); it's normal for databases to be limited by the speed of the disk. – CL. Nov 05 '15 at 08:44
  • @CL. I just added a bit to the Question on why I think it could be faster. – Sam Nov 05 '15 at 08:55
  • Do you have an index covering those columns? – Sven Grosen Nov 05 '15 at 15:15
  • @SvenGrosen I've tried with and without an index on `X`, `Y` and `Z` and it makes little difference. This is probably because I'm not filtering or grouping the data in the table at all. – Sam Nov 05 '15 at 15:17
  • Not sure if this would gain you anything, but have you seen [this](http://stackoverflow.com/a/11385280/1346943)? – Sven Grosen Nov 05 '15 at 15:25
  • @Sam did the index include 'AAA' as well? – N_A Nov 05 '15 at 16:00
  • @mydogisbox No. That's a random `Double` value, so I didn't think that would be appropriate for inclusion in the index. – Sam Nov 05 '15 at 16:04
  • Is it an Included Column? If not then the index wouldn't be used. Try creating an index with `X`, `Y` and `Z` as the key and `AAA` as an included column (see the sketch after these comments). – N_A Nov 05 '15 at 16:16
  • What's all this debate about indexes? No indexes are going to make any difference on a full table dump request. Don't bother wasting time on that point. – Hanky Panky Nov 05 '15 at 17:38
  • @Hanky웃Panky You're assuming it's a FULL table dump. If the table has a lot of other columns, it could make a significant difference in the amount of time it takes to retrieve the data. – N_A Nov 05 '15 at 17:56
  • In my experience, [SQL Server Compact](https://msdn.microsoft.com/en-us/library/bb734673(v=sql.110).aspx) is noticeably faster than SQLite. – ildjarn Nov 05 '15 at 23:53
  • 2
    @Hanky웃Panky Fewer columns in an index results in fewer leaf pages stored on disc. Fewer pages means less disc io to load the table into memory. If the desired columns are 1/10 of the columns in the clustered index, then having an index covering only those columns could be a 90% reduction in data loaded from disc into memory and thus 1/10 the query time. – N_A Nov 06 '15 at 00:44
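
For reference, SQLite has no INCLUDE clause for indexes (that is SQL Server syntax); a covering index along the lines suggested in the comments simply lists `AAA` after the key columns. A sketch, with a made-up index name, that would also need a writable connection rather than the read-only one in the question:

// Hypothetical covering index; IX_Data_Covering is a made-up name.
// SQLite has no INCLUDE clause, so AAA goes at the end of the key columns.
use createIx = new SQLiteCommand("CREATE INDEX IF NOT EXISTS IX_Data_Covering ON Data (X, Y, Z, AAA)", cnxn)
createIx.ExecuteNonQuery() |> ignore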

0 Answers