
I have a table with a lot of rows (3 million) from which I need to query some rows at several points in my app. The way I found to do this is to query all the data the first time any of it is needed and store it in a static DataTable with SqlDataAdapter.Fill() for the rest of the app's life.

That's fast, because when I need something I just use DataTable.Select("some query") and the app processes the info just fine.

The problem is that this table takes about 800 MB of RAM, and I have to run this app on PCs where that might be too much.

The other way I thought of was to query the data I need each time. That takes little memory but performs poorly: it means a lot of queries to a database at a network address, and with 1,000 queries you start to notice the ping and all that...

Is there any intermediate point between performance and memory usage?


EDIT: What I'm retrieving are sales, which have a date, a product and a quantity. I query by product, and the table isn't indexed that way. But anyway, with 1,000 queries, even if each query took 0.05 s, a 0.2 s ping adds up to a total of 200 seconds...

    A general rule of database performance tuning is to retrieve only the data that is needed, and you design indexing strategies around that. It doesn't sound like you're doing that; can you give more details about the queries being run, and the indexes being used? – Stuart Ainsworth Aug 27 '15 at 13:02
  • There is a similar post here; check http://stackoverflow.com/questions/23816130/forcing-sql-server-to-pre-cache-entire-database-into-memory –  Aug 27 '15 at 13:03
  • Can this be an idea: use delta pulling. Fetch just the data that is required and store it in a collection, and keep doing that. Of course, before you send a request to fetch data, check your query keys against what is already present in your existing collection. – blogbydev Aug 27 '15 at 13:06
  • Assuming you're already optimizing as much as possible with indexes and so forth, I'd consider batching up the queries if possible. I've had to deal with long travel times to the server before, so I understand what you're dealing with. If you Google `ado.net multiple queries` you'll find plenty of examples. – Ed Gibbs Aug 27 '15 at 13:06
  • @StuartAinsworth edited with the info requested. – Pinx0 Aug 27 '15 at 13:24
  • Are you doing 1000 queries sequentially or simultaneously? For a well-tuned query on a good server, 1000 simultaneous connections should be no problem. What's keeping you from indexing by product? – Stuart Ainsworth Aug 27 '15 at 13:40
  • I can't index by anything because I'm not the database admin, nor can I ask for that. The problem, I think, is that I do them sequentially. Whenever a product object is created, I query for its sales. How could I do them simultaneously? – Pinx0 Aug 27 '15 at 13:46
  • "Not the database admin, nor can ask for that" - Is there a political reason, or a some other reason you cannot ask for an index? – StingyJack Oct 13 '15 at 19:47

3 Answers


First, talk to the DBA about performance.

If you are downloading the entire table, you might actually be putting more load on the network and SQL Server than if you performed individual queries.

As a DBA, if I knew you were downloading an entire large table, I would put an index on product immediately.

Why are you performing 1000s of queries?

If you are looking up sales when a product object is created, then a cache is problematic: you would not yet have sales data for it. The problem with a cache is stale data. If you know the data will not change (you either have it or you don't), then you can eliminate the concern about stale data.

There is something between sequential and simultaneous: you can pack multiple selects into a single request. That makes a single round trip, which is more efficient.

select * from tableA where ....;  
select * from tableB where ....; 

With a DataReader, just call the SqlDataReader.NextResult() method:

using (SqlDataReader rdr = cmd.ExecuteReader())
{
   while (rdr.Read())
   {
       // process rows from the first select
   }
   rdr.NextResult();   // advance to the second result set
   while (rdr.Read())
   {
       // process rows from the second select
   }
}

Pretty sure you can do the same type of thing with multiple DataTables in a DataSet.
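Something along these lines should work as a minimal sketch, assuming an open SqlConnection conn (the where clauses are the same placeholders as above, not real queries):

// Minimal sketch: one batched command filled into a DataSet in a single
// round trip; "conn" is an assumed open SqlConnection.
var sql = "select * from tableA where ....; select * from tableB where ....;";

var ds = new DataSet();
using (var adapter = new SqlDataAdapter(sql, conn))
{
    adapter.Fill(ds);   // both result sets arrive together
}

DataTable a = ds.Tables[0];   // rows from tableA
DataTable b = ds.Tables[1];   // rows from tableB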

Another option is LocalDB. It is targeted at developers, but for what you are doing it would work just fine: DataTable speed without the memory concerns, and you can even put an index on ProductID. Writing to disk takes a little longer than writing to memory, but you are not using up memory.
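As a rough sketch of that idea, assuming the default (localdb)\MSSQLLocalDB instance, an already-created SalesCache database, and a table layout that just mirrors the date/product/quantity columns from the question:

// Rough sketch: copy the rows into LocalDB once, index them, then run
// cheap local queries. Instance, database and column names are assumptions.
var local = new SqlConnection(
    @"Server=(localdb)\MSSQLLocalDB;Integrated Security=true;Database=SalesCache");
local.Open();

using (var cmd = local.CreateCommand())
{
    cmd.CommandText = @"create table Sales (SaleDate datetime, Product int, Quantity int);
                        create index IX_Sales_Product on Sales (Product);";
    cmd.ExecuteNonQuery();
}

// Bulk-load the 3M rows once; remoteReader is an assumed data reader
// over the big table on the remote server.
using (var bulk = new SqlBulkCopy(local) { DestinationTableName = "Sales" })
{
    bulk.WriteToServer(remoteReader);
}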

Then there is the ever-evil with (nolock). Know what you are doing; I am not going to go into all the possible evils, but I can tell you that I use it a lot.

– paparazzo

RE: "I can't index by anything because I'm not the database admin nor can ask for that."

Can you prepopulate a temp table and index that? E.g.:

Select * into #MyTempTable from BigHugeTable 
Create Index Prodidx on #MyTempTable (product)

You will have to ensure you always reuse the same connection (and that it isn't closed) in order to use the temp table.
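A sketch of that pattern might look like this (connStr and the method names are illustrative):

// One long-lived connection: a #temp table is visible only on the
// connection that created it, so this connection must stay open for
// the lifetime of the app. connStr is an assumed connection string.
static readonly SqlConnection Conn = new SqlConnection(connStr);

static void BuildTempTable()
{
    Conn.Open();
    using (var cmd = Conn.CreateCommand())
    {
        cmd.CommandText = @"select * into #MyTempTable from BigHugeTable;
                            create index Prodidx on #MyTempTable (product);";
        cmd.ExecuteNonQuery();
    }
}

static DataTable SalesFor(int productId)
{
    using (var cmd = Conn.CreateCommand())
    {
        cmd.CommandText = "select * from #MyTempTable where product = @p";
        cmd.Parameters.AddWithValue("@p", productId);
        var dt = new DataTable();
        dt.Load(cmd.ExecuteReader());   // small, indexed lookup per product
        return dt;
    }
}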

– John Price

The question boils down to memory vs. performance, and the answer to that is caching.


If you know what your usage pattern is like, then one thing you can do is create a local cache in the app.

The extreme cases are: your cache is 800 MB with all your data in it (sacrificing memory), or your cache is 0 MB and every query goes over the network (sacrificing performance).

Three important questions about the design of the cache are answered below.


How to populate the Cache?

  1. If you are likely to run some query multiple times, then before going to the network, test whether your cache already has the result. If it doesn't, query the database and store the result in the cache.
  2. If, after querying for some data, you are likely to query the next and/or previous piece of data, then query all of it at once and cache it, so that when you ask for the next piece you already have it.

Effectively, the idea is that if you know some information may be needed in the future, cache it beforehand; a minimal sketch of idea 1 follows.
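A bare-bones cache-aside sketch, keyed by product (QueryDatabase is a hypothetical stand-in for the real network call):

// Bare-bones cache-aside: check the cache first, hit the database on a miss.
// QueryDatabase is a hypothetical stand-in for the real query.
static readonly Dictionary<int, DataTable> Cache = new Dictionary<int, DataTable>();

static DataTable GetSales(int productId)
{
    DataTable result;
    if (Cache.TryGetValue(productId, out result))
        return result;                  // hit: no network round trip

    result = QueryDatabase(productId);  // miss: go to the network once
    Cache[productId] = result;          // remember it for next time
    return result;
}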


How to free the Cache?

You can make the freeing mechanism for the cache either active or passive.

Passively: whenever the cache is full, evict data from it.

Actively: run a background thread at a regular interval that takes care of the removal for you.

One hybrid method is to start a freeing thread as soon as you reach, let's say, 80% of your memory limit, and then free whatever memory you can.
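A sketch of that hybrid approach, reusing the cache above (the threshold, the period and PickVictims are all illustrative, using System.Threading.Timer):

// Hybrid freeing: a timer wakes up periodically and evicts entries once
// the cache passes ~80% of its entry budget. All numbers are illustrative,
// and PickVictims (choosing what to evict) is a hypothetical policy hook
// that returns a snapshot of keys to remove.
const int MaxEntries = 10000;

static readonly Timer Cleaner = new Timer(_ =>
{
    lock (Cache)
    {
        if (Cache.Count > MaxEntries * 0.8)
        {
            foreach (var key in PickVictims(Cache, Cache.Count - MaxEntries / 2))
                Cache.Remove(key);
        }
    }
}, null, 0, 30000);   // first run immediately, then every 30 seconds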


What data to remove from the Cache?

This has already been answered in the context of page replacement policies for operating systems.

For completeness, I'll summarize the important ones here:

  1. Evict the least recently used data (if it is not likely to be used again); see the sketch after this list.
  2. Evict the data that was brought in earliest (if the earliest data is not likely to be used);
  3. Evict the data that was brought in latest (if you think the newly brought-in data is least likely to be used);
  4. Automatically remove data that is older than t time units.
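For policy 1, a compact LRU sketch on top of Dictionary + LinkedList (capacity handling only; thread safety omitted, and the types are illustrative):

// Compact LRU (policy 1): a touch moves an entry to the front of the list,
// eviction removes from the back. Not thread-safe; capacity is illustrative.
class LruCache<TKey, TValue>
{
    private readonly int _capacity;
    private readonly Dictionary<TKey, LinkedListNode<KeyValuePair<TKey, TValue>>> _map =
        new Dictionary<TKey, LinkedListNode<KeyValuePair<TKey, TValue>>>();
    private readonly LinkedList<KeyValuePair<TKey, TValue>> _order =
        new LinkedList<KeyValuePair<TKey, TValue>>();

    public LruCache(int capacity) { _capacity = capacity; }

    public bool TryGet(TKey key, out TValue value)
    {
        LinkedListNode<KeyValuePair<TKey, TValue>> node;
        if (!_map.TryGetValue(key, out node)) { value = default(TValue); return false; }
        _order.Remove(node);              // touched: move to the front
        _order.AddFirst(node);
        value = node.Value.Value;
        return true;
    }

    public void Put(TKey key, TValue value)
    {
        LinkedListNode<KeyValuePair<TKey, TValue>> existing;
        if (_map.TryGetValue(key, out existing))
            _order.Remove(existing);          // replacing: drop the old node
        else if (_map.Count >= _capacity)     // full: evict least recently used
        {
            _map.Remove(_order.Last.Value.Key);
            _order.RemoveLast();
        }
        _map[key] = _order.AddFirst(new KeyValuePair<TKey, TValue>(key, value));
    }
}

Usage would be something like new LruCache<int, DataTable>(1000), where 1000 is whatever entry budget fits your RAM.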
– displayName