I have some performance tests for an index structure on some data. I will be comparing two indexes side by side (I have not yet decided whether I will use two VMs). I need the results to be as neutral as possible, of course, so I have a few questions I would appreciate any input on... How can I ensure/control what is influencing the test? For example, caching effects or the order of arrival from one test to the next will influence the results. How can I measure these influences? How do I create a suitable warm-up? And what kind of statistical techniques can I use to cancel out such influences (I don't think simple averages are enough)?
2 Answers
Before you start:
- Make sure your tables and indices have just been freshly created and populated. This avoids issues with regard to fragmentation. Otherwise, if the data in one test is heavily fragmented, and the other is not, you might not be comparing apples to apples.
- Make sure your tables are properly ANALYZEd. This makes sure that the query planner has proper statistics in all cases.
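As a rough sketch of that last point, something along these lines refreshes the planner statistics (and reclaims dead tuples) before a run; the psycopg2 driver, the connection string and the table name are assumptions chosen for illustration, not part of the original answer:

```python
# Sketch: refresh statistics before a benchmark run (assumed names: benchdb, my_table).
import psycopg2

conn = psycopg2.connect("dbname=benchdb user=bench")
conn.autocommit = True            # VACUUM cannot run inside a transaction block
with conn.cursor() as cur:
    cur.execute("VACUUM ANALYZE my_table;")   # reclaim dead tuples and update planner statistics
conn.close()
```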
If you just want a comparison, and not a test under realistic use, I'd just do:
- Cold-start your (virtual) machine. Wait a reasonable but fixed time (let's say 5 min, or whatever is reasonable for your system) so that all startup processes have taken place and do not interfere with the DB execution.
- Perform the test with index1 and measure the time (this is the timing where nothing is cached by either the database or the OS).
If you're interested in results when there are cache effects: perform the test another 10 times (or as many times as is reasonable), measuring each run, to account for variability due to other processes running on the VM and other contingencies.
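A minimal sketch of such a timing loop, measured from the client side, might look like this; the query, connection string and run count are placeholders for illustration:

```python
# Sketch: run the same test query repeatedly and record wall-clock timings.
import time
import psycopg2

TEST_QUERY = "SELECT count(*) FROM my_table WHERE some_column BETWEEN 100 AND 200;"  # placeholder query
RUNS = 10

conn = psycopg2.connect("dbname=benchdb user=bench")
timings = []
with conn.cursor() as cur:
    for run in range(RUNS):
        start = time.perf_counter()
        cur.execute(TEST_QUERY)
        cur.fetchall()                              # force the full result to be read
        timings.append(time.perf_counter() - start)
conn.close()

print(timings)   # first run is typically the coldest, later runs reflect cache effects
```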
Reboot your machine and repeat the whole process for test2. There are methods to clear the OS cache, but they're very system dependent, and you don't have a way to clear the database cache. Check *See and clear Postgres caches/buffers?*.
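For the OS-cache part, a Linux-only sketch could look like the following (restarting the service also empties PostgreSQL's shared buffers); the service name and the need for root privileges are assumptions and vary per system:

```python
# Sketch: flush the Linux page cache between cold-cache runs (requires root).
import subprocess

def drop_caches():
    subprocess.run(["systemctl", "stop", "postgresql"], check=True)   # service name varies per distro
    subprocess.run(["sync"], check=True)                              # flush dirty pages to disk first
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")                                                # drop page cache, dentries and inodes
    subprocess.run(["systemctl", "start", "postgresql"], check=True)
```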
If you are really (or mostly) interested in performance when there are no cache effects, you should perform the whole process several times. It's slow and tedious. If you're only interested in the case where there's (most probably) a cache effect, you don't need to restart again.
Perform an ANOVA (or any other statistical hypothesis test you think is better suited) to decide whether the average times are statistically different or not.
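For example, with SciPy the test is a one-liner once you have the two lists of timings (with only two groups, a one-way ANOVA is equivalent to a t-test); the numbers below are made up purely for illustration:

```python
# Sketch: one-way ANOVA on the two timing samples (made-up data).
from scipy import stats

timings_index1 = [0.84, 0.79, 0.81, 0.83, 0.80, 0.82, 0.78, 0.85, 0.81, 0.80]
timings_index2 = [0.71, 0.69, 0.74, 0.70, 0.72, 0.68, 0.73, 0.70, 0.71, 0.69]

f_stat, p_value = stats.f_oneway(timings_index1, timings_index2)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("The mean run times differ significantly at the 5% level.")
else:
    print("No statistically significant difference between the two indexes.")
```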
You can see an example of performing several tests in the answer to a question about NOT NULL versus CHECK(xx NOT NULL).
- Thanks, incredibly helpful comment. Yes, ultimately I do just want a comparison, but to be clear, what do you mean by "under realistic use"? – Zeruno Jan 29 '17 at 23:35
- **Realistic use** means: the database is running all kinds of transactions because of your application and other applications. Your server is also running other processes and not only the database (let's say a mail server, a web server, file shares, ... you name it). In that case, *cache* effects will vary a lot. Some *smallish* tables may be very heavily used and always sit in the database cache. Others might be only partially cached or not cached at all... and the same exact query might take very different timings from one run to the next. – joanolo Jan 29 '17 at 23:38
- Hi, it has been some time, but rereading your points is giving me some new things to think about. First, how often should a table be analyzed? Yes, fresh statistics are a good idea, but suppose the index had no reason to change (only a few trivial SELECT queries were executed, for example): is it still necessary to analyze frequently? And one last point: are there any official standards for database benchmarking? I would like to thank you again for sharing so much information. – Zeruno Mar 11 '17 at 23:44
- If PostgreSQL is running with a more-or-less standard configuration, the autovacuum process will take care of re-analyzing your tables when the number of changes passes a certain threshold. If you want to perform tests, however, do *analyze* before you start (at least the first time), to make sure all your tests are based on *fully analyzed* data. – joanolo Mar 11 '17 at 23:59
- With regard to benchmarks, I'm not sure they exist as a *standard*, but check [TPC](http://www.tpc.org/tpcc/results/tpcc_last_ten_results.asp) and see if that fits your needs. I also know that PostgreSQL has [pgbench](https://www.postgresql.org/docs/devel/static/pgbench.html): it serves to compare *versions* of PostgreSQL, or the effect of different *configuration settings*, *operating systems* or *hardware*. – joanolo Mar 11 '17 at 23:59
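For reference, a minimal way to drive pgbench from a script could look like the sketch below; it assumes pgbench is on the PATH and that a database named benchdb already exists, and it only uses the standard initialize and run options:

```python
# Sketch: initialize and run pgbench against an assumed database "benchdb".
import subprocess

subprocess.run(["pgbench", "-i", "-s", "10", "benchdb"], check=True)   # initialize with scale factor 10
result = subprocess.run(
    ["pgbench", "-c", "10", "-j", "2", "-T", "60", "benchdb"],         # 10 clients, 2 threads, 60 seconds
    check=True, capture_output=True, text=True,
)
print(result.stdout)   # includes the transactions-per-second summary
```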
To be as neutral as possible, create two databases on the same instance of your database management system, then create the same tablespaces and tables with the same data, adding the indexes in one database but not the other.
The challenge with a VM is that you have arbitrated access to your disk resources (unless each VM is pinned to a specific interface and disk set). Because of this, the arbitration model could vary from one test to the next. The most neutral course, which removes the arbitration entirely, is physical hardware... and the same hardware in both cases.
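A minimal sketch of that two-database setup, assuming psycopg2 and made-up database, table and column names (the data-loading step is left as a comment):

```python
# Sketch: two databases on the same instance, identical table, index on only one copy.
import psycopg2

admin = psycopg2.connect("dbname=postgres user=bench")
admin.autocommit = True                       # CREATE DATABASE cannot run inside a transaction
with admin.cursor() as cur:
    cur.execute("CREATE DATABASE bench_with_index;")
    cur.execute("CREATE DATABASE bench_no_index;")
admin.close()

DDL = """
CREATE TABLE my_table (id bigint, some_column integer, payload text);
-- load the same data into both copies, e.g. with COPY my_table FROM ...
"""

for dbname, with_index in [("bench_with_index", True), ("bench_no_index", False)]:
    conn = psycopg2.connect(f"dbname={dbname} user=bench")
    with conn.cursor() as cur:
        cur.execute(DDL)
        if with_index:
            cur.execute("CREATE INDEX my_table_some_column_idx ON my_table (some_column);")
    conn.commit()
    conn.close()
```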
