30

Something I've noticed when testing code I write is that long-running operations tend to take much longer the first time a program is run than on subsequent runs, sometimes by a factor of 10 or more. Obviously there's some sort of cold-cache/warm-cache issue here, but I can't seem to figure out what it is.

It's not the CPU cache, since these long-running operations tend to be loops that I feed a lot of data to, and the loop code should be fully cached after the first iteration. (Plus, unloading and reloading the program should clear the cache.)

Also, it's not the disc cache. I've ruled that out by loading all data from disc up-front and processing it afterwards, and it's the actual CPU-bound data processing that's going slowly.

So what can cause my program to run slow the first time I run it, but then if I close it and run it again, it runs dramatically faster? I've seen this in several different programs that do very different things, so it seems to be a general issue.

EDIT: For clarification, I'm writing in Delphi, though I don't really think this is a Delphi-specific issue. But that means that whatever the problem is, it's not related to JIT issues, garbage collection issues, or any of the other baggage that managed code brings with it. And I'm not dealing with network connections. This is pure CPU-bound processing.

One example: a script compiler. It runs like this:

  • Load entire file into memory from disc
  • Lex the entire file into a queue of tokens
  • Parse the queue into a tree
  • Run codegen on the tree to produce bytecode

If I feed it an enormous script file (~100k lines), then after loading the entire thing from disc into memory, the lex step takes about 15 seconds the first time I run it, and 2 seconds on subsequent runs. (And yes, I know that's still a long time. I'm working on that...) I'd like to know where that slowdown is coming from and what I can do about it.
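For reference, a minimal sketch of timing each phase separately, so the first-run penalty can be pinned to one step (LexFile, ParseTokens, Source, Tokens and Tree are placeholder names, not the actual compiler routines or data structures):

```pascal
// Sketch only: time each phase in isolation on the first and second run.
// Source, Tokens and Tree stand in for the real data structures, and
// LexFile / ParseTokens are placeholders for the real routines.
uses Windows, SysUtils;

var
  Freq, T1, T2: Int64;
begin
  QueryPerformanceFrequency(Freq);

  QueryPerformanceCounter(T1);
  Tokens := LexFile(Source);          // placeholder: lex the whole file
  QueryPerformanceCounter(T2);
  Writeln(Format('Lex:   %.3f s', [(T2 - T1) / Freq]));

  QueryPerformanceCounter(T1);
  Tree := ParseTokens(Tokens);        // placeholder: parse the token queue
  QueryPerformanceCounter(T2);
  Writeln(Format('Parse: %.3f s', [(T2 - T1) / Freq]));
end.
```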

Peter
  • 37,042
  • 39
  • 142
  • 198
Mason Wheeler
  • 82,511
  • 50
  • 270
  • 477
  • 7
    Are you sure that it is not the I/O caching? This sounds like the OS's I/O caching layer to me. CPU calculations are *not* cached, so it *has* to be something in storage somewhere... – John Gietzen Sep 26 '11 at 21:12
  • 1
    It's most likely the disk cache. But the question is a little vague at the moment. – David Heffernan Sep 26 '11 at 21:13
  • 1
    @Ed Language is irrelevant to this question. Mason has tagged it fine. – David Heffernan Sep 26 '11 at 21:13
  • 4
    @David see first answer for counterexample. I could add more for other languages. Don't assume. – Ed Staub Sep 26 '11 at 21:15
  • 1
    You've implied that, for the first run, it's _uniformly_ slower. Do you really know that - or is it possible that it's just a lot slower, say, on the first iteration? – Ed Staub Sep 26 '11 at 21:19
  • 1
    Are you running within an IDE, or directly from the command line? – Ed Staub Sep 26 '11 at 21:21
  • @Ed Do you mean framework or library rather than language? – David Heffernan Sep 26 '11 at 21:25
  • 1
    @Ed: Writing in Delphi. Not tagging this as Delphi because I really don't think it's a Delphi-specific problem. And I'm running under the IDE, but not doing anything that the debugger should cause to happen more slowly. (And I've got Windows' Debug Heap functionality turned off.) – Mason Wheeler Sep 26 '11 at 21:29
  • 2
    Needing 15 seconds to lex a 100 kb file that's stored in memory is bizarrely long. Find out why it is so slow first, out pops the reason why it is faster later. Pay attention to page faults btw. – Hans Passant Sep 26 '11 at 21:33
  • @Hans: Not 100 KB; 100 K **lines**. The file size is a hair under 3 MB. – Mason Wheeler Sep 26 '11 at 21:37
  • 12
    Three thoughts: 1) Virus scanner? 2) Have you tried running the executable without the IDE? 3) Do you use some DLLs that are probably only loaded the first time? – Toon Krijthe Sep 26 '11 at 21:43
  • What effect (if any) does restarting the IDE between runs have? – Ed Staub Sep 26 '11 at 21:58
  • 1
    and these times are for CPU only code. Absolutely no IO? – David Heffernan Sep 26 '11 at 22:06
  • 2
    Gamecat's virus scanner point is a very good one. A lot of virus scanners will only do a full check of the app the first time it's run. Other times, they'll probably just do a hash-check on the file to confirm that it wasn't changed. @Gamecat - you should put that in an answer. –  Sep 26 '11 at 22:37
  • 1
    @Steve314: Good idea, but no. I'll see the same effect even when I modify things a little and re-run, which should require a full re-scan if that was what was doing it. – Mason Wheeler Sep 26 '11 at 22:42
  • 1
    Well, if it is a CPU-bound issue, it should scale with CPU speed. Have you tried running the application on a much slower CPU? Does the differential scale accordingly? – HMcG Sep 27 '11 at 06:13
  • 1
    Have you tried to disable Windows [Prefetch](http://en.wikipedia.org/wiki/Prefetcher) / [Superfetch](http://en.wikipedia.org/wiki/Windows_Vista_I/O_technologies#SuperFetch)? Maybe one of them is doing its work during the first execution of your program. – JRL Sep 27 '11 at 19:34
  • 1
    Have you watched the performance on task manager's resource monitor? That will help you to isolate whether it's truly CPU-bound or whether you're getting page faults or I/O stalls. And if it is CPU bound, you can see whether it's your code or whether some other process is interacting. – TMN Sep 28 '11 at 12:41
  • Somewhat related re: microbenchmarking: [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987) – Peter Cordes May 24 '22 at 14:25

10 Answers

13

Three things to try:

  • Run it in a sampling profiler, including a "cold" run (first thing after a reboot). Should usually be enough.
  • Check memory usage: does it grow so high (even transiently) that the OS has to swap things out of RAM to make room for your app? That alone could explain what you're seeing. Also look at the amount of free RAM you have when you start your app (a quick way to check for paging from inside the process is sketched after this list).
  • Enable system performance tools and check the I/O counters or file accesses, and make sure under FileMon / Process Explorer that you don't have some file or network accesses you've forgotten about (leftover log/test code)
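For the memory point, a quick way to check for paging from inside the process itself is to read its own counters with the Win32 GetProcessMemoryInfo call (a minimal sketch; the DumpMemoryCounters name is just illustrative):

```pascal
// Sketch: dump the process's page-fault count and working set before and
// after the heavy phase, to see whether the OS is paging.
uses Windows, PsAPI, SysUtils;

procedure DumpMemoryCounters(const Stage: string);
var
  Counters: TProcessMemoryCounters;
begin
  FillChar(Counters, SizeOf(Counters), 0);
  Counters.cb := SizeOf(Counters);
  if GetProcessMemoryInfo(GetCurrentProcess, @Counters, SizeOf(Counters)) then
    Writeln(Format('%s: %u page faults, working set %u KB (peak %u KB)',
      [Stage, Counters.PageFaultCount,
       Counters.WorkingSetSize div 1024,
       Counters.PeakWorkingSetSize div 1024]));
end;
```

Calling this before and after the slow phase, on the first run and on a later run, shows quickly whether page faults account for the difference.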
Eric Grange
  • 5,931
  • 1
  • 39
  • 61
5

Even (especially) for very small command-line programs, the issue can be the time it takes to load the process, link to dynamically-linked libraries, etc. I believe modern operating systems avoid repeating a lot of this work if the same program is run twice at once, or repeatedly.

I wouldn't dismiss the CPU cache so easily, either. The L1 cache is very relevant for inner loops, but much less so for a second run of the same application. On my cheap Athlon II X4 645 system, there's 64K + 64K (data + instruction) L1 cache per core - not exactly a huge amount of memory. The L2 cache is IIRC 512K per core, so less likely to be dirtied to complete irrelevance by the OS code needed to start up a new run of the program, calls to operating system services and standard libraries, etc. The L3 cache (on CPUs that have it - my Athlon II doesn't, IIRC) is larger still, and there may be some even higher-level, larger cache provided by the motherboard/chipset.

There's at least one other kind of cache - branch prediction tables. Though I'd have thought they'd be dirtied to irrelevance even quicker than the L1 cache.

I generally find that unit test programs run many times slower the first time. However, the larger and more complex the program, the less significant the effect.

For some time now, the performance of applications has often been considered non-deterministic. Although that isn't strictly true, performance is determined by so many hard-to-predict factors that it's a good model. For example, if the CPU gets a bit warm, the clock speed may be reduced to prevent overheating. The temperature varies across different parts of the chip, with heat conducting through it in complex ways. As changes in clock speed and the different demands of different pieces of code alter the patterns of changing temperature, there's a clear potential for chaotic (as in chaos theory) behaviour.

On some platforms, I wouldn't be surprised if the first run of the program got the processor into its "fast" (rather than cool/quiet) mode, and that meant that the beginning of the second run benefited from that speed boost as well as the end. However, this would be a tricky one - it would have to be a CPU-intensive program, and if your cooling is inadequate, the processor may then slow down again to avoid overheating.

  • 1
    I'm not counting load times here. This is all GUI-driven, so I can make sure everything is loaded before all the number-crunching begins. – Mason Wheeler Sep 26 '11 at 22:48
  • 1
    I have to admit, on second thoughts, inflating from 2 seconds to 15 seconds seems implausible for these effects anyway. –  Sep 26 '11 at 22:56
  • 2
    IMHO the CPU cache and branch prediction / pipeline handling are not relevant here, since the code is always the same. – Arnaud Bouchez Sep 27 '11 at 05:26
  • 1
    @Arnaud - the code is the same yet the performance is different - that's exactly when you look at issues like CPU cache, branch predictions etc. If the code is already in the CPU instruction cache, then that code will run faster than if it wasn't. If there is relevant branching information still in the table, then branch prediction will be more accurate, and that code will run faster than if it wasn't. –  Sep 28 '11 at 00:27
  • 1
    *WITHOUT* any form of caching (and ignoring a few technicalities like initial hard disk head location, and assuming the program has full control of the machine), all programs would run in exactly the same time each time they are run with the same data. Once upon a time, it was possible to determine exact run-times - to the clock cycle - because there were few if any caches to consider, there was no multitasking etc. –  Sep 28 '11 at 00:31
  • @Steve314 Exactly. With a "naive" OS like Toro, or running a program in your own extended mode (using DWPL for instance - I've done that, it's pretty cool), you may see the huge influence of CPU caching. But today's OSes are much more advanced than that. CPU caching will definitely not cause a huge time difference here; it's more likely some other background task, like library initialization and such. – Arnaud Bouchez Sep 29 '11 at 05:43
5

I'd guess it's all your libraries/DLLs. These are usually loaded on-demand at run-time, so the first time your program runs the OS will have to read them all from disk. Once read, though, they'll stay loaded unless your system starts running low on memory. So if you run the same program several times in succession, the first run takes the brunt of the load time, and the other runs benefit from the pre-loaded libraries.

TMN
  • 3,060
  • 21
  • 23
  • 1
    No, this is all stuff that's taking place after everything has been loaded from disc. I made sure to account for that. – Mason Wheeler Sep 27 '11 at 20:08
  • 1
    How much temporary space does your code use? Maybe it's not stuff getting paged *in* that's the problem, maybe it's stuff getting paged *out* to make room. Once it's paged out and your program exits, the system will have lots of available memory for subsequent runs. Have you noticed if your browser or other programs tend to "hesitate" the first time you access them after running your code? – TMN Sep 28 '11 at 12:50
4

I usually experience the contrary: for computation-intensive work (when the antivirus isn't interfering), I only see a 5-10% difference between runs. For instance, the 6,000,000 regression tests run for our framework take a very consistent time, and that's very disk- and CPU-intensive work.

I really don't believe in a CPU cache or pipelining / branch prediction issue either, since both the processed data and the code are consistent, as you wrote. If the antivirus is off, it may be about OS thread settings: did you try changing the process CPU affinity and priority?

This will be very specific to the process you are running. Without any actual source code to reproduce it, it's almost impossible to tell what's happening in your case. How many threads are there? What is the hardware configuration (is there any Intel CPU boost involved - are you using a laptop, and what are your power settings)? Is it using CPU/FPU/MMX/SSE2 (e.g. MMX and FPU do not mix)? Does it move a lot of data, or process some existing data? Does your software depend on external libraries (even some Windows libraries may need some time to initialize)? How do you use memory (did you try pre-allocating it; or, in a multi-threaded application, did you try using a scaling memory manager instead of FastMM4)?

I think using a sampling profiler may not help much, since it will change the overall CPU core usage, but it's worth trying in any case. I'd rather rely on logging/instrumentation profiling - see e.g. this class, or you may write your own timestamps to find where the timing changes in your app.
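For instance, a minimal hand-rolled timestamp logger (plain Win32 QueryPerformanceCounter and OutputDebugString; the LogPhase name is only illustrative, not part of any library):

```pascal
unit PhaseLog;
// Sketch: call LogPhase('Lex done'), LogPhase('Parse done'), etc. and watch
// the output in the IDE's event log or in DebugView.

interface

procedure LogPhase(const Name: string);

implementation

uses Windows, SysUtils;

var
  Freq, Start: Int64;

procedure LogPhase(const Name: string);
var
  Stamp: Int64;
begin
  QueryPerformanceCounter(Stamp);
  OutputDebugString(PChar(Format('%s: %.3f s since start',
    [Name, (Stamp - Start) / Freq])));
end;

initialization
  QueryPerformanceFrequency(Freq);
  QueryPerformanceCounter(Start);
end.
```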

AFAIK it has always been said that, when benchmarking, the first run of an application should never be taken into account. Computer systems are so complex nowadays that, on the first run, all the internal (software and hardware) plumbing has to be purged - just as you shouldn't drink the first water coming out of your tap when you come back from a month of travel. ;)

Arnaud Bouchez
  • 42,305
  • 3
  • 71
  • 159
2

Another factor I can think of is memory alignment (and the resulting cache-line fills), but if there were just two cases - perfect alignment (fastest) and imperfect (slower) - one would expect the slowdown to occur irregularly (depending on how memory happens to be laid out), not consistently on the first run.

Perhaps it has something to do with physical page layout? As far as I know, each memory-access goes through the MMU page table entries, so dispersed physical pages could be slower than consecutive pages. (Just a wild guess, this one)

Another thing I haven't seen mentioned yet is which core(s) your process is running on - especially on hyper-threaded CPUs, running on the slower of the two logical cores might have a negative impact. Try setting the processor affinity mask to one and the same core for every run, and see if that affects the measured runtime differences between first and subsequent runs.
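A minimal sketch of doing that with plain Win32 calls from the Windows unit (a mask value of 1 means "the first logical CPU"):

```pascal
// Sketch: pin the whole process to logical CPU 0 so every run executes on
// the same core, removing core-to-core variation from the measurements.
uses Windows, SysUtils;

begin
  if not SetProcessAffinityMask(GetCurrentProcess, 1) then
    RaiseLastOSError;
  // Optionally raise the priority too, so background tasks interfere less:
  SetPriorityClass(GetCurrentProcess, HIGH_PRIORITY_CLASS);

  // ... run the CPU-bound work and measure it here ...
end.
```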

By the way - how do you define 'first run'? Could it be that you've just compiled the executable? In that case (and I'm just guessing again here), some process (either the OS, a virus scanner, or even some rootkit) might be busy analyzing your executable's behaviour, which might be skipped once the executable has been analyzed before. You could try to prove that by changing some random unimportant byte of your executable between runs and seeing whether that impacts the runtime negatively again.

Please post a summary once you figured out the cause(s) - this might help others too. Cheers!

PatrickvL
  • 4,104
  • 2
  • 29
  • 45
2

Just a random guess...

Does your processor support adaptive frequency? Maybe the processor just doesn't have time to ramp up its frequency during the first run, and is already running at full speed on the second one.

Ken Bourassa
  • 6,363
  • 1
  • 19
  • 28
  • 2
    AFAIK even an old AMD Athlon 64 could change frequency 30 times per second, so that doesn't explain the 13 second gap. – The_Fox Sep 28 '11 at 06:47
1

There are lots of things that can cause this. Just as one example: if you're using ADO.NET for data access with connection pooling turned on (which is the default), the first time your application runs it will take the hit of creating the database connection. When your app is closed, the connection is maintained in its open state by ADO.NET, so the next time your app runs and does data access it will not have to take the hit of instantiating the connection, and thus will appear faster.

MusiGenesis
  • 74,184
  • 40
  • 190
  • 334
1

Guessing you're using .NET - if I'm wrong you can ignore most of my ideas...

Connection pooling, JIT compilation, reflection, I/O caching - the list goes on and on...

Try testing smaller portions of the code to see what parts change performance the most...

You could try NGen'ing your assemblies, as this removes the JIT compilation step.

Peter
  • 37,042
  • 39
  • 142
  • 198
0

> where that slowdown is coming from and what I can do about it.

The quicker execution on subsequent runs can come from several layers of performance caching:

  • The disk's internal cache (8 MB or more)
  • Windows' caching of application dependencies (DLLs) and core OS components
  • The CPU's L3 cache (or L2, if some of the program's loops are small enough)

So the first time around, you don't benefit from any of these caching systems.

TridenT
  • 4,879
  • 1
  • 32
  • 56
0

Especially if you are using Python (this can also happen with NodeJS, PHP, and other scripting languages that compile at runtime), you will usually see a significant speed increase the second time a program is run, because the first run creates a cached version of the script compiled to its bytecode, for faster loading on subsequent starts of the program.

I am working on a Python project right now, and when I am done making changes and ready to have it running in production, I always start the script up, let it complete its startup phase of loading imports and declaring variables, and then kill and restart the program; it runs nice and smooth from there.

Every time any of the code changes, it has to rebuild its cached bytecode, so whenever I leave anything running on a server I always start it and then restart it before walking away, because you see this effect in various types of software.

(There are also optimizers that eliminate all unused procedures from the run-time library of programs translated by Turbo Pascal / Delphi; the overall effect is the same, in that runs after the initial run will load faster.)