
I am benchmarking some R statements (see details here) and found that my elapsed time is way longer than my user time.

   user  system elapsed 
  7.910   7.750  53.916 

Could someone help me understand what factors (R or hardware) determine the difference between user time and elapsed time, and how I can improve it? In case it helps: I am running data.table data manipulation on a MacBook Air with a 1.7 GHz i5 and 4GB of RAM.

Update: My crude understanding is that user time is how long it takes my CPU to process my job, and elapsed time is the length of time from when I submit a job until I get the data back. What else did my computer need to do after processing for 8 seconds?

Update: as suggested in the comments, I ran the procedure a couple of times on two data.tables: Y, with 104 columns (sorry, I add more columns as time goes by), and X, a subset of Y with only the 3 key columns. Below are the results. Please note that I ran the two procedures consecutively, so the memory state should be similar.

 X <- Y[, list(Year, MemberID, Month)]

 system.time({
   X[, Month := -Month]
   setkey(X, Year, MemberID, Month)
   X[, Month := -Month]
 })
   user  system elapsed 
  3.490   0.031   3.519 

 system.time({
   Y[, Month := -Month]
   setkey(Y, Year, MemberID, Month)
   Y[, Month := -Month]
 })
   user  system elapsed 
  8.444   5.564  36.284 

Here are the sizes of the only two objects in my workspace (commas added):

object.size(X)
83,237,624 bytes

 object.size(Y)
2,449,521,080 bytes

Thank you

  • See `?proc.time`, as suggested by `?system.time`. Can you clarify what you find confusing about what's written there? – Joshua Ulrich Dec 03 '12 at 17:49
  • My crude understanding is that user time is what it takes my CPU to process my job, and elapsed time is the length of time from when I submit a job until I get the data back. What else did my computer need to do after processing for 8 seconds? – AdamNYC Dec 03 '12 at 17:54
  • Sounds like a job for a benchmarking package such as [rbenchmark](http://cran.r-project.org/package=rbenchmark) or [microbenchmark](http://cran.r-project.org/package=microbenchmark) – GSee Dec 03 '12 at 18:10
  • Also, a reproducible example would go a long way. – GSee Dec 03 '12 at 18:14
  • Wouldn't this depend on what other processes you have running on your computer? Perhaps a stupid question, but maybe make sure you've quit other applications while doing your timing? – joran Dec 03 '12 at 18:21
  • @GSee & Joshua: I'll create an example after my long process has completed :). – AdamNYC Dec 03 '12 at 18:25
  • @joran: thanks for reminding me of this. I'm just using R and the browser for this question :D. Btw, my CPU is multicore, so I wonder whether, if I have a few apps running, it would allocate them to different cores instead of jamming them onto the core that R is using. – AdamNYC Dec 03 '12 at 18:26
  • What is your swap usage when you see this behavior? With 4G of RAM and a single almost 2.5G object, I would expect some contention. You might just need more RAM. – Matthew Lundberg Dec 07 '12 at 04:40
  • @MatthewLundberg Yes, normally, but `setkey` is carefully constructed to only need working memory as large as one column. It reorders column by column, doesn't copy the whole object. At least, that's the intention. So I'm intrigued what's happening here. – Matt Dowle Dec 10 '12 at 09:35
  • AdamNYC, Are you still able to reproduce this? – Arun Jul 20 '16 at 19:57

1 Answer


User time is how many seconds the computer spent doing your calculations. System time is how much time the operating system spent responding to your program's requests. Elapsed time is the sum of those two, plus whatever "waiting around" your program and/or the OS had to do. It's important to note that these numbers are the aggregate of time spent. Your program might compute for 1 second, then wait on the OS for one second, then wait on disk for 3 seconds and repeat this cycle many times while it's running.
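
If it helps to see where those three numbers come from, here's a rough sketch (plain POSIX C using gettimeofday() and getrusage(), so it should behave the same on Linux and OSX) that measures them from inside a program. The sleep() call is a stand-in for any kind of "waiting around": it shows up in elapsed time but adds nothing to user or sys.

#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/resource.h>

int main()
{
    struct timeval wall_start, wall_end;
    struct rusage usage;
    volatile double x = 0;
    long i;

    gettimeofday(&wall_start, NULL);

    for(i = 0; i < 100000000; i++)   /* burn some user time */
        x += i * 0.5;
    sleep(2);                        /* pure waiting: counts toward elapsed only */

    gettimeofday(&wall_end, NULL);
    getrusage(RUSAGE_SELF, &usage);  /* user and system CPU time used so far */

    double elapsed = (wall_end.tv_sec  - wall_start.tv_sec)
                   + (wall_end.tv_usec - wall_start.tv_usec) / 1e6;
    double user = usage.ru_utime.tv_sec + usage.ru_utime.tv_usec / 1e6;
    double sys  = usage.ru_stime.tv_sec + usage.ru_stime.tv_usec / 1e6;

    printf("user %.3f  sys %.3f  elapsed %.3f\n", user, sys, elapsed);
    return 0;
}

You should see roughly 2 seconds more elapsed time than user+sys, which is the same kind of gap you're seeing, with sleep() standing in for whatever your R job is waiting on.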

Based on the fact that your program took about as much system time as user time, it was doing something very IO-intensive: reading from or writing to disk a lot. RAM is pretty fast, a few hundred nanoseconds usually, so if everything fits in RAM elapsed time is usually just a little longer than user time. But a disk might take a few milliseconds to seek and even longer to actually return the data. That's slower than RAM by a factor of ten thousand or more.

We've determined that your processor was "doing stuff" for ~8 + ~8 = ~ 16 seconds. What was it doing for the other ~54 - ~16 = ~38 seconds? Waiting for the hard drive to send it the data it asked for.

UPDATE1:

Matthew has made some excellent points that I'm making assumptions I probably shouldn't be making. Adam, if you'd care to publish a list of all the columns in your table (the datatypes are all we need), we can get a better idea of what's going on.

I just cooked up a little do-nothing program to validate my assumption that time not spent in userspace and kernel space is likely spent waiting for IO.

#include <stdio.h>
int main()
{
    int i;
    for(i = 0; i < 1000000000; i++)
    {
        int j, k, l, m;
        j = 10;
        k = i;
        l = j + k;
        m = j + k - i + l;
    }
    return 0;
}

When I run the resulting program and time it I see something like this:

mike@computer:~$ time ./waste_user
real    0m4.670s
user    0m4.660s
sys 0m0.000s
mike@computer:~$ 

As you can see by inspection, the program does no real work, and as such it doesn't ask the kernel to do anything beyond loading it into RAM and starting it running. So nearly ALL the "real" time is spent as "user" time.

Now a kernel-heavy do-nothing program (with fewer iterations to keep the running time reasonable):

#include <stdio.h>
int main()
{
    FILE * random;
    random = fopen("/dev/urandom", "r");
    int i;
    for(i = 0; i < 10000000; i++)
    {
        fgetc(random);
    }
    return 0;
}

When I run that one, I see something more like this:

mike@computer:~$ time ./waste_sys
real    0m1.138s
user    0m0.090s
sys     0m1.040s
mike@computer:~$ 

Again it's easy to see by inspection that the program does little more than ask the kernel to give it random bytes. /dev/urandom is a non-blocking source of entropy. What does that mean? The kernel uses a pseudo-random number generator to quickly generate "random" values for our little test program. That means the kernel has to do some computation but it can return very quickly. So this program mostly waits for the kernel to compute for it, and we can see that reflected in the fact that almost all the time is spent on sys.

Now we're going to make one little change. Instead of reading from /dev/urandom, which is non-blocking, we'll read from /dev/random, which is blocking. What does that mean? The kernel doesn't do much computing; rather, it waits around for stuff to happen on your computer that the kernel developers have empirically determined is random. (We'll also do far fewer iterations, since this takes much longer.)

#include <stdio.h>
int main()
{
    FILE * random;
    random = fopen("/dev/random", "r");
    int i;
    for(i = 0; i < 100; i++)
    {
        fgetc(random);
    }
    return 0;
}

And when I run and time this version of the program, here's what I see:

mike@computer:~$ time ./waste_io
real    0m41.451s
user    0m0.000s
sys     0m0.000s
mike@computer:~$ 

It took 41 seconds to run, but immeasurably small amounts of time on user and sys. Why is that? All the time was spent in the kernel, but not doing active computation. The kernel was just waiting for stuff to happen. Once enough entropy was collected, the kernel woke back up and sent the data back to the program. (Note it might take much less or much more time to run on your computer, depending on what's going on.) I argue that the difference in time between user+sys and real is IO.

So what does all this mean? It doesn't prove that my answer is right because there could be other explanations for why you're seeing the behavior that you are. But it does demonstrate the differences between user compute time, kernel compute time and what I'm claiming is time spent doing IO.

Here's my source for the difference between /dev/urandom and /dev/random: http://en.wikipedia.org/wiki//dev/random

UPDATE2:

I thought I would try and address Matthew's suggestion that perhaps L2 cache misses are at the root of the problem. Your Core i5 (like the Core i7) has a 64-byte cache line. I don't know how much you know about caches, so I'll provide some details. When you ask for a value from memory, the CPU doesn't get just that one value; it gets the 64 bytes around it. That means if you're accessing memory in a very predictable pattern -- like say array[0], array[1], array[2], etc. -- it takes a while to get value 0, but then 1, 2, 3, 4... are much faster. Until you get to the next cache line, that is. If this were an array of ints, 0 would be slow, 1..15 would be fast, 16 would be slow, 17..31 would be fast, etc.

http://software.intel.com/en-us/forums/topic/296674

In order to test this out I've made two programs. They both have an array of structs with 1024*1024 elements. In one case the struct holds a single double; in the other it holds 8 doubles. A double is 8 bytes long, so in the second program each struct fills an entire cache line and we're accessing memory in the worst possible fashion for the cache. The first program gets to use the cache nicely.

#include <stdio.h>
#include <stdlib.h>
#define MANY_MEGS 1048576
typedef struct {
    double a;
} PartialLine;
int main()
{
    int i, j;
    PartialLine* many_lines;
    int total_bytes = MANY_MEGS * sizeof(PartialLine);
    printf("Striding through %d total bytes, %d bytes at a time\n", total_bytes, sizeof(PartialLine));
    many_lines = (PartialLine*) malloc(total_bytes);
    PartialLine line;
    double x;
    for(i = 0; i < 300; i++)
    {
        for(j = 0; j < MANY_MEGS; j++)
        {
            line = many_lines[j];
            x = line.a;
        }
    }
    return 0;
}

When I run this program I see this output:

mike@computer:~$ time ./cache_hits
Striding through 8388608 total bytes, 8 bytes at a time
real    0m3.194s
user    0m3.140s
sys     0m0.016s
mike@computer:~$

Here's the program with the big structs; each one takes up 64 bytes of memory, not 8.

#include <stdio.h>
#include <stdlib.h>
#define MANY_MEGS 1048576
typedef struct {
    double a, b, c, d, e, f, g, h;
} WholeLine;
int main()
{
    int i, j;
    WholeLine* many_lines;
    int total_bytes = MANY_MEGS * sizeof(WholeLine);
    printf("Striding through %d total bytes, %d bytes at a time\n", total_bytes, sizeof(WholeLine));
    many_lines = (WholeLine*) malloc(total_bytes);
    WholeLine line;
    double x;
    for(i = 0; i < 300; i++)
    {
        for(j = 0; j < MANY_MEGS; j++)
        {
            line = many_lines[j];
            x = line.a;
        }
    }
    return 0;
}

And when I run it, I see this:

mike@computer:~$ time ./cache_misses
Striding through 67108864 total bytes, 64 bytes at a time
real    0m14.367s
user    0m14.245s
sys     0m0.088s
mike@computer:~$ 

The second program -- the one designed to have cache misses -- took about five times as long to run for the exact same number of memory accesses.

Also worth noting is that in both cases, all the time was spent in user, not sys. That means the OS is counting the time your program has to wait for data against your program, not against the operating system. Given these two examples, I think it's unlikely that cache misses are what's causing your elapsed time to be substantially longer than your user time.

UPDATE3:

I just saw your update that the really slimmed-down table ran about 10x faster than the regular-sized one. That too would indicate to me that (as the other Matthew also said) you're running out of RAM.

Once your program tries to use more memory than your computer actually has installed, it starts swapping to disk. This is better than your program crashing, but it's much slower than RAM and can cause substantial slowdowns.

I'll try and put together an example that shows swap problems tomorrow.

UPDATE4:

Okay, here's an example program which is very similar to the previous one, but now the struct is 4096 bytes, not 8 or 64, and in total this program will use 2GB of memory rather than 64MB. I also changed things up a bit to make sure I access memory randomly instead of element-by-element, so that the kernel can't get smart and start anticipating my program's needs. The caches are driven by hardware (using only simple heuristics), but it's entirely possible that kswapd (the kernel swap daemon) could be substantially smarter than the cache.

#include <stdio.h>
#include <stdlib.h>
typedef struct {
    double numbers[512];
} WholePage;
int main()
{
    int memory_ops = 1024*1024;
    int total_memory = memory_ops / 2;
    int num_chunks = 8;
    int chunk_bytes = total_memory / num_chunks * sizeof(WholePage);
    int i, j, k, l;
    printf("Bouncing through %u MB, %d bytes at a time\n", chunk_bytes/1024*num_chunks/1024, sizeof(WholePage));
    WholePage* many_pages[num_chunks];
    for(i = 0; i < num_chunks; i++)
    {
        many_pages[i] = (WholePage*) malloc(chunk_bytes);
        if(many_pages[i] == 0){ exit(1); }
    }
    WholePage* page_list;
    WholePage* page;
    double x;
    for(i = 0; i < 300*memory_ops; i++)
    {
        j = rand() % num_chunks;
        k = rand() % (total_memory / num_chunks);
        l = rand() % 512;
        page_list = many_pages[j];
        page = page_list + k;
        x = page->numbers[l];
    }
    return 0;
}

Going from the program I called cache_hits to cache_misses, we saw the amount of memory increase 8x and the execution time increase 5x. What do you expect to see when we run this program? It uses 32x as much memory as cache_misses, but has the same number of memory accesses.

mike@computer:~$ time ./page_misses
Bouncing through 2048 MB, 4096 bytes at a time
real    2m1.327s
user    1m56.483s
sys     0m0.588s
mike@computer:~$ 

It took 8x as long as cache_misses and 40x as long as cache_hits. And this is on a computer with 4GB of RAM. I used 50% of my RAM in this program versus 1.5% for cache_misses and 0.2% for cache_hits. It got substantially slower even though it wasn't using up ALL the RAM my computer has. It was enough to be significant.

I hope this is a decent primer on how to diagnose problems with programs running slow.

Mike Sandford
  • How can you be so sure it's disk I/O in this case? – Matt Dowle Dec 06 '12 at 22:29
  • I can't. It's a guess. The other post mentions that it's 3 million rows and 40 columns and say 50 bytes per column on average. That's 6GB. Doesn't fit into RAM on a 4GB machine for sure. So it's a fairly decent guess, but you're right, still a guess. – Mike Sandford Dec 06 '12 at 22:36
  • 50 bytes? In R, integer=4, numeric=8, character=4(32bit) or 8(64bit) because character strings are cached. So at most `3e6 * 40 * 8 / 1024^3` = 0.9GB + the amount for the unique strings (if any) added to R's global cache. IIUC. – Matt Dowle Dec 06 '12 at 22:45
  • Well AdamNYC doesn't give us the full table definition so I've made the assumption that there's more to it than just ints and chars and string pointers. He could easily have a bunch of unique strings, dates, other things. Furthermore it's been my experience that when you see these big gaps between user+system time and elapsed time that it's IO as the culprit. Could I be wrong? Sure. – Mike Sandford Dec 06 '12 at 23:05
  • The other thing I'll mention is that spending as much time in the kernel as in userspace is a pretty big red flag that the kernel is doing a lot of work. Why would the kernel be doing so much work? If you run time ls -l /usr/bin you should see roughly equal amounts of time spent in user and sys and a lot more than the sum of them spent on the total. Why is that? It takes a bunch of time to push stuff out to the terminal which is IO time. – Mike Sandford Dec 06 '12 at 23:11
  • I was thinking basics, first. Might he have a virus checker scanning in the background, for example? Along the lines of @joran's comment. – Matt Dowle Dec 06 '12 at 23:40
  • Matthew you got me there. Sadly I already nuked the moon. Whoops! – Mike Sandford Dec 07 '12 at 00:14
  • Hi Mike and Matthew: I learned a lot from your discussion. Matthew: I did restart my computer and just run R with Firefox, and wait time is still long. Let me try it one more time later today and update you guys. I'll load var types as well. Is there a way to test if it is disk I/O for sure? – AdamNYC Dec 07 '12 at 00:14
  • Adam, you could try making another file/data table with nothing more than the three columns you want to sort on plus an "id" column which is nothing more than an index. If that takes the same amount of time to sort it means the problem has more to do with the sort and less to do with IO. If it sorts much faster it means the problem might be IO. – Mike Sandford Dec 07 '12 at 00:26
  • If there really is nothing else running, it might be L2 cache misses. I don't see where disk I/O would come into it really, unless you do have a lot of character data. Running the manipulation through `Rprof` might help, and run it twice too. Careful to drop the key before rerunning (`setkey(DT,NULL)`) otherwise it will just do nothing the 2nd time. Depends on how unsorted the original data is. As Mike says, need more info. Careful to take the minimum of 3 consecutive runs (or report all 3), since caching (OS and R) effects can easily muddle conclusions. – Matt Dowle Dec 07 '12 at 01:03
  • What are the types of the key columns? If any character cols there, try converting to factor. It might also be the way the multi-column ordering is done [FR#2419 fastorder could iterate left to right](https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2419&group_id=240&atid=978). Something reproducible would be great. – Matt Dowle Dec 07 '12 at 01:10
  • @AdamNYC I know you've restarted your computer but how do you know some system update / virus scan hasn't restarted on startup. Checking your system monitor is the way to check nothing is running. Sorry to labour the point but just want to rule that out before rolling sleeves up. – Matt Dowle Dec 07 '12 at 01:16
  • Hi Mike & Matthew D: Please see my update. BTW, I checked my system monitor. There's no system update or virus scan. – AdamNYC Dec 07 '12 at 02:00
  • +1 I think this is kinda great answer involving a lot of research. – Jonas Schäfer Dec 09 '12 at 22:47
  • @AdamNYC Thanks for object.size output. We did ask for column types as well; e.g. output of `str(Y)` please. – Matt Dowle Dec 10 '12 at 09:20
  • @MikeSandford Wow. +1 too, awesome. Adam now reports that `Y` is 2.3GB, and he has 4GB RAM. So, iiuc, you reckon the 2.3GB is being copied (for example), or working memory for the `setkey` could be 1GB-ish, causing the box to swap. Something like that? We're looking for memory usage, not cache inefficiency, then? Since the 2.3GB is a little over half his RAM. I'm aware that typing 'X' at the prompt is a copy (sadly, internally in R, it's a todo to look into it) and that 'print(X)' avoids that copy. Something like that? – Matt Dowle Dec 10 '12 at 09:32
  • I think that the execution profile of Adam's program looks to be somewhere in between the program that reads from /dev/random and /dev/urandom so there may well be some userspace disk IO going on that takes some time. I REALLY thought it was definitely swapping at first (and that's still a likely culprit) but time spent swapping counts against 'user' as evidenced by page_misses (at least on Linux, OSX might be different). Based on these examples it still seems like his program is doing some intensive IO somehow. But I think you're also taking a good tack looking for unintentional copies. – Mike Sandford Dec 10 '12 at 13:36
  • Adam if you had a few minutes to compile and run these programs and let us know what the 'time ./[program_name]' output is it would help us tell how OSX categorizes different kinds of delays. – Mike Sandford Dec 10 '12 at 13:42
  • Hi all: thanks for the discussion. I have a 2.3GB object (curiously, as an Rda it is only 70MB), with 1GB of wired memory, which I assume I can't use. I also have almost 1GB inactive that I might not have access to. @Mike: what would you like me to run? – AdamNYC Dec 10 '12 at 16:04
  • If you could copy&paste, compile and run the first three little programs I wrote (waste_user, waste_sys, waste_io) that would help us determine if OSX reports elapsed time the same way as Linux does. – Mike Sandford Dec 10 '12 at 16:15