Has anyone else noticed terrible performance when scaling up to use all the cores on a cloud instance with somewhat memory-intensive jobs (2.5GB each in my case)?

When I run jobs locally on my quad-core Xeon chip, using all 4 cores instead of 1 slows each job down by about 25%. This is to be expected from what I understand; a drop in clock rate as the cores get used up is part of the multi-core chip design.

But when I run the jobs on a multicore virtual instance, I am seeing a slowdown of about 2x - 4x in processing time between using 1 core and all cores. I've seen this on GCE, EC2, and Rackspace instances, and I have tested many different instance types, mostly the fastest offered.

So has this behavior been seen by others with jobs about the same size in memory usage?

The jobs I am running are written in Fortran. I did not write them, and I'm not really a Fortran guy, so my knowledge of them is limited. I know they have low I/O needs. They appear to be CPU-bound when I watch `top` as they run. They run without the need to communicate with each other, i.e., they are embarrassingly parallel. They each take about 2.5GB of memory.

So my best guess so far is that jobs that use this much memory take a big hit from the virtualization layer's memory management. It could also be that my jobs are competing for an I/O resource, but this seems highly unlikely according to an expert.

My workaround for now is to use GCE, because they have a single-core instance type that actually runs the jobs as fast as my laptop's chip, and instances are priced almost proportionally by core.

Jose Cortez
  • Hyper-threading can sometimes affect your results. In order to maximize CPU usage, you generally do not get complete processor cores in any instance except for the very largest ones, which can result in performance loss due to the scheduling overhead. – datasage Jun 09 '13 at 18:56
  • When I check `top`, it looks as though hyperthreading is not in use. For example, on EC2's `cc2.8xlarge` instance I'll run 16 processes (1 per core), and each process will say it's using 100% of its core, but the overall CPU usage is at 50%. Are you saying that under the virtualization covers, it could be employing hyperthreading? – Jose Cortez Jun 10 '13 at 17:10
  • It could be, and that has caused problems in some cases. There have been some threads on the EC2 forums about this issue. – datasage Jun 10 '13 at 19:16
  • If you look at the underlying physical hardware of the `cc2.8xlarge` instance you will see that it has 2 processors with 4 cores each. To get 16 cores it has to use hyperthreading, which can cause problems for some applications. – datasage Jun 10 '13 at 20:54
  • It has [2 x Intel Xeon E5-2670](http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-types.html), which is an 8 core chip. So that's 16 cores total, supposedly. – Jose Cortez Jun 10 '13 at 21:35
  • It looks like you are correct, I was seeing some conflicting information somewhere. – datasage Jun 10 '13 at 21:45
  • Are you sure `top` on a virtual machine will tell you if it's hyperthreaded on the metal? – jeremyjjbrown Feb 21 '14 at 02:59

1 Answer

You might be running into memory bandwidth constraints, depending on your data access pattern.

The Linux `perf` tool might give some insight into this, though I'll admit that I don't entirely understand your description of the problem. If I understand correctly:

  1. Running one copy of the single-threaded program on your laptop takes X minutes to complete.

  2. Running 4 copies of the single-threaded program on your laptop, each copy takes X * 1.25 minutes to complete.

  3. Running one copy of the single-threaded program on various cloud instances takes X minutes to complete.

  4. Running N copies of the single-threaded program on an N-core virtual cloud instance, each copy takes between X * 2 and X * 4 minutes to complete.

If so, it sounds like you're running into either kernel contention or contention for a shared resource such as memory bandwidth. It would be interesting to see whether various Fortran compiler options might help optimize memory access patterns; for example, enabling SSE2 load/store instructions or other optimizations. You might also compare results between GCC's and Intel's Fortran compilers.
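To put rough numbers on cases 1 through 4 above before digging in with `perf`, a small synthetic test can help separate "memory bandwidth is shared" from "something is wrong with the Fortran code". The sketch below is only a stand-in, not your workload: it assumes a plain buffer copy is a fair proxy for a memory-heavy job, and the names `stream_copy`, `run_copies` and `slowdown`, as well as the 32 MiB buffer size and pass count, are hypothetical choices made for illustration.

```python
import multiprocessing as mp
import os
import sys
import time


def stream_copy(size_bytes: int, passes: int) -> float:
    """Copy one large buffer into another `passes` times; return elapsed seconds.

    The slice assignment is a bulk memory copy, so with buffers larger than
    the CPU caches the loop is limited mostly by memory bandwidth.
    """
    src = bytearray(size_bytes)
    dst = bytearray(size_bytes)
    start = time.perf_counter()
    for _ in range(passes):
        dst[:] = src
    return time.perf_counter() - start


def _worker(size_bytes: int, passes: int, results) -> None:
    # Each process reports its own elapsed time, like one independent job.
    results.put(stream_copy(size_bytes, passes))


def run_copies(n_copies: int, size_bytes: int, passes: int) -> list:
    """Run n_copies independent processes at once; return each one's time."""
    results = mp.Queue()
    procs = [
        mp.Process(target=_worker, args=(size_bytes, passes, results))
        for _ in range(n_copies)
    ]
    for p in procs:
        p.start()
    times = [results.get() for _ in procs]
    for p in procs:
        p.join()
    return times


def slowdown(single_time: float, parallel_times: list) -> float:
    """Ratio of the slowest parallel copy to the single-copy time."""
    return max(parallel_times) / single_time


if __name__ == "__main__":
    # Assumed defaults: one copy per reported CPU, 32 MiB buffers, 10 passes.
    n = int(sys.argv[1]) if len(sys.argv) > 1 else (os.cpu_count() or 1)
    size, passes = 32 * 1024 * 1024, 10
    alone = run_copies(1, size, passes)[0]
    together = run_copies(n, size, passes)
    print(f"1 copy: {alone:.3f}s")
    print(f"{n} copies, slowest: {max(together):.3f}s")
    print(f"slowdown: {slowdown(alone, together):.2f}x")
```

If this shows the same 2x to 4x ratio on the cloud instances and roughly 1.25x on your local machine, contention for memory bandwidth (or for physical cores shared between virtual CPUs) becomes the more likely explanation, and compiler flags alone may not help much. If the synthetic test scales fine while the Fortran jobs do not, the access pattern of the jobs themselves deserves a closer look.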

E. Anderson