
Although I have read about the movntdqa instruction in this regard, I haven't figured out a clean way to mark a memory range as uncacheable, or to read data without polluting the cache. I want to do this from GCC. My main goal is to swap random locations in a large array, and I'm hoping to accelerate this operation by avoiding caching, since there is very little data reuse.
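For reference, the workload is roughly the following (a minimal sketch only; the element type, array size and index generation here are placeholders, not my real code):

```c
#include <stdlib.h>

/* Sketch of the workload: swap pairs of elements at random
 * positions in a large array. Each element is touched only a
 * few times, so cached lines are rarely reused. */
static void random_swaps(double *array, size_t n, size_t num_swaps)
{
    for (size_t i = 0; i < num_swaps; i++) {
        size_t a = (size_t)rand() % n;   /* placeholder index generator */
        size_t b = (size_t)rand() % n;
        double tmp = array[a];
        array[a]  = array[b];
        array[b]  = tmp;
    }
}
```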

Kabira K
There's a way to do this on Windows for sure. I'm not sure about GCC on Linux, though. However, I'm not sure you want to declare read memory as uncachable. Although you won't pollute the cache, you (might) be paying the full memory latency for each access. – Mysticial Sep 14 '11 at 06:37
  • I agree that it may not improve the performance. But it would be nice to know how to use this feature. – Kabira K Sep 14 '11 at 21:25
    There is no way to disable the cpu cache. Nor would you ever want to, it will make it horribly slow. Uncached memory reads easily take more than a hundred cycles. – Hans Passant Sep 18 '11 at 09:40
    @Hans Passant: you're wrong with the first statement - e.g. on the x86 architecture you can completely disable caching by setting the PCD bit in the CR3 register, or just for some pages by setting the PCD bit for a particular page. As for the second statement, Sandeep describes a situation where the cache can be really useless - although it may not speed up the main app, it can save precious cache for other threads. – Radim Vansa Sep 22 '11 at 19:07
  • And just how big is a "large array" ? 10MB, 100MB, 1GB, 10GB... more ??? – timday Sep 23 '11 at 22:51
    @HansPassant Surely the cache can be disabled. You can use MTRRs or Page Attribute Tables for that. – Gunther Piez Oct 09 '12 at 22:55
  • Linux non x86 specific version: https://stackoverflow.com/questions/885658/is-it-possible-to-allocate-in-user-space-a-non-cacheable-block-of-memory-on-li – Ciro Santilli OurBigBook.com Aug 24 '17 at 07:25

2 Answers


I think what you're describing is Memory Type Range Registers. You can control these under Linux (if they're available and you're root) using /proc/mtrr or ioctl(2); see here for an example. As it works on physical address ranges, I think you're going to have a hard time using it in a reasonable way.
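For completeness, the ioctl(2) route looks roughly like this (a sketch only; the physical base address and size are made up, and you'd need to know the actual physical address of your buffer, which is the hard part from user space):

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <asm/mtrr.h>   /* struct mtrr_sentry, MTRRIOC_ADD_ENTRY */

int main(void)
{
    /* Hypothetical physical range - example values only. */
    struct mtrr_sentry sentry = {
        .base = 0x10000000,           /* physical base address */
        .size = 0x00400000,           /* 4 MiB */
        .type = MTRR_TYPE_UNCACHABLE, /* mark the range uncacheable */
    };

    int fd = open("/proc/mtrr", O_WRONLY);
    if (fd < 0) { perror("open /proc/mtrr"); return 1; }

    if (ioctl(fd, MTRRIOC_ADD_ENTRY, &sentry) < 0)
        perror("MTRRIOC_ADD_ENTRY");

    close(fd);
    return 0;
}
```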

A better way is to look at the compiler intrinsics GCC provides and find one or more that express your intent. Have a look at Ulrich Drepper's series "What Every Programmer Should Know About Memory", in particular part 5, which deals with bypassing the cache. It looks like _mm_prefetch(ptr, _MM_HINT_NTA) might be appropriate for your needs.
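A rough sketch of how the non-temporal prefetch hint might be woven into the swap loop (the pre-generated index array and the one-iteration lookahead are illustrative choices, not a prescription):

```c
#include <stddef.h>
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_NTA */

/* Prefetch the locations needed by the next iteration with the NTA
 * hint, so the lines are staged near the core but marked
 * non-temporal and displace as little of the cache as possible. */
static void swap_with_nta_prefetch(double *array,
                                   const size_t *idx, size_t num_swaps)
{
    for (size_t i = 0; i + 1 < num_swaps; i++) {
        /* idx[] holds pre-generated random positions, two per swap;
         * the final swap is omitted here for brevity. */
        size_t a = idx[2 * i], b = idx[2 * i + 1];

        _mm_prefetch((const char *)&array[idx[2 * i + 2]], _MM_HINT_NTA);
        _mm_prefetch((const char *)&array[idx[2 * i + 3]], _MM_HINT_NTA);

        double tmp = array[a];
        array[a]   = array[b];
        array[b]   = tmp;
    }
}
```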

As always when it comes to performance - measure, measure, measure. Drepper's series has excellent parts detailing how this can be done (part 7) as well as code examples and other strategies to try when speeding up the memory performance of your code.
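For example, a bare-bones timing harness might look like this (a sketch only; run_kernel stands in for whichever variant you're comparing, and a real benchmark needs warm-up runs and repetitions):

```c
#include <stdio.h>
#include <time.h>

/* Placeholder for the variant under test: plain loop, NTA prefetch,
 * streaming stores, huge pages, etc. */
extern void run_kernel(void);

int main(void)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    run_kernel();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec)
                + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("elapsed: %.6f s\n", secs);
    return 0;
}
```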

user786653

All good advice from user786653; the Ulrich Drepper article especially. I'll add:

  • Uncached or not, the VM hardware is going to have to look up page info in the TLB, which has a limited capacity. Don't underestimate the impact of TLB thrashing on random-access performance. If you're not already, see the results here for why you really want to be using huge pages for your array data and not the teeny 4K default (which goes back to the days of "640K ought to be enough for anybody"); a huge-page allocation sketch follows after this list. Of course, if you're talking about arrays so huge that even a TLB full of 2MB pages can't cover them, that won't help either.

  • What have you got against the 'nt' instructions (e.g. the _mm_stream_ps intrinsic)? I'm unconvinced declaring pages uncached will get you any better performance than appropriate use of those, and they're much easier to use than the alternatives; a streaming-store sketch also follows below. I'd be very interested to see evidence to the contrary, though.
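On the huge-page point, a rough allocation sketch (the 1 GiB size is illustrative; MAP_HUGETLB needs huge pages reserved by the admin, and madvise(MADV_HUGEPAGE) is the transparent-huge-page fallback):

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define ARRAY_BYTES (1UL << 30)   /* illustrative: 1 GiB of array data */

static double *alloc_huge(void)
{
    /* Try explicitly reserved huge pages first (needs a hugetlbfs pool). */
    void *p = mmap(NULL, ARRAY_BYTES, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p != MAP_FAILED)
        return p;

    /* Fall back to normal pages and ask the kernel to back the
     * range with transparent huge pages. */
    p = mmap(NULL, ARRAY_BYTES, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return NULL; }
    madvise(p, ARRAY_BYTES, MADV_HUGEPAGE);
    return p;
}
```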
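And a sketch of what I mean by using streaming stores for the swap (SSE2 _mm_stream_si128, i.e. movntdq; this assumes 16-byte-aligned data and that whole 16-byte chunks are being swapped, which may or may not match your layout):

```c
#include <emmintrin.h>   /* SSE2: _mm_load_si128, _mm_stream_si128 */

/* Swap two 16-byte-aligned chunks. The reads go through the cache as
 * usual, but the writes use movntdq so the written lines are not
 * kept around afterwards. */
static void swap_chunks_nt(__m128i *a, __m128i *b)
{
    __m128i va = _mm_load_si128(a);
    __m128i vb = _mm_load_si128(b);
    _mm_stream_si128(a, vb);
    _mm_stream_si128(b, va);
}

/* After the last swap, order the streaming stores before any loads
 * that may read the data back. */
static void finish_streaming(void)
{
    _mm_sfence();
}
```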

timday