On x86 processors, is there a way to load data from regular write-back memory into registers without going through the cache hierarchy?
My use case is that I have a big lookup structure (a hash map or B-tree). I am working through a large stream of numbers (much bigger than my L3 cache, but it fits in memory). What I am trying to do is very simple:
    int result = 0;
    for (int num : stream_numbers) {
        // Each stream element is read exactly once.
        int lookup_result = lookup_using_b_tree(num);
        result += do_some_math_that_touches_registers_only(lookup_result);
    }
    return result;
Since I visit every number only once and the stream as a whole is larger than L3, I imagine the stream data will end up evicting some of the cache lines that hold parts of my B-tree. Ideally, I'd like none of the numbers from this stream to be allocated in cache, since they have no temporal locality at all (each is read only once). That way I can maximize the chances that my B-tree stays in cache and lookups are faster.
I have looked at the (v)movntdqa instructions available in SSE4.1 for non-temporal loads. They don't seem to be a good fit, because they appear to bypass the cache only for uncacheable write-combining memory. This old article from Intel claims that:
Future generations of Intel processors may contain optimizations and enhancements for streaming loads, such as increased utilization of the streaming load buffers and support for additional memory types, creating even more opportunities for software developers to increase the performance and energy-efficiency of their applications.
However, I am unaware of any such processor today. I have read elsewhere that a processor can simply choose to ignore this hint for write-back memory and treat the load as a movdqa instead. So is there any way to load from regular write-back memory without going through the cache hierarchy on x86 processors, even if it is only possible on Haswell and later models? I'd also appreciate any information on whether this will be possible in the future.
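For concreteness, here is a minimal sketch of what the streaming-load approach looks like with intrinsics. It assumes GCC or Clang on x86, a 16-byte-aligned buffer, and a length that is a multiple of 4; `sum_stream` is a hypothetical name used only for illustration, and a plain sum stands in for the B-tree lookup. On write-back memory the processor may treat the movntdqa hint as an ordinary load, so this shows only the mechanics, not a guaranteed cache bypass.

```cpp
#include <cstddef>
#include <cstdint>

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <immintrin.h>
#define HAVE_X86_STREAM_LOAD 1
#endif

// Hypothetical helper: sums `count` 32-bit integers from `data`.
// Assumptions: `data` is 16-byte aligned and `count` is a multiple of 4.
#if defined(HAVE_X86_STREAM_LOAD)
__attribute__((target("sse4.1")))  // enable SSE4.1 for this function only
#endif
std::int64_t sum_stream(const std::int32_t* data, std::size_t count) {
    std::int64_t total = 0;
#if defined(HAVE_X86_STREAM_LOAD)
    for (std::size_t i = 0; i < count; i += 4) {
        // movntdqa: a non-temporal load *hint*. On write-back memory the
        // CPU is free to handle it like a regular movdqa.
        __m128i v = _mm_stream_load_si128(
            reinterpret_cast<__m128i*>(const_cast<std::int32_t*>(data + i)));

        // Spill the four lanes to an aligned scratch array and add them up.
        alignas(16) std::int32_t lanes[4];
        _mm_store_si128(reinterpret_cast<__m128i*>(lanes), v);
        total += static_cast<std::int64_t>(lanes[0]) + lanes[1] + lanes[2] + lanes[3];
    }
#else
    // Portable fallback for other targets: ordinary cached loads.
    for (std::size_t i = 0; i < count; ++i) {
        total += data[i];
    }
#endif
    return total;
}
```

The result is the same as with regular loads; the only difference is the load instruction that is issued, which is why any benefit depends entirely on how the processor handles the hint for the memory type in question.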