We can always move data into the cache, if active, by simply performing a memory access.
We can prefetch a variable simply by "touching" it ahead of time; we don't need a special instruction for that.
It's unclear what you mean by "control over the cache" as we can enable/disable it, set its mode, its fill/spill policy and sharing mode with other HW threads.
We can also fill the cache with data and by clever use of arithmetic force the eviction of a line.
Your assumption that programmers have no control whatsoever over the cache is thus not entirely valid, though not entirely wrong either: the CPU is free to implement any cache policy it wants as long as it respects the documented specification (including having no cache at all, or spilling the cache every X clock ticks).
One thing we cannot do, yet, is pin lines in the cache: we cannot tell the CPU never to evict a specific line.
EDIT: As @Mysticial pointed out in the comments, it is possible to pin data in the L3 cache on newer Intel CPUs.
The PREFETCHT0, PREFETCHT1, PREFETCHT2, PREFETCHNTA and PREFETCHWT1 instructions, to which _mm_prefetch compiles, are just a hint for the hardware prefetchers, if present, active, and willing to respect the hint1.
Their limited use cases3 stem more from the finer control over the cache-hierarchy level at which the data will stop and from the reduced use of core resources2 than from any ability to move data into the cache.
Once a line has been prefetched, it is evicted from the cache just as any other line would be.
1 These hardware prefetchers are usually triggered by memory access patterns (like sequential accesses) and are asynchronous with respect to the execution flow.
2 They are asynchronous by nature (they quickly complete locally) and need not tie up the core resources a load would (e.g. a register, a load-unit entry, and so on).
3 While one may think that a hint is at worst useless (if not respected), it can actually turn out that prefetching degrades performance.