I have an embarrassingly parallel algorithm that runs inside a Parallel.ForEach block on my 4-core machine in about 5 minutes. When I ran the same code on an 8-core machine, it ran much longer (I gave up after 10 minutes, so I don't know exactly how long). Since .NET uses a shared-memory model for this kind of thing, I'm guessing that contention for main memory is creating a bottleneck.
So my question is: is there a way to make n copies of my data (where n is the number of available cores) and assign one copy to each core, thus removing the bottleneck?
What I basically (and somewhat oxymoronically) want is something like distributed memory, but within a single machine.
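For the per-core-copy idea, one technique worth noting is the `localInit`/`localFinally` overload of `Parallel.ForEach`, which gives each worker thread its own private state so the threads never write to shared data during the loop. A minimal sketch (the summation is just a stand-in for your algorithm; per-thread copies of any lookup data could be made in `localInit`):

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

class PerThreadStateDemo
{
    static void Main()
    {
        int[] sourceData = Enumerable.Range(0, 1_000_000).ToArray();
        long total = 0;

        Parallel.ForEach(
            sourceData,
            // localInit: runs once per worker thread and returns that
            // thread's private state (a private copy of shared data
            // could also be created here).
            () => 0L,
            // body: accumulate into the thread-local value only —
            // no locks, no writes to shared memory.
            (item, loopState, localSum) => localSum + item,
            // localFinally: merge each thread's result exactly once.
            localSum => Interlocked.Add(ref total, localSum));

        Console.WriteLine(total);
    }
}
```

This keeps all per-iteration work on thread-private state and pays the synchronization cost only once per thread, in `localFinally`.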
UPDATE
I re-ran the code on the 8-core machine. On my 4-core machine the CPU usage (per Task Manager) maxes out for the duration of the run, but on the 8-core machine it hovered at about 50-60% the whole time. I wonder if this is indicative of something?
UPDATE 2
I implemented MPI.NET in my program and now get 100% CPU usage on all cores, plus I can use cores on other machines.
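For anyone hitting the same problem, the MPI.NET approach looks roughly like the sketch below: each rank is a separate process with its own address space, so there is no shared-memory contention by construction. `ProcessChunk` is a placeholder for the per-rank work; the program is launched under `mpiexec` with one process per core.

```csharp
using System;
using MPI;

class MpiSketch
{
    static void Main(string[] args)
    {
        // Initializes MPI and tears it down on dispose.
        using (new MPI.Environment(ref args))
        {
            Intracommunicator world = Communicator.world;

            // Each process (rank) works on its own slice of the problem
            // in its own private memory.
            int chunkResult = ProcessChunk(world.Rank, world.Size);

            // Combine the per-rank results on rank 0.
            int total = world.Reduce(chunkResult, Operation<int>.Add, 0);
            if (world.Rank == 0)
                Console.WriteLine($"total = {total}");
        }
    }

    // Placeholder for the real algorithm: pretend each rank
    // handles 1/size of the data.
    static int ProcessChunk(int rank, int size)
    {
        return rank;
    }
}
```

Because every rank owns its data outright, this is effectively the "distributed memory within one machine" model from the original question, and the same binary scales out to other machines.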