Do all threads try to access memory at the same time, leading to serial execution, or do they all make their own copies, or something else?
No. If each thread works on a different piece of data there is no serialization and no copying. For instance, to add two arrays in parallel you would do:
// each thread computes its own global index and touches only its own element
int idx = blockIdx.x * blockDim.x + threadIdx.x;
outArr[idx] = a[idx] + b[idx];
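Wrapped in a complete kernel with a bounds check, that might look like the sketch below (addArrays, n, the launch configuration, and the device pointers d_a/d_b/d_out are placeholder names I picked, not anything from your code):

__global__ void addArrays(const float* a, const float* b, float* outArr, int n)
{
    // one thread per element; the check covers the last, partially filled block
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        outArr[idx] = a[idx] + b[idx];
}

// launch with enough blocks to cover all n elements
int threadsPerBlock = 256;
int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
addArrays<<<blocks, threadsPerBlock>>>(d_a, d_b, d_out, n);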
Each thread inside the grid will do two reads (on the right-hand side) from two different locations and one write to a third location, all in global memory. You can also let all threads read from/write to the same location in global memory; however, to prevent race conditions, you need to use atomic functions.
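For example, if every thread needs to add its value into a single accumulator in global memory, a minimal sketch (sumAll, data, result, and n are names I made up) would be:

__global__ void sumAll(const float* data, float* result, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        // atomicAdd makes the concurrent read-modify-write of *result safe
        atomicAdd(result, data[idx]);
}

Without the atomicAdd, a plain *result += data[idx] would be a race condition and the final sum would be wrong.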
Reads/writes from/to global memory can be slow (it's DRAM), especially if threads do not access coalesced memory (e.g., if threads 0, 1, 2, and 3 read from 0x0, 0x4, 0x8, 0xC, the access is coalesced). To understand more about the CUDA memory model, you can read section 2.4 in the CUDA Programming Guide.
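To illustrate the difference (just a sketch, and STRIDE is a made-up constant, not something from your code):

// coalesced: threads 0, 1, 2, 3 of a warp read consecutive 4-byte words
float x = a[blockIdx.x * blockDim.x + threadIdx.x];

// not coalesced: consecutive threads read words STRIDE elements apart,
// so one warp's loads are spread over many memory segments
float y = a[(blockIdx.x * blockDim.x + threadIdx.x) * STRIDE];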
Hope that helps!