Let's look at two of your three lines of code:
sum[0] = sum[0] + a[tid] * b[tid]; // every thread writes to sum[0]
__syncthreads();
The first line contains a memory race: every thread in the block simultaneously attempts a read-modify-write on sum[0]. There is nothing in the CUDA execution model that prevents this; there is no automatic serialization and no memory protection, so the result of the conflicting writes is undefined.
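If all you need is correctness rather than speed, the least invasive repair is to make the update atomic so the hardware serializes the conflicting writes. A minimal sketch, assuming sum[0] lives in global memory, was zeroed before the kernel launch, and that single precision is acceptable (atomicAdd on float requires compute capability 2.0 or newer):

    // Hardware-serialized read-modify-write: correct, but every
    // thread in the grid now contends for the same address.
    // (Assumes sum[0] was zero-initialized before the launch.)
    atomicAdd(&sum[0], a[tid] * b[tid]);

This is about the slowest possible way to compute a dot product, so treat it only as a demonstration of what correctness requires.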
The second line, __syncthreads(), is an execution barrier: every warp in the block is blocked there until every thread in the block has reached it. It cannot repair what happened on the first line. A barrier orders when threads proceed (and makes prior shared and global memory writes visible across the block), but it does not serialize or protect conflicting writes to the same address, so which write to sum[0] survives is still undefined.
The code you have written is irretrievably broken. The canonical way to perform this sort of operation is a parallel reduction. There are a number of different ways it can be done; it is probably the most thoroughly described and documented parallel algorithm for GPUs. If you have installed the CUDA toolkit, you already have both a complete working example and a comprehensive paper describing how the algorithm is implemented using shared memory. I suggest you study it.
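To make the idea concrete, here is a sketch of that shared-memory block reduction applied to the dot product. The kernel name, BLOCK_SIZE, and the partial[] output array are my own illustrative choices, not anything from your code; a power-of-two block size launched with blockDim.x == BLOCK_SIZE is assumed, and the per-block results still have to be summed on the host or in a second kernel:

    #define BLOCK_SIZE 256  // must match blockDim.x at launch

    __global__ void dot_kernel(const float *a, const float *b,
                               float *partial, int n)
    {
        __shared__ float cache[BLOCK_SIZE];

        int gid    = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = gridDim.x * blockDim.x;

        // Phase 1: each thread accumulates a private partial sum;
        // the grid-stride loop handles arbitrary n.
        float temp = 0.0f;
        for (int i = gid; i < n; i += stride)
            temp += a[i] * b[i];
        cache[threadIdx.x] = temp;

        // Barrier: cache[] must be fully written before any thread
        // reads another thread's slot.
        __syncthreads();

        // Phase 2: tree reduction in shared memory, halving the
        // number of active threads each step (power-of-two block).
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s)
                cache[threadIdx.x] += cache[threadIdx.x + s];
            __syncthreads();
        }

        // Exactly one thread per block writes the block's result.
        if (threadIdx.x == 0)
            partial[blockIdx.x] = cache[0];
    }

Note the role __syncthreads() plays here: it separates a phase in which threads write shared memory from a phase in which other threads read those locations. That is the job it is designed for, and the only job it does.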
You can see an (almost) working implementation of a dot product using shared memory here, which I recommend you study as well. You can also find optimized implementations of the parallel block reduction in libraries such as CUB.
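For completeness, a sketch of what the CUB version of the per-block sum might look like. The kernel name and partial[] array are again illustrative; cub::BlockReduce needs the block size as a compile-time constant, and its result is only defined in thread 0:

    #include <cub/cub.cuh>

    template <int BLOCK_THREADS>
    __global__ void dot_cub(const float *a, const float *b,
                            float *partial, int n)
    {
        typedef cub::BlockReduce<float, BLOCK_THREADS> BlockReduce;
        __shared__ typename BlockReduce::TempStorage temp_storage;

        int gid    = blockIdx.x * BLOCK_THREADS + threadIdx.x;
        int stride = gridDim.x * BLOCK_THREADS;

        // Private per-thread partial sum, as before.
        float thread_sum = 0.0f;
        for (int i = gid; i < n; i += stride)
            thread_sum += a[i] * b[i];

        // CUB performs a tuned intra-block reduction; the aggregate
        // is only valid in thread 0.
        float block_sum = BlockReduce(temp_storage).Sum(thread_sum);
        if (threadIdx.x == 0)
            partial[blockIdx.x] = block_sum;
    }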