How big a difference? Tiny differences are to be expected. Floating-point addition is commutative but not associative, so the way the operations are grouped changes the result. That is:
serialised_total = 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1
parallelised_total = (0.1 + 0.1 + 0.1) + (0.1 + 0.1 + 0.1)
# No actual parallelisation is performed. The above is just an example of how
# the serialised summation could be broken up into two separate summations.
assert serialised_total != parallelised_total
# 0.6 != 0.6000000000000001
The results on each side are still very close; they're just not exactly the same. See this answer for why.
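Because of this, it's usually better to compare floating-point results with a tolerance rather than with ==. Python's math.isclose is one way to do that; here is a minimal sketch using the totals from above:

import math

a = 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1
b = (0.1 + 0.1 + 0.1) + (0.1 + 0.1 + 0.1)
print(a == b)              # False: the last bits differ
print(math.isclose(a, b))  # True: equal within the default relative tolerance of 1e-09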
If you are using the GPU, it will be making use of parallelisation, so the order of operations will not be the same. For instance, if you sum a series of floating-point values, you can speed things up by breaking the list into chunks and sending each chunk to a different core to be summed, then summing the per-chunk results. This is much quicker, but the order of operations is different from summing the values serially, as the sketch below shows.
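Here is a minimal sketch of chunked summation in plain Python. No real parallelisation happens here; the chunking alone is enough to change the grouping, and hence the result. The values and chunk size are arbitrary choices for illustration:

values = [0.1] * 6

# Serial sum: strictly left to right.
serial_total = 0.0
for v in values:
    serial_total += v

# "Parallel" sum: sum each chunk, then sum the chunk totals.
# A real implementation would send each chunk to a different core.
chunk_size = 3
chunk_totals = [sum(values[i:i + chunk_size])
                for i in range(0, len(values), chunk_size)]
chunked_total = sum(chunk_totals)

print(serial_total)   # 0.6
print(chunked_total)  # 0.6000000000000001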
In the example above, it is the "parallelised" total that is less accurate than the "serialised" total. That is not a rule, though; sometimes the "parallelised" total is the more accurate one. For example:
# n = 8
serialised_total = 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1
parallelised_total = (0.1 + 0.1 + 0.1 + 0.1) + (0.1 + 0.1 + 0.1 + 0.1)
assert serialised_total != parallelised_total
# 0.7999999999999999 != 0.8
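If you want to know which grouping is actually closer to the true value, Python's math.fsum computes a correctly rounded sum and makes a handy reference:

import math

values = [0.1] * 10
print(sum(values))        # 0.9999999999999999 (plain left-to-right sum)
print(math.fsum(values))  # 1.0 (correctly rounded, the best possible result)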
Without knowing more about your problem, any answer is just speculation about the issue. Including this one.