I have been using MSVC 2013 for a few days, and my application crashes when executing the following code (sparse matrix multiplied by a vector; pseudocode: A = this * pVector):
complex<double> x = (A.getValue(lRow) + (mValues[lIdx] * pVectorB->getValue(lCol)));
Before that I used MSVC 2005, and the application ran fine.
The following exception is thrown: First-chance exception at 0x000000014075D1D2 in psc64.exe: 0xC0000005: Access violation reading location 0xFFFFFFFFFFFFFFFF.
I tracked the crash down to this instruction:
addpd xmm6, xmmword ptr [rax+rbx*8]
It crashes only with optimization /O2 (maximize speed), not with optimization disabled (/Od).
I can also avoid the crash by adding code (cout<<"bla bla") to the method pVectorB->getValue(lCol).
I suspected an uninitialized variable but could not find one, so I looked at the disassembly.
I checked XMM6 and the value at ptr [rax+rbx*8]. They are the same in the run that crashes and in the run that does not (the one with cout<<"bla bla").
Is there anything else I should look at other than XMM6 and the value at ptr [rax+rbx*8]?
I have been hunting this problem for quite some time but could not find any hint that leads me to the line of code I have to correct.
Any help is highly appreciated. Thank you.

The code for getValue:

template <class T> class Vector
{
public:
    // Returns element pIdx; throws MathException if the index is out of range.
    const T& getValue(const int pIdx) const
    {
      if(false == checkBounds(pIdx)){
        throw MathException(__FILE__, __LINE__, "T& Vector<class T>::getValue(const pIdx): checkBounds fails pIdx = %i", pIdx);
      }
      return mVal[pIdx];
    }

    // Returns true if 0 <= pIdx < mMaxSize; logs the failing condition otherwise.
    bool checkBounds(const int pIdx) const
    {
      bool ret = true;
      if(pIdx >= mMaxSize){
        DBG_SEVERE2("pIdx >= mMaxSize, pIdx = %i, mMaxSize = %i", pIdx, mMaxSize);
        ret = false;
      }
      if(pIdx < 0){
        DBG_SEVERE1("pIdx < 0, pIdx = %i", pIdx);
        ret = false;
      }
      return ret;
    }
};



The allocation of mVal:

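// 4 elements plus 2 marker bytes in front and 2 marker bytes behind
void* lTmp= calloc((4 * sizeof(complex<double>))+4, 1);
((char*)lTmp)[0]        = 0xC;
((char*)lTmp)[1]        = 0xC;
((char*)lTmp)[(4 * sizeof(complex<double>)) + 2]    = 0xC;
((char*)lTmp)[(4 * sizeof(complex<double>)) + 3]    = 0xC;
// mVal starts 2 bytes past the calloc result, so it is not 16-byte aligned
mVal= (void*)(((char*)lTmp) + 2);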



SOLUTION:
As suggested, it works without the 2 bytes in front of and behind the desired array (mVal). It also works with a multiple of 16 bytes before and after the array.
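
A minimal sketch of the second variant, with 16-byte guard bands that can still be checked when the array is freed. The names allocGuarded, guardsIntact, freeGuarded and isAligned16 are hypothetical and not from the original code, and the sketch assumes calloc returns 16-byte aligned memory (which is the case on x64 Windows and on common 64-bit Linux and macOS runtimes; isAligned16 lets you verify it).

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical constants: guard size is a multiple of the SSE alignment,
// the fill byte is the same marker value used in the question.
const std::size_t kGuard = 16;
const unsigned char kFill = 0x0C;

// Allocates n zeroed elements with a kGuard-byte guard band on each side.
// Assumes calloc returns 16-byte aligned memory, so the payload stays aligned.
std::complex<double>* allocGuarded(std::size_t n)
{
    const std::size_t payload = n * sizeof(std::complex<double>);
    unsigned char* raw =
        static_cast<unsigned char*>(std::calloc(payload + 2 * kGuard, 1));
    if (raw == nullptr) return nullptr;
    std::memset(raw, kFill, kGuard);                     // front guard
    std::memset(raw + kGuard + payload, kFill, kGuard);  // back guard
    return reinterpret_cast<std::complex<double>*>(raw + kGuard);
}

// Returns true if neither guard band has been overwritten.
bool guardsIntact(const std::complex<double>* p, std::size_t n)
{
    const unsigned char* raw =
        reinterpret_cast<const unsigned char*>(p) - kGuard;
    const std::size_t payload = n * sizeof(std::complex<double>);
    for (std::size_t i = 0; i < kGuard; ++i) {
        if (raw[i] != kFill || raw[kGuard + payload + i] != kFill) return false;
    }
    return true;
}

// Checks the guards (debug builds only) and releases the whole block.
void freeGuarded(std::complex<double>* p, std::size_t n)
{
    if (p == nullptr) return;
    assert(guardsIntact(p, n));
    std::free(reinterpret_cast<unsigned char*>(p) - kGuard);
}

// Returns true if p lies on a 16-byte boundary.
bool isAligned16(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}
```

With this layout the array would be obtained with allocGuarded(4) instead of the calloc plus 2-byte offset, and guardsIntact can be called at any point, not only when freeing.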

Martin
  • did you write the multiplication in assembly? – melak47 Feb 02 '14 at 13:15
  • can you post the source for the getValue method? – rohitsan Feb 02 '14 at 13:18
  • The multiplication is not assembly. It is pure C++ using complex functions. – Martin Feb 02 '14 at 13:21
  • Then I think it would be more helpful to show more of the source, a self-contained, compilable, snippet of code demonstrating the problem would be ideal. – melak47 Feb 02 '14 at 13:24
  • Please see http://stackoverflow.com/questions/13013717/array-error-access-violation-reading-location-0xffffffff There is most likely an alignment issue here. – rohitsan Feb 02 '14 at 13:34
  • The size of Vector A is 4. This means I reserved an array 4 times the size of complex. To check whether I write before or after the allocated array, I reserve 2 extra bytes in front and 2 extra bytes at the end, which I check later when freeing the array. Does this cause any trouble with the alignment? – Martin Feb 02 '14 at 13:44
  • It sounds like it. The auto vectorizer is using SSE instructions and I believe that data that is used by those instructions must be 16 byte aligned. Remove the extra 2 byte allocations before and after Vector and that should resolve the problem. VS2005 did not have the auto vectorizer capability. – rohitsan Feb 02 '14 at 13:51
  • Yes, this works. But how could I check whether I write beyond the reserved array? – Martin Feb 02 '14 at 14:09
  • Thanks rohitsan for your hint. Now I am using 16 bytes in front of and behind the desired array and it works. – Martin Feb 02 '14 at 19:09
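
The alignment question from the comments can be checked directly on the pointer. The following is a small sketch; misalignment16 and requireAligned16 are hypothetical helper names, not part of the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of bytes by which p lies past the previous 16-byte boundary
// (0 means p is suitable for aligned SSE memory operands).
std::size_t misalignment16(const void* p)
{
    return static_cast<std::size_t>(reinterpret_cast<std::uintptr_t>(p) % 16);
}

// Debug-build check that can be placed right after mVal is assigned.
void requireAligned16(const void* p)
{
    assert(misalignment16(p) == 0);
}
```

Applied to the allocation in the question, a 16-byte aligned calloc result plus the 2-byte offset gives misalignment16(mVal) == 2, while a 16-byte offset gives 0.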

1 Answer

When using a memory operand with an SSE instruction such as addpd xmm6, xmmword ptr [rax+rbx*8], the memory operand must be 16-byte aligned. There are unaligned load instructions: you could movdqu from [rax+rbx*8] into a register and then operate on the register. But if you use the memory form, the alignment matters. The optimization flags probably changed the alignment of your array, or the compiler may have folded the load into a memory operand (which is faster in some cases) and caused the problem that way.
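
As an illustration of the difference, here is a sketch using SSE2 intrinsics. The helper names addPairUnaligned and addPairAligned are hypothetical; _mm_loadu_pd is the unaligned load for doubles (compilers typically emit movupd for it), while _mm_load_pd requires a 16-byte aligned address, just like a folded memory operand.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics (x86/x64 only)

// Hypothetical helper: acc[0..1] += src[0..1], valid for any alignment of src.
void addPairUnaligned(double* acc, const double* src)
{
    const __m128d a = _mm_loadu_pd(acc);  // unaligned load
    const __m128d b = _mm_loadu_pd(src);  // unaligned load
    _mm_storeu_pd(acc, _mm_add_pd(a, b));
}

// Hypothetical helper: same arithmetic, but src must be 16-byte aligned.
// With a misaligned src the aligned load faults, as in the question.
void addPairAligned(double* acc, const double* src)
{
    assert(reinterpret_cast<std::uintptr_t>(src) % 16 == 0);
    const __m128d a = _mm_loadu_pd(acc);
    const __m128d b = _mm_load_pd(src);   // aligned load
    _mm_storeu_pd(acc, _mm_add_pd(a, b));
}
```

This only shows the two load forms side by side. In the original code the instruction was chosen by the compiler, so the practical fix there is to keep the array 16-byte aligned.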

Ben Jackson
  • Those folded/fused load instructions can also be slower in some cases. http://stackoverflow.com/questions/21134279/difference-in-performance-between-msvc-and-gcc-for-highly-optimized-matrix-multp. Using fused add/sub + load or mul + load seems to be MSVC's preference but I'm not sure it's a good idea. Unaligned loads are not slower anymore (since Sandy Bridge) anyway. – Z boson Feb 10 '14 at 09:49
  • @Zboson I know that GCC isn't super smart about those. For example, addition is commutative, but only one of the operands to the add intrinsic is eligible to be the memory operand. It does seem to be hit-and-miss as to whether it's faster. – Ben Jackson Feb 10 '14 at 17:22
  • I'm not sure I know what you mean that GCC isn't smart about those. GCC is doing exactly what I want with those intrinsics and nothing more. I prefer that to MSVC's approach which is to try and be clever which ends up being less efficient in my case. Perhaps a better solution would be to have folded/fused intrinsics so the programmer can choose rather than the compiler. Otherwise, the only solution is to write it in assembly. – Z boson Feb 11 '14 at 09:33