
In my program I want to allocate 32-byte-aligned memory to use SSE/AVX. The amount I want to allocate is somewhere around 2000*1300*17*17*4 (large data set). I tried using the functions _aligned_malloc() and _mm_malloc(), but for larger sizes they fail to allocate memory, which results in an access violation exception. If the amount allocated is small, around 512*320*4*17*17 (small data set), then the code works fine.

These functions return a null pointer when the allocation is done for the large data set, but work fine when the input data size is small. If I just use unaligned memory allocation with new, the code works fine for the large data set too.
Finally, can someone tell me whether there are any significant performance gains from using aligned memory with AVX?

Edit: After some research, according to this post, new allocates memory from the free store and malloc() allocates memory from the heap. Here I am exceeding the maximum heap size, as _aligned_malloc() sets errno to 12, which means ENOMEM. Can someone suggest a workaround for this?

Aliya Clark
  • That's a little less than 10MB. How much (virtual?) memory do you have in your system? Can the system guarantee that it can allocate a contiguous chunk of memory (which is probably the problem you have)? – Some programmer dude Mar 01 '17 at 17:50
  • Oh and you *do* check if the allocation function returns a null pointer? – Some programmer dude Mar 01 '17 at 17:55
  • I have 12GB of memory. I am sorry, I need more than 10MB. It's not 2000*1300, it's 2000*1300*17*17. If I use new it works fine and there is no issue. However, memory usage goes to its peak, around 11.9. In advanced settings it says "Total paging file size for all drives is 15247MB", which is system managed for drive C only. – Aliya Clark Mar 01 '17 at 17:57
  • Yes it returns a null pointer. Thanks for the tip. I'll modify the question. – Aliya Clark Mar 01 '17 at 17:59
  • SSE/AVX requires alignment of 32? Can you put that in the question? – Mooing Duck Mar 01 '17 at 18:06
  • What happens if you try to use regular malloc of `2000*1300*17*17 + 32`? If that fails, you know that it's a memory issue, and not an alignment issue. – Mooing Duck Mar 01 '17 at 18:07
  • MSDN says `_aligned_malloc` sets `errno` too. Assuming that's the right platform, was it set to ENOMEM or EINVAL? – Useless Mar 01 '17 at 18:07
  • @MooingDuck It worked fine when I allocated memory with `new`. But `malloc()` doesn't work; it returns a null pointer. – Aliya Clark Mar 01 '17 at 19:11
  • @Useless it returns `ENOMEM`. Can someone please suggest a solution for this? Is there a way to allocate memory from the free store? – Aliya Clark Mar 01 '17 at 19:29
  • @AliyaClark: I find it very surprising that `new` can allocate this ~3Gb, but `malloc` can't. More likely it's time-based, rather than the code itself. It depends on what other programs are running in the background. My advice would be to replace your algorithm with something that doesn't require 3Gb of contiguous memory. Prefer working with blocks of ~16Mb at a time. Consider using a file-backed memory map, and only map ~16Mb at a time. – Mooing Duck Mar 06 '17 at 18:32

1 Answer


On memory allocation:

It seems you are actually trying to allocate 2000*1300*17*17*4 elements of 32 bytes each. This means you are trying to allocate about 96 GB, while your system has only 12 GB of memory.
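The size arithmetic can be checked at compile time. This is only a sketch: the names `kElements` and `total_bytes` are made up for illustration, and the 32-byte case assumes each element is one `__m256`.

```cpp
#include <cstdint>

// Element count from the question: 2000*1300*17*17*4.
// The ULL suffix forces 64-bit arithmetic; the product does not fit in a 32-bit int.
constexpr std::uint64_t kElements = 2000ULL * 1300 * 17 * 17 * 4;

// Total size in bytes if every element occupies `bytes_per_element` bytes.
constexpr std::uint64_t total_bytes(std::uint64_t bytes_per_element) {
    return kElements * bytes_per_element;
}

// About 3.0 GB if the product is already a byte count.
static_assert(total_bytes(1) == 3005600000ULL, "byte-count interpretation");
// About 96 GB if each element is a 32-byte vector (assumed to be __m256).
static_assert(total_bytes(32) == 96179200000ULL, "32-byte-element interpretation");
```

Which of the two numbers applies depends on what the product in the question actually counts.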

Since new works but malloc does not, it seems your local implementation of new is able to allocate huge amounts of virtual memory. malloc allocates from the heap, which means it is usually limited to the amount of physical memory you have. That's the reason it fails.
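Whatever the cause of the failure, the returned pointer should be checked before use. Here is a minimal sketch of an aligned allocation with that check. It uses `std::aligned_alloc` (C++17) as a portable stand-in for `_aligned_malloc`/`_mm_malloc`, which is an assumption on my side (MSVC does not provide it), and the helper names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Hypothetical helper: allocate `bytes` with 32-byte alignment.
// Returns nullptr if the request is empty or the allocation fails.
void* alloc_aligned32(std::size_t bytes) {
    constexpr std::size_t kAlign = 32;
    if (bytes == 0) return nullptr;
    // std::aligned_alloc expects the size to be a multiple of the alignment.
    const std::size_t rounded = (bytes + kAlign - 1) / kAlign * kAlign;
    return std::aligned_alloc(kAlign, rounded);
}

// True if `p` sits on a 32-byte boundary.
bool is_aligned32(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 32 == 0;
}
```

A caller would test the result against `nullptr` before touching the memory and release it with `std::free`. Dereferencing an unchecked null result is what typically shows up as an access violation.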

As the dataset is bigger than your main memory, you might want to allocate the memory using mmap, which maps a file into virtual memory, making it accessible as if it were in physical memory (but it will only partially be cached in memory). I'm not sure if it's guaranteed, but mmap usually aligns on the optimal page size boundary (almost always 4096 bytes).
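A minimal sketch of the file-backed approach, assuming a POSIX system (the question appears to target Windows, where the corresponding API would be `CreateFileMapping`/`MapViewOfFile`, not shown here). The helper names, the `float` element type and the file path are hypothetical.

```cpp
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: create or open `path`, size it for `count` floats,
// and map it read/write. Returns nullptr on any failure.
float* map_file_as_floats(const char* path, std::size_t count) {
    const std::size_t bytes = count * sizeof(float);
    const int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd == -1) return nullptr;
    if (ftruncate(fd, static_cast<off_t>(bytes)) == -1) {
        close(fd);
        return nullptr;
    }
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    return p == MAP_FAILED ? nullptr : static_cast<float*>(p);
}

// Release a mapping created by map_file_as_floats.
void unmap_floats(float* p, std::size_t count) {
    if (p != nullptr) munmap(p, count * sizeof(float));
}
```

The address returned by `mmap` is page-aligned, which on common systems (4096-byte pages) also satisfies the 32-byte requirement for AVX. Mapping only a window of the file at a time, as suggested in the comments, would use the `offset` argument of `mmap`.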

Anyway you will have a huge performance loss due to the fact that your disk is way slower than your RAM. This is so serious that using AVX will probably not speed up anything at all.

On the performance loss of using unaligned memory:

On modern hardware (say Intel's Haswell onwards, I think) this depends on your access patterns. Unaligned access should have almost no performance overhead when iterating over the array in memory order (each cache line will still be loaded only once). If you access it in random order, then you will often cross a 64-byte cache line boundary. This means your processor will have to load two lines into the cache and evict two lines from it instead of only one. While this might be a serious problem in some situations, in your case the disk will slow things down so much that you will barely notice it.
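To make the cache-line argument concrete, here is a small sketch that tests whether an access of a given width straddles a 64-byte line. The function name is made up for illustration, and the 64-byte line size is the one assumed in the paragraph above.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;  // assumed cache line size in bytes

// True if an access of `access_bytes` (at least 1) starting at `address`
// touches two different cache lines.
constexpr bool crosses_cache_line(std::uintptr_t address, std::size_t access_bytes) {
    return address / kCacheLine != (address + access_bytes - 1) / kCacheLine;
}

// A 32-byte load from a 32-byte-aligned address always stays within one line.
static_assert(!crosses_cache_line(0, 32), "aligned load, first half of a line");
static_assert(!crosses_cache_line(32, 32), "aligned load, second half of a line");
// A 32-byte load from an unaligned address can straddle two lines.
static_assert(crosses_cache_line(48, 32), "bytes 48..79 span lines 0 and 1");
```

In a sequential sweep the second line is needed by the next iteration anyway, which is why the extra cost mostly shows up with random access.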

Additional tips (or a shot in the dark):

The way you gave the size of the array (2000*1300*17*17*4) suggests that you are using a multidimensional array (e.g. auto x = new __m256[2000][1300][17][17][4]). So some tips on that:

  • Iterate through it mostly sequentially.
  • Check if it is sparse (meaning some of the memory will never be accessed) and shrink it if possible.

You could try to flatten the array and do the more complex index calculation yourself in order to reduce the amount of memory needed. If you get it to fit completely into your RAM, you can start to optimise your code (using AVX and/or aligned memory).
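A minimal sketch of the manual index calculation for a flattened array. The five extents are taken from the size expression in the question and are an assumption about the real layout; `flat_index` is a hypothetical name.

```cpp
#include <cstdint>

// Assumed extents, read off the expression 2000*1300*17*17*4.
constexpr std::uint64_t kD0 = 2000, kD1 = 1300, kD2 = 17, kD3 = 17, kD4 = 4;

// Row-major flat index: the last index varies fastest, so iterating i4 in the
// innermost loop walks memory sequentially. No bounds checking is done.
constexpr std::uint64_t flat_index(std::uint64_t i0, std::uint64_t i1,
                                   std::uint64_t i2, std::uint64_t i3,
                                   std::uint64_t i4) {
    return (((i0 * kD1 + i1) * kD2 + i2) * kD3 + i3) * kD4 + i4;
}

static_assert(flat_index(0, 0, 0, 0, 0) == 0, "first element");
static_assert(flat_index(0, 0, 0, 1, 0) == 4, "one step in i3 skips kD4 elements");
static_assert(flat_index(1999, 1299, 16, 16, 3) == kD0 * kD1 * kD2 * kD3 * kD4 - 1,
              "last element");
```

With a single flat buffer the index function is the only place that knows the layout, so dropping an unused dimension or packing a sparse region means changing this one function.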

"Total paging file size for all drives is 15247MB" suggests that you are actually using only parts of that 96 GB, so there might be a way to further reduce your usage.

In that case you might also want to ask another question on how to reduce the memory usage with more info on what you are doing.

Christoph Diegelmann