My C application (running on Linux) needs at least 1 terabyte of memory; 8 terabytes would be ideal. How can I get that much memory into a single server? One way is to build my own PCI Express card and populate it with 128 GB DDR4 modules, which are the largest modules on the market right now. But then I would have to rewrite all the malloc() calls and all the addressing of the array variables. Is it possible to do this in C in a transparent way? The only approach I can think of is to use the PCI Express memory for arrays only, and to rewrite every function that accesses them like this:

put(huge_array,index,&data); 
get(huge_array,index,&data);

just like get() and put() methods in C++. But my app is written in C, not C++.
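
For illustration, the put()/get() interface described above could look roughly like this in plain C. This is only a sketch: the `huge_array` struct, `huge_array_create()` and `huge_array_destroy()` are hypothetical names, and ordinary heap memory stands in for the memory on the card, so the copy in `put()`/`get()` is a plain `memcpy()` where a real card would need a bus transfer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical handle for an array that lives outside normal RAM.
 * In this sketch the backing store is heap memory standing in for the card. */
typedef struct {
    unsigned char *base;  /* start of the backing store */
    size_t elem_size;     /* size of one element in bytes */
    size_t count;         /* number of elements */
} huge_array;

/* Allocate a handle plus a zero-filled backing store; NULL on failure. */
huge_array *huge_array_create(size_t count, size_t elem_size)
{
    assert(count > 0 && elem_size > 0);
    huge_array *a = malloc(sizeof *a);
    if (a == NULL)
        return NULL;
    a->base = calloc(count, elem_size);
    if (a->base == NULL) {
        free(a);
        return NULL;
    }
    a->elem_size = elem_size;
    a->count = count;
    return a;
}

/* Copy one element from *data into slot `index`; 0 on success, -1 if out of range. */
int put(huge_array *a, size_t index, const void *data)
{
    if (index >= a->count)
        return -1;
    memcpy(a->base + index * a->elem_size, data, a->elem_size);
    return 0;
}

/* Copy slot `index` into *data; 0 on success, -1 if out of range. */
int get(const huge_array *a, size_t index, void *data)
{
    if (index >= a->count)
        return -1;
    memcpy(data, a->base + index * a->elem_size, a->elem_size);
    return 0;
}

/* Release the backing store and the handle. */
void huge_array_destroy(huge_array *a)
{
    if (a != NULL) {
        free(a->base);
        free(a);
    }
}
```

The cost of this design is visible in the signatures: every element access becomes a function call plus a copy, which is exactly the rewrite the question hopes to avoid.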

What other alternatives do I have that are not going to cost me a lot of money plus my shirt?

Nulik
  • Isn't this a server-building question, not a programming question? – user253751 Feb 13 '16 at 03:20
  • A server with 1 terabyte of memory is going to be expensive. I think the current max is 6 terabytes for Intel Xeon E7 v2 servers. – rcgldr Feb 13 '16 at 03:21
  • if it requires modification of GLIBC it is a programming question – Nulik Feb 13 '16 at 03:21
  • I need it the cheapest way possible. A DDR controller is a very cheap ASIC chip; it shouldn't be expensive to put terabytes of RAM in a server – Nulik Feb 13 '16 at 03:23
  • Assuming the server runs in 64-bit mode, why would any OS changes be needed to support 1 TB of RAM? – rcgldr Feb 13 '16 at 03:23
  • 1 TB of RAM is the minimum I could accept, but I need 8 TB of RAM for normal operation – Nulik Feb 13 '16 at 03:23
  • 8 TB? Hmm... Xeon E5 2600 circa 2014 supports up to 1.5 TB, IIRC. You really have to check the Intel datasheet, because that is what the integrated memory controller supports, and beyond that you're out of luck. – Severin Pappadeux Feb 13 '16 at 04:02
  • I'm voting to close this question unless OP proves the problem cannot be translated to a distributed memory model. – user3528438 Feb 13 '16 at 04:29
  • I don't want to translate this to a distributed memory model. This can be achieved in a single server with low latency. Why would I opt for millisecond access time over microsecond access via the PCI Express bus? – Nulik Feb 13 '16 at 04:40
  • Because it's more practical, period. – user3528438 Feb 13 '16 at 04:57
  • You mean microsecond RDMA latencies? If your algorithm is more latency sensitive than that it won't scale over 8 TB NUMA anyways. Plus, distributed gives you more compute per byte. – Jeff Hammond Feb 13 '16 at 06:31
  • However, TBs of shared memory are not as impractical as I thought, because mainframes like System z and its predecessors have been around for half a century, and they thrive because they provide exactly such a solution: a large-scale computer with a centralized OS and a unified address space. So yes, there is demand, there is a market, and there have been solutions. So the question becomes how to build a mainframe with TBs of memory, which IBM has proven is very feasible. – user3528438 Feb 13 '16 at 12:38

2 Answers

Is this possible to do in C in a transparent way?

Yes, it is. There is a good trick in Linux (inherited from Solaris, I believe): you can write your own allocator, put it into a separate shared library, and run

> LD_PRELOAD=mylib.so ./myapp

Another possible solution is to use malloc hooks, [check here](http://www.gnu.org/savannah-checkouts/gnu/libc/manual/html_node/Hooks-for-Malloc.html), but this solution is specific to Linux/GLIBC.

UPDATE

Take a look at Overriding 'malloc' using the LD_PRELOAD mechanism
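
A minimal sketch of what such a preloaded allocator might look like is below. It is an illustration under assumptions, not a production allocator: the names `ARENA_SIZE`, `HEADER` and `arena_alloc()` are made up, anonymous `mmap()` memory stands in for whatever memory the real library would manage, `free()` never reuses memory, and nothing here is thread-safe.

```c
#define _GNU_SOURCE   /* for MAP_ANONYMOUS and MAP_NORESERVE */
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

#define ARENA_SIZE ((size_t)256 << 20)  /* 256 MiB, arbitrary for this sketch */
#define ALIGNMENT  ((size_t)16)         /* every block is 16-byte aligned */
#define HEADER     ((size_t)16)         /* holds the block size for realloc */

static_assert(HEADER % ALIGNMENT == 0, "header must preserve alignment");
static_assert(sizeof(size_t) <= HEADER, "header must hold a size_t");

static unsigned char *arena;  /* start of the mapped region */
static size_t used;           /* bytes handed out so far */

/* Bump allocator: carve the next aligned block out of the arena. */
static void *arena_alloc(size_t size)
{
    if (arena == NULL) {
        void *p = mmap(NULL, ARENA_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (p == MAP_FAILED) {
            errno = ENOMEM;
            return NULL;
        }
        arena = p;
    }
    size_t avail = ARENA_SIZE - used;
    if (avail < HEADER || size > avail - HEADER) {
        errno = ENOMEM;
        return NULL;
    }
    size_t rounded = (size + ALIGNMENT - 1) & ~(ALIGNMENT - 1);
    unsigned char *block = arena + used;
    memcpy(block, &size, sizeof size);  /* remember the size for realloc */
    used += HEADER + rounded;
    return block + HEADER;
}

/* The four entry points a replacement allocator has to provide. */
void *malloc(size_t size) { return arena_alloc(size); }

void free(void *ptr) { (void)ptr; }  /* sketch: memory is never reused */

void *calloc(size_t n, size_t size)
{
    if (size != 0 && n > SIZE_MAX / size) {
        errno = ENOMEM;
        return NULL;
    }
    /* Fresh anonymous pages are zero-filled and never recycled here. */
    return arena_alloc(n * size);
}

void *realloc(void *ptr, size_t size)
{
    void *fresh = arena_alloc(size);
    if (fresh != NULL && ptr != NULL) {
        size_t old;
        memcpy(&old, (unsigned char *)ptr - HEADER, sizeof old);
        memcpy(fresh, ptr, old < size ? old : size);
    }
    return fresh;
}
```

Assuming the file is saved as `alloc.c` (a hypothetical name), it can be built with `gcc -shared -fPIC -o mylib.so alloc.c` and loaded with `LD_PRELOAD=./mylib.so ./myapp`. Because the library decides where every block starts, it also decides the alignment; the 16-byte choice here matches the 128-bit alignment that aligned SSE loads and stores require.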

Severin Pappadeux
  • Will this support the rules of alignment for instructions like MOVNTDQA or similar? Because if it is not aligned to 128 bits, I am going to get a SEGFAULT. It seems that I will have to rewrite all the addressing of the variables, with get() and put() functions, otherwise, how would the CPU address a memory address which goes through the PCI express bus? – Nulik Feb 13 '16 at 04:00
  • @Nulik Why do you need to rewrite addressing? Whatever the interface is (get, malloc), if the data is in your app's address space, it should be transparent to the app. And because you're writing the allocator, whatever alignment you provide is the alignment you get. – Severin Pappadeux Feb 13 '16 at 04:05
  • @Nulik another option for the allocator is to make your card look like a memory-mapped file and provide an `mmap` interface for allocation. GLIBC malloc uses mmap if the allocation is above some configurable limit http://man7.org/linux/man-pages/man3/mallopt.3.html – Severin Pappadeux Feb 13 '16 at 04:09
  • Pappadeux, because malloc() uses a system call to assign physical memory to a virtual address. Using the PCI Express bus for memory allocation will not give a valid address. The address will be on the PCI card rather than at a physical memory location. – Nulik Feb 13 '16 at 04:14
  • @Nulik Aha! Got it. But what app will do with this address? It is not in app address space, so operation like `ptr += 10000000` and dereferencing `*ptr` will produce SEGFAULT. And how much real memory (RAM attached to memory controller) do you have? – Severin Pappadeux Feb 13 '16 at 04:29
  • The idea is to have 1 TB per PCI Express card, and to use get() and put() methods to fetch the chunk that is located on the card. It is like get() and put() methods in C++, where you have to memory-copy everything. With your mmap() solution, will I have to write a kernel driver? Because maybe that could be the way to do this transparently. – Nulik Feb 13 '16 at 04:33
  • @Nulik yes, if such is the case, you'd better go directly with `mmap`. You don't have to do put and get; the VM subsystem should do it for you. Take a look at http://stackoverflow.com/questions/20293005/if-i-have-only-the-physical-address-of-device-buffer-pcie-how-can-i-map-this – Severin Pappadeux Feb 13 '16 at 04:35
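
The `mmap` route discussed in these comments might be sketched as follows. This is a hedged illustration: `map_backing_file()` and `unmap_backing_file()` are hypothetical helpers, and an ordinary file stands in for the card. With real hardware, the path would instead name whatever device file a driver exposes, and the `ftruncate()` call would not apply.

```c
#define _POSIX_C_SOURCE 200809L  /* for ftruncate() under strict C modes */
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Map `bytes` bytes of the file at `path` read/write into this process.
 * The file is created or resized as needed. Returns NULL on failure. */
void *map_backing_file(const char *path, size_t bytes)
{
    assert(path != NULL && bytes > 0);
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t)bytes) != 0) {
        close(fd);
        return NULL;
    }
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    return p == MAP_FAILED ? NULL : p;
}

/* Remove a mapping created by map_backing_file(); 0 on success. */
int unmap_backing_file(void *p, size_t bytes)
{
    return munmap(p, bytes);
}
```

Once the region is mapped, the application uses ordinary pointers and array indexing (`arr[i]`, `ptr += n`, `*ptr`), and the kernel's virtual memory subsystem moves pages in and out on demand, so no put()/get() calls are needed.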

If you can parallelize your work, AWS has r3.8xlarge instances with 244 GiB each for US$2.66 per hour (in the US East region; prices may differ for other regions).
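
As a rough back-of-the-envelope check using only the figures quoted above (244 GiB and US$2.66 per instance-hour), the number of instances and the hourly cost can be worked out like this. The function names are made up for the sketch, and the calculation ignores memory used by the OS, network transfer costs, and any price changes.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest number of instances whose combined memory covers needed_gib. */
size_t instances_needed(size_t needed_gib, size_t gib_per_instance)
{
    assert(gib_per_instance > 0);
    return (needed_gib + gib_per_instance - 1) / gib_per_instance;
}

/* Hourly cost in US cents; integer cents avoid floating-point rounding. */
size_t hourly_cost_cents(size_t instances, size_t cents_per_instance_hour)
{
    return instances * cents_per_instance_hour;
}
```

With these inputs, 1 TiB (1024 GiB) needs 5 instances at about US$13.30 per hour, and 8 TiB (8192 GiB) needs 34 instances at about US$90.44 per hour.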

Conversely, if you're not in a hurry, you could use a server with less memory but 1+ TiB of swap, without having to change malloc.
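
Before relying on swap, it may help to check how much RAM plus swap the machine actually has, since a single allocation far beyond that total can be refused depending on the kernel's overcommit settings. A small sketch using the Linux `sysinfo()` call is below; the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/sysinfo.h>

/* Total RAM plus total swap in bytes, or 0 if the query fails. */
uint64_t ram_plus_swap_bytes(void)
{
    struct sysinfo si;
    if (sysinfo(&si) != 0)
        return 0;
    /* totalram and totalswap are counted in units of mem_unit bytes. */
    return ((uint64_t)si.totalram + (uint64_t)si.totalswap)
           * (uint64_t)si.mem_unit;
}

/* Returns 1 if `bytes` is no larger than RAM plus swap, otherwise 0. */
int fits_in_ram_plus_swap(uint64_t bytes)
{
    assert(bytes > 0);  /* a zero-byte request is treated as a caller bug */
    return bytes <= ram_plus_swap_bytes();
}
```

For example, `fits_in_ram_plus_swap((uint64_t)1 << 40)` asks whether a 1 TiB working set could fit at all; it says nothing about how slow the run would be once most accesses hit swap.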

dan4thewin
  • Ooh, no, my company competes with these guys; I cannot use Amazon. – Nulik Feb 13 '16 at 04:29
  • there are other cloud providers - just seems to me that renting the time, even if I had to change parts of my algorithm would be cheaper than a hardware solution. – dan4thewin Feb 13 '16 at 04:32