
I'm looking for a way to define a lazy pointer whose data, say ptr[i], is generated only when it is accessed. That is, before ptr[i] is accessed, the data exists neither in memory nor anywhere else. When ptr[i] is accessed, a callback function should be invoked to produce the value of ptr[i].

I want this pointer because I need to pass it to a C-style function in a third-party library (e.g. mean(double * ptr, size_t n), which computes the mean of a vector), so it must be a pointer and cannot be of any other type. However, the data (possibly just random data for a simulation) is extremely large and does not fit into memory. For example, I want to simulate 100GB of random double values, pass them to a mean function to compute their mean, and repeat the simulation 100 times.

The idea of lazy pointers may sound weird, but it should be possible, since I know it can be implemented with a virtual file system and a memory-mapped file. For example, I can define a few callback functions to create a virtual disk. The files on my virtual disk look like real files, but their data is actually generated by my callback functions. Then I can memory-map a virtual file to get a pointer to it. That way every access to ptr[i] is handled by the system and passed on to my predefined callback functions, which gives me a lazy pointer. However, this implementation is more complicated than I expected and requires dependencies (Dokan on Windows and FUSE on Linux). I hope there is a simpler way to do it.

Jonathan Leffler
Jeff
  • Create a class that has an `operator[]()` that resizes if given an invalid index, and returns a reference to the appropriate (after resizing) element. Giving it other operators to support other operations is trivial, but care is needed to avoid unintended implicit conversions. All easy to achieve with a class that wraps a `std::vector`. Using a memory mapped file can be handled by a custom allocator for the vector. – Peter Jul 26 '20 at 04:47
  • Thanks @Peter for your answer, but what I need is a `double *` pointer. It cannot be any other type, because this is what the third-party function requires. – Jeff Jul 26 '20 at 05:09
  • A class can support an `operator double *()` that returns a pointer to a class's internals, and can be passed to your third party function. I was hinting, rather obliquely, at that in the comment about unintended implicit conversions - a simple named getter can do the same thing if implicit conversions are a concern. If using a `std::vector` named `v`, then `&v[0]` gives a pointer to that vector's internal data which can (say) be passed to C functions which know nothing about the C++ standard library. – Peter Jul 26 '20 at 05:43
  • If it is just *mean* you're calculating, then you do not need an *array*, you just need to implement a single-pass algorithm... – Antti Haapala -- Слава Україні Jul 26 '20 at 06:29
  • are you asking about interface or implementation? C or C++? a smart pointer or an array-like indexed type? – Red.Wave Jul 26 '20 at 08:26
  • @Red.Wave I am asking about the interface. I am not looking for any struct or class definition, because the only acceptable data type is `double *`; I just want a pointer that behaves like a complicated class object. Both C and C++ solutions are fine, but I suspect the right answer is based on a C interface. – Jeff Jul 26 '20 at 15:06
  • C is all about implementation. There is not much you can do about decoupling interface from implementation. In C++ you should decide what you need: a container or a smart pointer. And customizing a proper interface won't interfere with the implementation. So, for quicker and more accurate replies, please mind the distinction between the two languages and choose proper and accurate terms. – Red.Wave Jul 26 '20 at 17:26

2 Answers


The easiest solution would be to rewrite the 3rd-party library.

Other than that, you could protect the memory (mprotect on Linux, and its equivalent on Windows) and initialize each page as it is accessed. This requires a lot of work, though: you would need to write a signal handler for SIGSEGV...

It is very tricky to get this right, however. If the SIGSEGV occurs while the 3rd-party library is inside a non-reentrant C library function, then the code that generates the data cannot use any of those same functions, and so on. Your data generator would also need to run within the signal handler.

A similar thing is achievable on Windows, but I do not know how. I just know it is, because a while back I researched the Unix solution (SIGSEGV interception + mprotect) as a counterpart to some working Windows code :D
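A minimal sketch of the mprotect + SIGSEGV idea on Linux might look like this. Everything named here (`make_lazy`, `generate`, the `g_*` globals) is hypothetical, and the local `mean` is only a stand-in for the third-party function. The sketch is single-threaded, relies on mprotect being callable from a signal handler (it is not on the POSIX async-signal-safe list, although this commonly works on Linux), and never releases pages, so a real 100GB run would additionally have to give pages back, for instance with madvise.

```cpp
#include <cstddef>
#include <cstdint>
#include <signal.h>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical globals describing the lazy region.
static double* g_base = nullptr;
static std::size_t g_bytes = 0;
static std::size_t g_page = 0;
static volatile std::size_t g_faults = 0;  // pages filled so far

// Hypothetical generator for element i. It runs inside the signal handler,
// so it must not call non-reentrant library functions.
static double generate(std::size_t i) { return static_cast<double>(i); }

static void on_segv(int, siginfo_t* info, void*) {
    auto addr = reinterpret_cast<std::uintptr_t>(info->si_addr);
    auto lo = reinterpret_cast<std::uintptr_t>(g_base);
    if (addr < lo || addr >= lo + g_bytes) _exit(1);  // a genuine crash

    // Round down to the start of the faulting page and make it accessible.
    std::uintptr_t page = addr & ~(static_cast<std::uintptr_t>(g_page) - 1);
    if (mprotect(reinterpret_cast<void*>(page), g_page,
                 PROT_READ | PROT_WRITE) != 0) _exit(2);

    // Fill the whole page; the faulting load is retried when we return.
    double* p = reinterpret_cast<double*>(page);
    std::size_t first = (page - lo) / sizeof(double);
    for (std::size_t k = 0; k < g_page / sizeof(double); ++k)
        p[k] = generate(first + k);
    g_faults = g_faults + 1;
}

// Reserve address space for n doubles with no access rights and install
// the handler. Returns nullptr on failure.
double* make_lazy(std::size_t n) {
    g_page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    g_bytes = ((n * sizeof(double) + g_page - 1) / g_page) * g_page;
    void* m = mmap(nullptr, g_bytes, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (m == MAP_FAILED) return nullptr;
    g_base = static_cast<double*>(m);

    struct sigaction sa {};
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, nullptr) != 0) return nullptr;
    return g_base;
}

// Stand-in for the third-party C function from the question.
double mean(const double* ptr, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) s += ptr[i];
    return s / static_cast<double>(n);
}
```

Calling `mean(make_lazy(n), n)` then touches the region page by page, and each first touch triggers the handler. The caveats from the answer still apply in full: if the library faults while inside a non-reentrant function, or if several threads fault at once, this sketch is not safe.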

  • Thank you! I feel this might be the right way to go. Would you be able to expand your answer a little and provide more info on it? I am not familiar with this area, so it would be perfect if you could provide more terminology (or even a link, if you do not mind) for me to search for. – Jeff Jul 26 '20 at 05:14
  • If you mean the *latter* part, it is pretty much *impossible* to get right. I am not sure you want to go there. Even the FUSE filesystem is *easier*. No, I have not done it myself, I know that there is a possibility for limited success, and I do not know how well it works... Especially if the program uses threads. – Antti Haapala -- Слава Україні Jul 26 '20 at 05:17
  • @Jeff I also think you should strongly consider rewriting the library; it looks way easier than writing a FUSE file system or messing with the kernel. You can write the function to accept a generic type (template), and then it's easy to mock the data the way you want. – bolov Jul 26 '20 at 05:39
  • @AnttiHaapala Thank you very much for your answer! It is very helpful. I know it is difficult to implement but it should be a fun thing to explore. – Jeff Jul 26 '20 at 15:10
  • @bolov Thanks for your comment, but unfortunately using a pointer to access data has become a standard operation in my working environment (to be more specific, the `R` language). Many packages use pointers to access data, and rewriting them would take a lot of effort. – Jeff Jul 26 '20 at 15:15
  • @Jeff happy hacking then :) – bolov Jul 26 '20 at 20:08

I'm not sure this is what you need, but you could use an admittedly overly complex design in which operator[] is overloaded to write changes to a file, and the values are read into memory only when the actual call is made.

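#include <cstddef>

// Assumed to be provided elsewhere: the third-party function and a file reader.
double mean(double *ptr, size_t n);
void readFromFile(double **p, size_t *n);

class MyLazyPointer {
public:
    void operator[](const unsigned int i)
    {
        // write A[i] = ... to a file instead of keeping it in memory
    }
    double mean()
    {
        double *p;
        size_t n;
        readFromFile(&p, &n); // load the data only when it is actually needed
        return ::mean(p, n);  // :: selects the free function, not this member
    }
};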

Maybe you can explain what you really need: to save memory space? Something else?

OrenIshShalom
  • Thanks for your answer. My primary focus is on saving memory space. I am writing a wrapper for a third-party library. My input is just a user-provided function that reads a range of the data, so it is possible that the data is generated in real time and is not an on-disk file. I also do not have control over the third-party library, so I must have a `double* ptr` to pass to the third-party functions. Since saving memory is the main goal, I would prefer not to read the entire data set into memory. – Jeff Jul 26 '20 at 05:00