5

I am currently working on creating my own persistent array in java, which uses a binary search tree to store an collection of values.

I want to add a map method that takes a Function as an argument to generate a new array. I don't want to evaluate the functions unless the particular value is requested. This is pretty simple to do as lambdas are lazy evaluated. However, I also only want a function to be evaluated only once, even if the result is requested multiple times.

I could create a node that stores the supplier and updates the result when evaluated:

class Node<T> {

    private T value;
    private Supplier<T> supplier;

    public T get() {
        if (null != value)
            return value;
        value = supplier.get();
        return value;
    }
}

...where supplier is derived from a Function being applied to a value in an older version of the persisted array.

However, this is no longer a functional approach, and has the potential to cause errors in a multi-threaded system*. It also doesn't yield an advantage in the case where supplier returns a null value**.

Another approach would be return an instance of Node on a get call:

class Node<T> {

    private final Optional<T> value;
    private final Supplier<T> supplier;

    Node(Supplier<T> supplier, T value) {
        this.supplier = supplier;
        this.value = Optional.ofNullable(value);
    }

    public Tuple<Node<T>, T> get() {
        if (null != value)
            return new Tuple<>(this, value.orElse(null));
        T result = supplier.get();
        Node<T> newNode = new Node<>(null, result);
        return new Tuple<>(newNode, result);
    }
}

I like this approach for staying functional; but it would require a lot of overhead in all the parent nodes going up the tree for just a get call. And it would require cumbersome unboxing in the using application's code.

Does anyone have another approach they can think of to make this work the way I'm asking? Thanks.

*This could be solved using locking mechanisms, but adds a layer of complexity I'm hoping to avoid.

**I've thought about making value an Optional<T>, where null means hasn't been evaluate, and Optional.empty() as has been evaluated and returns a null. However, this works around my issue, rather than solving it.

For anyone not familiar with a persisted array, it's a data structure that creates a new instance of itself whenever an update is performed. This allows it to be purely immutable. Using a binary tree (or more common 32-bit tree) method allows the updates to reduce duplicating data, both in speed and memory.

EDIT:

The code for the collection can be found on github. For description of use, you can look in the test folder.

JRogerC
  • 598
  • 6
  • 17
  • 1
    I have not heard about `persisted array` but that sounds a lot like `CopyOnWriteArrayList` – Eugene Jun 12 '17 at 14:00
  • @Eugene I didn't know about CopyOnWriteArrayList. This is similar, but CopyOnWriteArrayList (from what I gather in the oracle docs) doesn't have the cost saving measures. – JRogerC Jun 12 '17 at 14:09
  • *cost saving measures*? what does that mean? – Eugene Jun 12 '17 at 14:12
  • The wikipedia article (https://en.wikipedia.org/wiki/Persistent_data_structure#Trees) does a decent job of explaining the approach I use, and its advantages. There are also some awesome youtube videos. If I can find them again, I post them. – JRogerC Jun 12 '17 at 14:18
  • 1
    "However, I also only want a function to be evaluated only once, even if the result is requested multiple times." - what you're asking is inherently imperative. In pure functional programming you have no way to know whether a computation was performed for the second time or not. – yeputons Jun 12 '17 at 14:32
  • Also, I do not see why interface of your first approach is "not functional". Your first `Node` is basically the identity function if you look at values returned only. FP and immutability do not prohibit you from keeping references to old data (they actually encourage). – yeputons Jun 12 '17 at 14:34
  • Finally, if you want something multithreaded, you'll have to either go with some synchronization primitives anyway (be it locks or CAS) or accept that a computation could be done several times concurrently. – yeputons Jun 12 '17 at 14:37
  • I suggest you don't use the word "array" to describe something that's not a Java array. Although in other languages "array" is used for lists, in the context of JAva it just leads to confusion. – slim Jun 12 '17 at 14:46
  • 1
    [Lazy field initialization with lambdas](https://stackoverflow.com/q/29132884/2711488)… keep in mind that a reference captured by a lambda expression is like a `final` field, so while the write to the field holding the lambda is not guaranteed to be immediately visible to other threads, the values are guaranteed to be consistent *if* seen by other threads. So in the worst case, every thread would initialize such a value one time, not seeing the computed value of other threads. If the actual calculation is functional, that shouldn’t matter. – Holger Jun 12 '17 at 15:04
  • @slim, I was actually thinking about that as I wrote the post. I tried to only use the word array when referring to the data structure as it's referenced in language agnostic cases. I'm glad your post is there for others to see. – JRogerC Jun 12 '17 at 17:08

3 Answers3

4

Disclaimer: this answer doesn't respond the question directly, as it doesn't use neither Supplier nor Optional directly in the Node class. Instead, a generic functional programming technique is presented that might help solve the problem.


If the problem is all about evaluating the function only once for each input value, then you shouldn't change your tree/array/nodes. Instead, memoize the function, which is a pure functional approach:

In computing, memoization or memoisation is an optimization technique used primarily to speed up computer programs by storing the results of expensive function calls and returning the cached result when the same inputs occur again

Here's a way to do it, inspired by this excellent article written by Pierre-Yves Saumont (please check it for an in-depth introduction to memoization):

public static <T, U> Function<T, U> memoize(Function<T, U> function) {
    Map<T, U> cache = new ConcurrentHashMap<>();
    return input -> cache.computeIfAbsent(input, function::apply);
}

Suppose you have a method that takes quite long to execute. Then, you could use the memoize method this way:

// This method takes quite long to execute
Integer longCalculation(Integer x) {
    try {
        Thread.sleep(1_000);
    } catch (InterruptedException ignored) {
    }
    return x * 2;
}

// Our function is a method reference to the method above
Function<Integer, Integer> function = this::longCalculation;

// Now we memoize the function declared above
Function<Integer, Integer> memoized = memoize(function);

Now, if you call:

int result1 = function.apply(1);
int result2 = function.apply(2);
int result3 = function.apply(3);
int result4 = function.apply(2);
int result5 = function.apply(1);

You'll notice that the five calls take ~5 seconds altogether (1 second for each call).

However, if you use the memoized function with the same input values 1 2 3 2 1:

int memoizedResult1 = memoized.apply(1);
int memoizedResult2 = memoized.apply(2);
int memoizedResult3 = memoized.apply(3);
int memoizedResult4 = memoized.apply(2); // <-- returned from cache
int memoizedResult5 = memoized.apply(1); // <-- returned from cache

You'll notice that now the five calls take ~3 seconds altogether. This is because the last two results are immediately returned from the cache.


So, back to your structure... Inside your map method, you could just memoize the given function and use the returned memoized function instead. Internally, this will cache the function's return values in the ConcurrentHashMap.

As the memoize method uses a ConcurrentHashMap internally, you don't need to worry about concurrency.

Note: This is just the beginning... I'm thinking about two possible improvements here. One would be to limit the size of the cache, so that it doesn't take the whole memory if the domain of the given function is too big. The other improvement would be to memoize the given function only if it hasn't been memoized previously. But these details are left as an exercise for the reader... ;)

fps
  • 33,623
  • 8
  • 55
  • 110
  • Just want to make sure I understand your code properly: You're wrapping the Memoizer around a single Function, correct? The original question stated that memoization was needed around `Supplier`s (The result of a `Function` applied to a static value). In that case, the use of a map may be overkill, if not unnecessary. – JRogerC Jun 12 '17 at 23:41
  • @JRogerC I think I'm not following you... The question says that you want to have a map operation that receives a function in order to create a new structure, but you don't want to evaluate the function more than once for the same input value. This code achieves that, without using suppliers or Optional inside your nodes. – fps Jun 12 '17 at 23:50
  • @JRogerC just to clarify... Memoizer is a higher-order function that takes one function as an argument and returns another function that uses the original function to calculate the result, but stores it in a cache map before returning it. The idea is that whenever you have a function, *any* function, you can replace the original function with the memoized one. Values returned by the function will be first calculated by using the original function, and subsequently returned from the cache. Sorry, but I don't understand why you need a supplier inside your nodes. – fps Jun 13 '17 at 00:14
  • But you don’t need that verbose `Memoizer` class, a single `static` method would be enough: `public static Function memoize(Function function) { Map cache = new ConcurrentHashMap<>(); return input -> cache.computeIfAbsent(input, function::apply); }`; the rest is just noise. – Holger Jun 13 '17 at 06:54
  • @FedericoPeraltaSchaffner, I was thinking about things on a per-node basis. Since each node only holds a single, static value, I wouldn't need to store more than one result for the function. But reading through your comment, I see that this solution is not node focused, but wholistically focused. if multiple nodes have the same value, this would be of even greater benefit. I apologize for not seeing what you were doing. – JRogerC Jun 13 '17 at 15:42
  • @JRogerC Thank you, yes, my idea with this answer was to give some background. Sorry, I know I didn't answer exactly what you were asking. You were focused on caching node values, while I was focused only in a function. My answer is quite general, and I admit that it might not respond your question directly. I will edit it as per Holger's comment, and I will clarify that this is a general technique that doesn't answer your question directly. Thank you very much. – fps Jun 13 '17 at 16:22
2

I also only want a function to be evaluated only once, even if the result is requested multiple times.

How about this?

class Node<T> {
    private Supplier<T> supplier;

    Node(T value, Supplier<T> supplier) {
        this.supplier = sync(lazy(value, supplier));
    }

    public T get() {
        return supplier.get();
    }
}

the sync method only synchronizing the Supplier once, when the target is called, the lock is disabled for the next continuous requests:

static <T> Supplier<T> sync(Supplier<T> target) {
    return sync(new ReentrantLock(), target);
}

static <T> Supplier<T> sync(ReentrantLock lock, Supplier<T> target) {
    //     v--- synchronizing for multi-threads once
    return lazy(() -> {
        // the interesting thing is that when more than one threads come in here
        // but target.delegate is only changed once
        lock.lock();
        try {  
            return target.get();
        } finally {
            lock.unlock();
        }
    });
}

the lazy method calls a given Supplier only once as below:

static <T> Supplier<T> lazy(T value, Supplier<T> defaults) {
    return lazy(() -> value != null ? value : defaults.get());
}

static <T> Supplier<T> lazy(Supplier<T> target) {
    return new Supplier<T>() {
        private volatile Supplier<T> delegate = () -> {
            T it = target.get();
            //v--- return the evaluated value in turn
            delegate = () -> it;
            return it;
        };

        @Override
        public T get() {
            return delegate.get();
        }
    };
}

IF you interested in how the final code was made, I have commit the code into github, you can just copy and use it. and you can found I have renamed lazy methods to once, which is more expressiveness.

holi-java
  • 29,655
  • 7
  • 72
  • 83
1

Disclamer: this is not a direct solution to what you are asking, but this answer presents a solution that's not mentioned in other answers and is worth trying.

You can use Suppliers#memoize from Google-guava library.

It will save you from the problem that you are facing when your Supplier returns null and it's also thread-safe.

Also note that the Supplier's memoize method returns com.google.base.Supplier which extends java.util.Supplier so you can use it to assign it to java.util.Supplier so that you don't force your clients (who will use your library) to depend on Guava library.

Lavish Kothari
  • 2,211
  • 21
  • 29