
I have a List<Long> of elements. The list is huge, around 1M records. I am trying to partition the list and get the min and max for each partition, which I will then pass to the DB queries.
I am currently doing this in the following way:

int gridSize = 6;
List<Long> idList = findAllIds();
List<List<Long>> partitionedList = ListUtils.partition(idList, (idList.size() + gridSize - 1) / gridSize);
for(List<Long> ids : partitionedList) {
    LongSummaryStatistics idStats = ids.stream().collect(Collectors.summarizingLong(e -> e));
    repo.callToDb(idStats.getMin(), idStats.getMax());
}

The above code works fine, but I am unnecessarily collecting all the keys into a list, then partitioning them (which stores them in yet another list), and then looping through the partitions to get the idStats.

Is there a more efficient way to do this, without creating all these lists and consuming so much memory?

John Kugelman
Vamsi
  • Are you referring to this [ListUtils](https://commons.apache.org/proper/commons-collections/apidocs/org/apache/commons/collections4/ListUtils.html) (which is part of [Collections](https://commons.apache.org/proper/commons-collections/) subproject of Apache Commons)? – Abra Aug 20 '23 at 03:55
  • If you are using ListUtils from Apache commons, then the inner lists are just *views* over the underlying list and not copies of the elements. – Thiyagu Aug 20 '23 at 03:58
  • Btw is `gridSize` supposed to denote the number of elements in a partition? Currently, the way you derive the `size` parameter (second argument to the `partition` method), the partition size is always 6 (`gridSize` value). Is this the intention? – Thiyagu Aug 20 '23 at 04:11
  • I think you need to explain what you are trying to do, and what the DB operation is. – tgdavies Aug 20 '23 at 05:27

4 Answers


It is easy to prove that the problem is O(N). Every one of the N elements in the initial list could be either the minimum or maximum. So every one of them needs to be tested.

Therefore, for a single-threaded algorithm there is no performance benefit in partitioning the list ... compared to scanning the entire list in a single pass. If you have the original list in memory, partitioning it and scanning the partitions does not improve anything.

If you are going to use multiple threads, then partitioning the list and scanning each partition with a different thread is potentially beneficial. (You need to be able to create the partitions without copying the list. That can be done efficiently using subList(...) ... at least with an ArrayList.)

If you are going to use streams, you don't need to do the partitioning yourself. A simpler way is to use parallel() and leave it to the stream infrastructure to do the partitioning ... which the spliterator can do without creating new lists for the partitions.

There isn't an explicit statement in the javadocs that the summarizing collectors will operate in parallel on a parallel() stream. However the javadoc for LongSummaryStatistics says:

"This implementation is not thread safe. However, it is safe to use Collectors.summarizingLong() on a parallel stream, because the parallel implementation of Stream.collect() provides the necessary partitioning, isolation, and merging of results for safe and efficient parallel execution."

In short:

List<Long> ids = findAllIds();
LongSummaryStatistics idStats = ids.stream()
        .parallel()
        .collect(Collectors.summarizingLong(e -> e));
repo.callToDb(idStats.getMin(), idStats.getMax());

(There is no need to use atomic types to do this because of the guarantees provided by the javadocs.)


On the other hand, if the real purpose of the partitioning is to partition the database queries, then your best bet would be to partition using subList(...) and then submit tasks to an ExecutorService that will scan each partition and then do the DB query.
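A minimal sketch of that last suggestion, assuming subList views plus a fixed thread pool. The min/max pairs are returned to the caller here; in the question's code each pair would instead be passed to the (hypothetical, not shown) repo.callToDb(min, max):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.LongSummaryStatistics;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;

public class PartitionScan {

    // Scans each subList view on its own thread and returns a {min, max} pair per partition.
    static List<long[]> minMaxPerPartition(List<Long> idList, int gridSize) throws Exception {
        int size = (idList.size() + gridSize - 1) / gridSize; // elements per partition
        ExecutorService pool = Executors.newFixedThreadPool(gridSize);
        try {
            List<Future<long[]>> futures = new ArrayList<>();
            for (int start = 0; start < idList.size(); start += size) {
                // subList returns a view over idList: no elements are copied.
                List<Long> part = idList.subList(start, Math.min(start + size, idList.size()));
                futures.add(pool.submit(() -> {
                    LongSummaryStatistics s =
                            part.stream().collect(Collectors.summarizingLong(e -> e));
                    return new long[] { s.getMin(), s.getMax() };
                }));
            }
            List<long[]> result = new ArrayList<>();
            for (Future<long[]> f : futures) {
                result.add(f.get()); // propagates any failure from a worker thread
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<Long> ids = new ArrayList<>();
        for (long i = 1; i <= 9; i++) ids.add(i);
        for (long[] mm : minMaxPerPartition(ids, 3)) {
            System.out.println(mm[0] + ".." + mm[1]); // each range would feed one DB query
        }
    }
}
```

The DB calls themselves could also be submitted to the same pool, so the scan and the query for each partition run on one task.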

Stephen C

I think the best time complexity you can achieve here is O(n).

E.g. in case you have a List<Long> with 1M elements (which, by the way, is not really big data):

@Getter
@RequiredArgsConstructor
public final class LongSummaryStatistics {
    private final long min;
    private final long max;
}

public static LongSummaryStatistics calcMinMax(List<Long> ids) {
    long min = Long.MAX_VALUE;
    long max = Long.MIN_VALUE;

    for (long id : ids) {
        min = Math.min(min, id);
        max = Math.max(max, id);
    }

    return new LongSummaryStatistics(min, max);
}

You definitely do not need to use a stream for this. A plain loop is simple and native.

Then, suppose you partition the list:

List<List<Long>> pIds = List.of(List.of(1L,2L,3L,4L,5L),
                                List.of(6L,7L,8L,9L,10L));

The performance will not change: it is still O(n). Indeed, you do not need to create a new list or modify the original one to calculate the min and max.

List<LongSummaryStatistics> data = pIds.stream()
                                       .map(ids -> calcMinMax(ids))
                                       .collect(Collectors.toList());

And finally: if you just partition the list into consecutive sections without reordering the elements, e.g. [1,2,3,4,5,6,7,8,9] -> [[1,2,3,4,5,6], [7,8,9]] for gridSize = 6, then you do not even need to create the partitions, because you can calculate the result on the fly:

List<Long> ids = List.of(1L,2L,3L,4L,5L,6L,7L,8L,9L);
int gridSize = 6;
List<LongSummaryStatistics> data = calcMinMax(ids, gridSize); // [[1,6], [7,9]]



public static List<LongSummaryStatistics> calcMinMax(List<Long> ids,
                                                     int gridSize) {
    List<LongSummaryStatistics> data = new LinkedList<>();
    int i = 0;
    long min = Long.MAX_VALUE;
    long max = Long.MIN_VALUE;

    for (long id : ids) {
        min = Math.min(min, id);
        max = Math.max(max, id);

        i++;

        if (i % gridSize == 0 || i == ids.size()) {
            data.add(new LongSummaryStatistics(min, max));
            min = Long.MAX_VALUE;
            max = Long.MIN_VALUE;
        }
    }

    return data;
}

So you have O(n) time and O(1) auxiliary memory (beyond the output list). Which is pretty good.

Oleg Cherednik

Another approach, using only Java 8, could be as below. The essential part is that you apply the summarizing collector while grouping, instead of first materializing the partition lists. For other ways of partitioning, you can have a look at this answer.

List<Long> idList = findAllIds();
// List<List<Long>> partitionedList = ListUtils.partition(idList, (idList.size() + gridSize - 1) / gridSize);
final AtomicInteger counter = new AtomicInteger();
idList.stream()
        .collect(Collectors.groupingBy(it -> counter.getAndIncrement() / gridSize,
                Collectors.summarizingLong(e -> e)))
        .forEach((k, v) -> repo.callToDb(v.getMin(), v.getMax()));
Sagar

Assuming you are using Apache Commons Collections' ListUtils#partition method... the nested lists it returns are views of the original list (it uses List.subList) and do not create copies of the list elements.
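The "view" behavior is easy to demonstrate: mutating the backing list is visible through the sub-list, which would not happen if the elements had been copied. A small sketch:

```java
import java.util.ArrayList;
import java.util.List;

public class SubListView {
    // Returns the first element of a subList view after mutating the backing
    // list, showing that the view shares storage with the original.
    static long firstOfViewAfterMutation() {
        List<Long> ids = new ArrayList<>(List.of(1L, 2L, 3L, 4L, 5L, 6L));
        List<Long> firstHalf = ids.subList(0, 3); // a view, not a copy
        ids.set(0, 99L);                          // mutate the backing list
        return firstHalf.get(0);                  // the view reflects the change
    }

    public static void main(String[] args) {
        System.out.println(firstOfViewAfterMutation()); // prints 99
    }
}
```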

If you still want a simpler approach (without creating new (Sub)List instances), you can mimic what ListUtils#partition does and use an imperative approach to find the min and max as follows.

But before that, a word on your approach...

Usually we fix the number of elements in a partition (the partition size) and then compute the number of partitions from the size of the list. But in your code, the way you derive the value of the size argument passed to the ListUtils#partition method, you have fixed the number of partitions (= gridSize, which is 3 in the snippet below) and are computing the number of elements per partition. In the answer below, I'm sticking with the same approach as yours.

Here we have a helper function to find the min and max values within a half-open range [start, end) of a list.

// You can replace this with a class for JDK version < 16
record MinMax(Long min, Long max) {}

private static MinMax findMinAndMaxForRange(List<Long> list, int start, int end) {
    if (list.isEmpty() || start < 0 || start >= end || end > list.size()) {
        throw new IllegalArgumentException();
    }
    Long min = list.get(start);
    Long max = list.get(start);
    for (int i = start + 1; i < end; i++) {
        Long element = list.get(i);
        if (min > element) {
            min = element;
        }
        if (max < element) {
            max = element;
        }
    }
    return new MinMax(min, max);
}

Here, we have an IntStream over the range [0, numOfPartition); for each partition index we compute the start and end values and call the helper function above. The min/max values for each partition are then used for the DB call.

int gridSize = 3;
int size = (idList.size() + gridSize - 1) / gridSize;
int numOfPartition = (int) Math.ceil(idList.size() / (double) size);


IntStream.range(0, numOfPartition)
        .mapToObj(index -> findMinAndMaxForRange(idList,
                index * size,
                Math.min((index * size) + size, idList.size())))
        .forEach(minMax -> repo.callToDb(minMax.min(), minMax.max()));

Thiyagu
  • Thank you for the response. If you don't mind, can you explain how we can determine the partition size based on the number of elements in the list? – Vamsi Aug 20 '23 at 05:32
  • @Vamsi What is your requirement? Fix the number of partitions or number of elements in a partition? – Thiyagu Aug 20 '23 at 05:54
  • it's to have a fixed number of partitions. but I got curious to know how we can have a dynamic number of partitions based on list size after reading your response. – Vamsi Aug 20 '23 at 05:58
  • If you fix the number of elements in a partition (say `size`), then the `numOfPartition` would be `(int) Math.ceil(idList.size() / (double) size);` i.e., the list size divided by the partition size, rounded up; the last partition could be smaller when the list is not evenly divisible. – Thiyagu Aug 20 '23 at 06:34
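The formula from the last comment can be sanity-checked with a quick sketch (the numbers here are illustrative, not taken from the question):

```java
public class PartitionCount {
    // numOfPartition = ceil(listSize / size); the last partition may be smaller.
    static int numOfPartitions(int listSize, int size) {
        return (int) Math.ceil(listSize / (double) size);
    }

    public static void main(String[] args) {
        int listSize = 9;
        int size = 6; // fixed number of elements per partition
        int n = numOfPartitions(listSize, size);
        System.out.println(n);                         // 2 partitions
        System.out.println(listSize - (n - 1) * size); // last partition holds only 3 elements
    }
}
```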