6

What is the best way to sort a collection while updating a progress bar? Currently I have code like this:

for (int i = 0; i < items.size(); i++)
{
    progressBar.setValue(i);

    // Uses Collections.binarySearch:
    CollectionUtils.insertInOrder(sortedItems, item.get(i));
}

This shows progress but the progress bar slows down as the number of items in sortedItems grows larger. Does anyone have a better approach? Ideally I'd like to use an interface similar to Collections.sort() so that I try different sorting algorithms.

Any help would be great!



As a bit of background, this code is pulling back lots of documents (1-10 million) from Lucene and running a custom comparator over them. Sorting them by writing data back onto the disk will be way too slow to be practical. Most of the cost is reading the item off the disk and then running the comparator over the items. My PC has loads of memory so there is no issues relating to swapping to disk, etc.

In the end I went with Stephen's solution since it was very clean and allowed me to easily add a multi-threaded sorting algorithm.

Luke Quinane
  • 16,447
  • 13
  • 69
  • 88
  • 1
    Does your progress bar have some definable max value? Because 30% of a 9000 length array is much different that 30% of a 90 length array. – nearlymonolith Oct 18 '10 at 01:54
  • @Anthony the max value for the progress bar is `items.size()`. I'm typically sorting millions or tens of millions. – Luke Quinane Oct 18 '10 at 01:57
  • I would not choose to sort tens of millions of items in memory. I'd more likely write them to a disk file and call an operating system sort. – Tony Ennis Oct 18 '10 at 02:02
  • As I said in my answer, the issue is that the time to insert a new item depends on the size of the collection, so it will by definition slow as the collection gets larger. See my answer for a more thorough explanation. – nearlymonolith Oct 18 '10 at 02:15
  • It can slowdown because of `item.get(i)` - not all collections are great for indexed access. Would be great if you could include more source code, specifically showing how items collection is declared. – Roman Oct 18 '10 at 04:36
  • During typicaly usage, roughly how many things do you have in sortedItems, and how many in items? There may be some way to get this down from 20 minutes. Can you tell if there's disk swapping going on as a result of the inserts? – johncip Oct 18 '10 at 08:28

7 Answers7

10

You want to be careful here. You've chosen to use an algorithm that incrementally builds a sorted data structure so that (I take it) you can display a progress bar. However, in doing this, you may have chosen a sorting method that is significantly slower than the optimal sort. (Both sorts will be O(NlogN) but there's more to performance than big-O behaviour ...)

If you are concerned that this might be an issue, compare the time to sort a typical collection using TreeMap and Collections.sort. The latter works by copying the input collection into an array, sorting the array and then copying it back. (It works best if the the input collection is an ArrayList. If you don't need the result as a mutable collection you can avoid the final copy back by using Collection.toArray, Arrays.sort and Arrays.asList instead.)

An alternative idea would be to use a Comparator object that keeps track of the number of times that it has been called, and use that to track the sort's progress. You can make use of the fact that the comparator is typically going to be called roughly N*log(N) times, though you may need to calibrate this against the actual algorithm used1.

Incidentally, counting the calls to the comparator will give you a better indication of progress than you get by counting insertions. You won't get the rate of progress appearing to slow down as you get closer to completing the sort.

(You'll have different threads reading and writing the counter, so you need to consider synchronization. Declaring the counter as volatile would work, at the cost of extra memory traffic. You could also just ignore the issue if you are happy for the progress bar to sometimes show stale values ... depending on your platform, etc.)


1 - There is a problem with this. There are some algorithms where the number of comparisons can vary drastically depending on the initial order of the data being sorted. For such an algorithm, there is no way to calibrate the counter that will work in "non-average" cases.

Stephen C
  • 698,415
  • 94
  • 811
  • 1,216
2

Are you able to use an indeterminate progress bar? This still provides some feedback to the user that something is happening. Your code would look like this:

progessbar.setIndeterminate(true);
ArrayList sorted = new ArrayList(items);
Colletions.sort(sorted);

progessBar.setString("Hey you're done!");

I think you're going to get much better performance out of using the built in sort, rather than the binary insertion sort you are doing.

Greg Case
  • 3,200
  • 1
  • 19
  • 17
  • I could use an indeterminate progress bar but its not very friendly. Due to the nature of the items I'm sorting the whole process could take more than 20 minutes. – Luke Quinane Oct 18 '10 at 06:22
1

Why not implement your own merge sort (which is what Collections.sort is doing) and update the progress bar at key points of the algorithm (say, after each merge of more than 5% of the array)?

SimonC
  • 6,590
  • 1
  • 23
  • 40
  • Just about to say the same :) My math might be off, but I think you can up the bar by `((100% / (lg n)) / 2^d` after each merge, where `d` is the recursion depth. It's something like that, anyway. The point is that if you keep track of the depth, you can use it to figure out how much each individual merge operation contributes to the progress. – johncip Oct 18 '10 at 07:56
0

If you're just comparing sort times, print the time before and after the sort.

Predicting how long a sort in the wild will take is difficult. For some sorts it depends on the ordering of the input. I'd use i/(double) items.size() to generate a ratio of the work done and call it a fine day. You might choose to update the bar every items.size()/100 iterations. There no reason to slam the poor progress bar with useless updates.

Tony Ennis
  • 12,000
  • 7
  • 52
  • 73
  • His comments say he is using `Collections.binarySearch`, which states in the Javadoc that the input must be sorted – Phil Oct 18 '10 at 04:29
0

The issue here is the physical mechanism of sorting - as sortedItems grows larger, insertInOrder will, by definition, take longer, as it's most likely an O(n lg n) + O(n) operation (using binary search to find the next smallest item and then inserting the item). It's unavoidable that as your collection grows larger, inserting the next item in the proper place will take longer.

The only way to approximate a progress bar whose time increases linearly would be to use some approximation similar to the inverse of the lg function, as sorting the first 1000 items might take a time similar to sorting the last 10 (that is of course a generalization).

nearlymonolith
  • 946
  • 4
  • 7
0

I may have missed something because nobody else has mentioned it, but it sounds like the runtime types of your source List object is not an implementor of RandomAccess and therefore your Collections.binarySearch invocation is running in O(n) time. That would slow things down quite a bit, very noticeably so, when you so much as double the number of items to sort.

And furthermore, if you are using for example a LinkedList for sortedItems then insertion is also O(n).

If that's the case, it makes perfect sense that when you go from 1 million to 2 million items, your expected time will also roughly double.

To diagnose which of the 2 List objects is problematic

  1. If the progress bar is slow from the start, it's items; try using a different container, something tree-ish or hash-y
  2. If the progress bar gets slower and slower as it gets closer to 100%, it's sortedItems; same advice as above

Note that it can be both Lists that are causing the slowdown. Also this has nothing to do with a progress bar. The problem you described is algorithmic with respect to the sorting, not the updating of a progress bar.

Phil
  • 2,828
  • 1
  • 23
  • 20
0

One simple approach on the progress bar is this.

You can fix the number of calls to update the progress regardless of the item size by using mod. For example,

public void run(int total) {
    int updateInterval = total / 10;
    System.out.println("interval = " + updateInterval);
    for(int i = 0; i < total; i++) {
        if(i % updateInterval == 0) {
            printProgress((float)i / total * 100f);
        }
        // do task here
    }
}

private void printProgress(float value) {
    System.out.println(value + "%");
}

This will update the progress bar 10 times (or 9? check the boundary conditions) whether the size is 10 or 10 million.

This is just an example, adjust the values accordingly.

Adrian M
  • 7,047
  • 7
  • 24
  • 26