This isn't the classic "merge two sorted lists" question, which is fairly trivial to do in linear time.

What I'm trying to do is merge two lists of `(key, value)` pairs, already sorted by `value`, where there are objects with the same `key` in both lists: such objects should have their `value`s merged (added), which may change their sort order. I'm primarily interested in how the sort can be performed efficiently using information from the already-sorted lists, since the sort is the slowest part of this algorithm.
Let's take a concrete example. Imagine a `List` of `Student` objects:
```java
class Student {
    final String name;
    final int score;

    // Minimal constructor; the original snippet elided the class body.
    Student(String name, int score) {
        this.name = name;
        this.score = score;
    }
}
```
Given as input two `List<Student>` sorted by `score`, I'd like to create a new merged list of students, where any student (identified by `Student.name`) appearing in both lists appears once in the final list, with a score equal to the sum of their scores in both lists. The original lists should be left unmodified.
E.g.,
List 1:
{"bob", 20}
{"john", 15}
{"mark", 14}
List 2:
{"bill", 11}
{"mark", 9}
{"john", 1}
Result:
{"mark", 23}
{"bob", 20}
{"john", 16}
{"bill", 11}
The merging itself (identifying students that appear in both lists) can be done in expected O(1) time per element using any O(1) lookup/insert structure such as a `HashMap`. What I'm most interested in is the sort step (although I don't exclude solutions that do the merging and the sorting at the same time).
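To pin down that merge step, here is a minimal sketch, assuming the `Student` constructor shown above (the class and method names are mine, purely for illustration):

```java
import java.util.*;

class MergeStep {
    // Accumulate scores by name in a HashMap: each lookup/insert is
    // expected O(1), so combining both lists is expected O(n) overall.
    static List<Student> merge(List<Student> a, List<Student> b) {
        Map<String, Integer> totals = new HashMap<>();
        for (Student s : a) totals.merge(s.name, s.score, Integer::sum);
        for (Student s : b) totals.merge(s.name, s.score, Integer::sum);

        List<Student> merged = new ArrayList<>(totals.size());
        for (Map.Entry<String, Integer> e : totals.entrySet()) {
            merged.add(new Student(e.getKey(), e.getValue()));
        }
        return merged; // in no particular order; still needs the re-sort
    }
}
```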
The question, though, is how do I efficiently re-sort such a list? The ordering of the existing lists clearly puts some constraints on the final position of elements in the merged list. For example, if a student is at position `i` in the first list and `j` in the second, he must appear among the first `i + j` students in the merged list, since any student who outranks him in the merged result must outrank him in at least one of the input lists, and there are fewer than `i` students above him in the first and fewer than `j` above him in the second. It's not immediately clear whether this information would be useful in sorting the list, however.
You can assume that in many cases students who score highly in one list also score highly in the other. The algorithm should work when that is not the case, but when the correlation does hold it gives you additional information about the distribution that may be useful, on top of the fact that the lists are already sorted.
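One way that correlation could be exploited (a hedged sketch, not a claimed answer; the names and the descending comparator are my assumptions): emit students in classic two-way-merge order by per-list score, so the combined list comes out nearly sorted by total, then let an adaptive sort such as TimSort (which Java's `List.sort` uses) clean up the remaining inversions:

```java
import java.util.*;

class NearlySortedResort {
    // Heuristic sketch: walk both lists in classic merge order (both are
    // sorted descending by score), emit each student once at their first
    // occurrence with their combined total, then run an adaptive sort.
    static List<Student> mergeAndSort(List<Student> a, List<Student> b) {
        Map<String, Integer> totals = new HashMap<>();
        for (Student s : a) totals.merge(s.name, s.score, Integer::sum);
        for (Student s : b) totals.merge(s.name, s.score, Integer::sum);

        List<Student> result = new ArrayList<>(totals.size());
        Set<String> emitted = new HashSet<>();
        int i = 0, j = 0;
        while (i < a.size() || j < b.size()) {
            Student next;
            if (j >= b.size()
                    || (i < a.size() && a.get(i).score >= b.get(j).score)) {
                next = a.get(i++);
            } else {
                next = b.get(j++);
            }
            if (emitted.add(next.name)) {
                result.add(new Student(next.name, totals.get(next.name)));
            }
        }
        // Fix the (hopefully few) inversions introduced by merged scores.
        result.sort(Comparator.comparingInt((Student s) -> s.score).reversed());
        return result;
    }
}
```

When the two rankings agree well, most elements are emitted close to their final position and the adaptive sort does little work; when they don't, it degrades gracefully to an ordinary O(n log n) sort, so correctness never depends on the heuristic.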
It seems like this type of operation would be common for any kind of distributed query + sorting implementation. For example, imagine a "select state,count(*) group by state" type of query issued against a distributed system (to count the number of records in each state): naturally you'd get back a sorted list of (state, count) objects from each node, and then you'd want to merge and re-sort those during the reduce operation. It seems silly to throw away all the work already done on the distributed nodes.
Quantitative Notes
I'm interested in the case where the lists to be merged and re-sorted are small: usually around 256 entries. The range of scores varies, from 0 to 100 in some cases, up to about 0 to 10,000,000 in others. Of course, given the small number of elements, each operation will be fast in absolute terms even with naive algorithms - but performed billions of times, it adds up.
In fact, one of the answers below has proven that you can't, in general, beat a plain sort asymptotically (i.e., taking n to be the combined list size and letting it grow) - but I'm actually more interested in doing this many times, for fixed-size lists, with good empirical performance.
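For completeness, the plain-sort baseline those comparisons are against is just the merge sketch above followed by a full sort from scratch; a minimal version, reusing the hypothetical `MergeStep.merge` from earlier:

```java
import java.util.*;

class PlainSortBaseline {
    // Naive reference point: HashMap merge, then a full descending sort.
    static List<Student> baseline(List<Student> a, List<Student> b) {
        List<Student> merged = MergeStep.merge(a, b);
        merged.sort(Comparator.comparingInt((Student s) -> s.score).reversed());
        return merged;
    }
}
```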