13

This isn't the classic "merging two sorted lists" question, which is fairly trivial to do in linear time.

What I'm trying to do is merge two lists of (key, value) pairs, already sorted by value, where there are objects with the same key in both lists: such objects should have their values merged (added), which may change their sort order. I'm primarily interested in how the sort can be efficiently performed using information from the already sorted lists, since the sort is the slowest part of this algorithm.

Let's take a concrete example. Imagine a List of Student objects:

class Student {
  final String name;
  final int score;
  ...
}

Given as input two List<Student> sorted by score, I'd like to create a new merged list of students, where any student (identified by Student.name) appearing in both lists appears once in the final list, with a score equal to the sum of their scores in the two lists. The original lists should be left unmodified.

E.g.,

List 1:
{"bob", 20}
{"john", 15}
{"mark", 14}

List 2:
{"bill", 11}
{"mark", 9}
{"john", 1}

Result:
{"mark", 23}
{"bob", 20}
{"john", 16}
{"bill", 11}

The merging itself (identifying students that appear in both lists) can be done in expected O(1) time per element using any O(1) lookup/insert structure such as a HashMap. What I'm most interested in is the sort step (although I don't exclude solutions that do the merging and the sorting at the same time).
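
For concreteness, here's a minimal sketch of that merge step (assuming a Student(name, score) constructor for the class above) - the result still has to be sorted by score, which is the part I'm asking about:

import java.util.*;

// Sketch of the merge step only: combine two score-sorted lists into an
// unordered collection, summing scores for students present in both lists.
static Collection<Student> combine(List<Student> list1, List<Student> list2) {
    Map<String, Integer> byName = new HashMap<>();
    for (Student s : list1) byName.merge(s.name, s.score, Integer::sum);
    for (Student s : list2) byName.merge(s.name, s.score, Integer::sum);
    List<Student> merged = new ArrayList<>(byName.size());
    for (Map.Entry<String, Integer> e : byName.entrySet()) {
        merged.add(new Student(e.getKey(), e.getValue()));  // assumes a (name, score) constructor
    }
    return merged;  // in hash order - sorting this efficiently is the actual question
}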

The question though, is how do I efficiently re-sort such a list? The ordering of the existing lists clearly puts some constraints on the final position of elements in the merged list. For example, if a student is at position i in the first list and j in the second, he must appear among the first i + j students in the merged list by a simple argument analyzing the maximum number of students that could have a higher score. It's not immediately clear if this information would be useful in sorting the list, however.

You can assume that in many cases students that score highly in one list score highly in the other. The algorithm should work when that is not the case, but it gives you some additional information about the distribution that may be useful, in addition to the fact that the lists are already sorted.

It seems like this type of operation would be common for any type of distributed query + sorting implementation. For example, imagine a "select state,count(*) group by state" type of query issued against a distributed system (to count the number of records in each state) - naturally you'd get a sorted list of (state, count) objects back from each node, and then you'd want to merge and re-sort those during the reduce operation. It seems silly to throw away all the work already done on the distributed nodes.

Quantitative Notes

I'm interested in the case where the lists to be merged and re-sorted are small: usually around 256 entries. The range of scores varies, from 0 to 100 in some cases, up to about 0 - 10,000,000 in others. Of course, given the small number of elements, each operation will be fast in absolute time, even with naive algorithms - but performed billions of times, it adds up.

In fact, one of the answers below has proven that you can't, in general, do this better than a plain sort for increasing list sizes (i.e., taking n to be the combined list size) - but I'm actually more interested in doing this many times, for fixed size lists, with good empirical performance.

BeeOnRope
  • I'm not following. Inserting into a TreeMap is definitely not O(1), it is O(log n). So n inserts into that map is going to be O(n log n). Also, aside from a formal discussion of complexity, repeated inserting into a `TreeMap` is generally a pretty bad way to sort - slow and very memory hungry. – BeeOnRope Jun 11 '16 at 22:15
  • I don't understand the problem. If you're merge sorting, then at some point during the merge, you'll have a duplicate at the pointers of both lists, at which point you can detect a duplicate and "merge" the two elements into one, advance the pointer of the deleted one and proceed. This wouldn't change the time complexity of the merge algorithm, it would just slow it very slightly. – Bohemian Jun 11 '16 at 22:37
  • I'm not necessarily merge sorting: in fact I haven't specified a sort algorithm (that's kind of the crux of the question). I added a concrete example. It is not true that duplicates will necessarily appear during a traditional merge. Look at the example: "bob", "john" and "mark" would all be popped off the first list before bill in the second list. The second "mark" and "john" would not be detected and merged using that method. – BeeOnRope Jun 11 '16 at 22:45
  • Do you know at the start how long both lists are? Do you know what the maximum and minimum possible scores are? – Jim Mischel Jun 12 '16 at 06:20
  • Yes, and yes. You have List.size() for the size, and since the lists are sorted, the min and max are simply the first and last elements respectively. – BeeOnRope Jun 12 '16 at 06:22
  • I asked because I had an idea for partitioning the initial output, but it just doesn't pan out. I thought I could save the first partitioning pass of Quicksort, but it doesn't save any time in the general case. – Jim Mischel Jun 13 '16 at 17:57
  • Before the example, you describe `I'd like to create the two lists of students…` - the example shows _one_ `Result:`. I find that confusing. – greybeard Dec 06 '16 at 18:05
  • @greybeard - second time's the charm? Hopefully it's clear now... – BeeOnRope Dec 06 '16 at 18:37
  • Please edit the question to include additional information, e.g. what's relevant of bits strewn over [chat](http://chat.stackoverflow.com/rooms/129946/) & comments: like the ones "block quoted" by [Stephen C](http://stackoverflow.com/a/37770914/3789665) lately or `let's say the input lists are immutable (read-only)` – greybeard Dec 08 '16 at 08:33
  • @greybeard - to be clear, the lists being immutable wasn't really a hard requirement. In general though, I'd assume that when someone describes a function of 2 inputs with an output, _unless otherwise stated_, the inputs should not be modified. I'm _also_ interested in cases where things are faster when modifying one of the inputs (i.e., an input List is also the output) - but I suspect that's not really the core of the problem. In any case, I specified now that the input lists are not to be modified. I also added some details on the problem size that interests me. – BeeOnRope Dec 08 '16 at 19:48
  • From [a comment](http://stackoverflow.com/questions/37768906/efficiently-merging-and-re-sorting-sorted-lists/37770914?noredirect=1#comment69302895_37770914): `this sort, repeated billions of times, dominates runtime` - at a rate of one sort a second, a billion lasts for almost 32 years or millennia, depending on 1E9 or 1E12: what _are_ you trying to achieve? You may be asking for a solution for [problem Y](http://xyproblem.info/) (getting a sorted output list for sorted input lists when keys change due to _combination_) when a solution for problem X is needed. – greybeard Dec 08 '16 at 21:20
  • `The range of scores varies` - range of a single score value, or the _count_ of score values (equalling combined input list size)? – greybeard Dec 08 '16 at 21:23
  • @greybeard - yes, of course if a sort took 1 second, the cause would be lost. A sort should take more in the usec range. So billions is no problem (especially spread across cores or hosts). In aggregate I would expect it happens trillions (or more) times, but that's across many machines. I'm well aware of XY - this is very close to the root problem. – BeeOnRope Dec 08 '16 at 22:35
  • @greybeard - by "range" I mean the domain of values that scores take. A single score value is a single value, evidently. The size is something else. What I mean is that if you have 200 scores, what are the max and min scores? It's mostly only important for radix sorts or variants thereof. – BeeOnRope Dec 08 '16 at 22:36
  • Even _if_ the result is requested that often, you would only need to compute it if _inputs_ changed at a similar rate. Otherwise, just _cache like mad_. – greybeard Dec 08 '16 at 23:09
  • @greybeard - this isn't a request/response system. It's part of an offline algorithm. The merge is generally always called with different arguments (except perhaps in degenerate cases which I can detect and solve differently). – BeeOnRope Dec 08 '16 at 23:31
  • Inserting into a HashMap is constant for a single insert, but inserting a whole list into a HashMap is O(n), so the whole insert is not constant time. Using a TreeMap is memory and time consuming, but when n gets higher it is much more convenient in both time and space use. – Massimo Petrus Dec 09 '16 at 07:08
  • Correct, inserting `n` elements is always going to take at least `n` time. I don't think I implied otherwise? It's clear that at a minimum the whole thing will take at least `O(n)` time, since the merge step takes at least that long. – BeeOnRope Dec 09 '16 at 17:05

7 Answers

7

It sounds like you need to use an adaptive sort algorithm.

"A sorting algorithm falls into the adaptive sort family if it takes advantage of existing order in its input. It benefits from the presortedness in the input sequence – or a limited amount of disorder for various definitions of measures of disorder – and sorts faster. Adaptive sorting is usually performed by modifying existing sorting algorithms." - Wikipedia article linked above.

Examples include insertion sort and Timsort; see the article above for more. Note that in Java 8, the Arrays.sort(Object[]) library method uses a modified Timsort.


I am not aware of any published algorithm that deals with the specific requirements of your example, but here is an idea:

  1. Perform a classic merge on the two input lists L1 and L2:

    • When you merge a pair of objects and it changes the keys that determine the ordering, put the merged object into temporary list A.
    • Otherwise put the objects into temporary list B ... which will remain ordered.
  2. Sort the temporary list A.

  3. Merge lists A and B.

Assuming that:

  • the lengths of the original lists L1 & L2 are M & N respectively, and
  • the number of merged objects whose keys changed is R (which is less than max(M, N)),

then the overall complexity is O(M + N + RlogR). If R is small relative to M + N, then this should be an improvement.
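
For illustration, here is a rough sketch of how that idea might be coded (not tested; it assumes a Student(name, score) constructor and pre-computes the combined scores of names present in both lists with a HashMap):

import java.util.*;

// Sketch of the merge / split / sort / merge idea: B stays ordered thanks to
// the classic merge, so only the R combined objects (list A) need an explicit sort.
static List<Student> mergeSplitSort(List<Student> l1, List<Student> l2) {
    // Combined scores for names that occur in both input lists.
    Map<String, Integer> firstScores = new HashMap<>();
    for (Student s : l1) firstScores.put(s.name, s.score);
    Map<String, Integer> combined = new HashMap<>();
    for (Student s : l2) {
        Integer v = firstScores.get(s.name);
        if (v != null) combined.put(s.name, v + s.score);
    }

    List<Student> a = new ArrayList<>();        // combined objects, order unknown
    List<Student> b = new ArrayList<>();        // untouched objects, stays ordered
    Set<String> emitted = new HashSet<>();
    int i = 0, j = 0;
    while (i < l1.size() || j < l2.size()) {    // classic two-pointer merge, descending by score
        Student next = (j >= l2.size() || (i < l1.size() && l1.get(i).score >= l2.get(j).score))
                ? l1.get(i++) : l2.get(j++);
        if (combined.containsKey(next.name)) {
            if (emitted.add(next.name)) {       // emit each combined object exactly once
                a.add(new Student(next.name, combined.get(next.name)));
            }
        } else {
            b.add(next);                        // unique to one list, already in order
        }
    }
    a.sort((x, y) -> y.score - x.score);        // the O(R log R) part

    List<Student> out = new ArrayList<>(a.size() + b.size());
    for (int p = 0, q = 0; p < a.size() || q < b.size(); ) {   // final O(M + N) merge of A and B
        if (q >= b.size() || (p < a.size() && a.get(p).score >= b.get(q).score)) out.add(a.get(p++));
        else out.add(b.get(q++));
    }
    return out;
}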


In your example, every case where there is a match between elements in the input lists is likely to move the element in the order. If it moves the element, it will move to later in the order (and never earlier). So another idea is to do a three-way merge between the original 2 lists and a priority queue. When you get a match, you merge the counts and add the result to the priority queue.

The complexity is similar to the previous idea, but you avoid an extra pass to merge the lists. Also, the RlogR becomes RlogA, where A is the average size of the priority queue.


"Keep in mind that I'm especially interested in the case where R is approximately equal to max(M,N), and also M == N." - (quoting the OP's comment)

(You didn't state that in your question! And, in fact it doesn't make any sense for R to be > min(M,N)!)

In that case, maybe just use the priority queue as an incremental sorter. Throw all merged records and all records that cannot be merged into the queue, and pull out records when they have a key / score that is less than the current heads of the two lists. Assuming that M and N are the list lengths, and A is the average priority queue size, then the complexity is O(max(M,N) * log A). Whether this is an improvement on a simple re-sort will depend on whether the average A is significantly (in Big O terms) less than max(M,N). That will depend on the inputs ... and the merging function.
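
A rough sketch of that incremental variant (again untested; I precompute the combined scores per name, and take "can be pulled out" to mean the record's score is at least the sum of the two current input heads, since with non-negative scores nothing still waiting in the inputs can beat that):

import java.util.*;

// Sketch of the priority-queue-as-incremental-sorter idea, assuming
// non-negative scores and descending output order.
static List<Student> pqMerge(List<Student> l1, List<Student> l2) {
    Map<String, Integer> combined = new HashMap<>();
    for (Student s : l1) combined.merge(s.name, s.score, Integer::sum);
    for (Student s : l2) combined.merge(s.name, s.score, Integer::sum);

    PriorityQueue<Student> heap = new PriorityQueue<>((a, b) -> b.score - a.score);
    List<Student> out = new ArrayList<>(combined.size());
    Set<String> seen = new HashSet<>();
    int i = 0, j = 0;
    while (i < l1.size() || j < l2.size()) {
        Student next = (j >= l2.size() || (i < l1.size() && l1.get(i).score >= l2.get(j).score))
                ? l1.get(i++) : l2.get(j++);
        if (seen.add(next.name)) {                       // first occurrence of this name
            heap.add(new Student(next.name, combined.get(next.name)));
        }
        // Nothing still waiting in the inputs can exceed head1 + head2, so any
        // heap entry at or above that bound is in its final position already.
        int bound = (i < l1.size() ? l1.get(i).score : 0)
                  + (j < l2.size() ? l2.get(j).score : 0);
        while (!heap.isEmpty() && heap.peek().score >= bound) {
            out.add(heap.poll());
        }
    }
    while (!heap.isEmpty()) out.add(heap.poll());        // drain whatever is left
    return out;
}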


"The number (N) varies, but 256 to 1,000 is typical. Perhaps as much as 10,000." - (quoting the OP's comment)

For lists of that typical size, you are down at a level where the complexity analysis is not going to be helpful. But also, you are down at a level where optimization becomes pointless ... unless you are doing the operation many, many times, or on a tight "time budget".


This is all very approximate, and my maths are "sketchy" at best.

A proper investigation would entail hundreds of hours to research, code, test, benchmark, and analyze various alternatives ... and we'd probably still get the answer that it depends on the input data set size and distribution.

Stephen C
  • Yes, although it can be more clever than a normal adaptive algorithm, since rather than just dealing with discovered order, it knows exactly the type of order that is likely to appear, including hard bounds on the final correct location for elements. – BeeOnRope Dec 06 '16 at 17:45
  • Your answer made me find [The Adaptive Priority Queue with Elimination and Combining](http://link.springer.com/chapter/10.1007/978-3-662-45174-8_28) (Calciu, I.; Mendes, H.; Herlihy, M.) – greybeard Dec 06 '16 at 19:09
  • (Our answers have become very similar, including analysis result.) – greybeard Dec 07 '16 at 10:51
  • `When you merge a pair of objects and it changes the keys` the combined object will have a combined key/priority (in the example: `score`). These objects are identified (for combination) by _something other than the key/priority/the sort criterion of the input lists_ (example: `name`), which is why they aren't detected "naturally" during merge. – greybeard Dec 07 '16 at 10:57
  • (You killed a _b_: `of your example, ut here is`.) `every case where there is a match between elements in the input lists is likely to move the element later in the order (and never earlier)` - "mark" from 3/2 to 1, "john" from 2/3 to 3 - and the "expected" index in the result is about the relative position in the input list - with two lists of similar length, about twice the index in input. – greybeard Dec 07 '16 at 12:36
  • Keep in mind that I'm especially interested in the case where R is approximately equal to max(M,N), and also M == N. In that case your suggestion unfortunately degrades to simply "sort temporary list A", and most of the other mechanics are for naught. – BeeOnRope Dec 07 '16 at 18:43
  • R greater than min(M, N) would mean multiple objects with the same id in the longer input list (longer list contains _all id_s from the shorter list: min(M, N) - plus at least one "internal" duplicate…). – greybeard Dec 07 '16 at 21:50
  • We know the input size and score range. The OP commented on my (deleted) answer: "In practice, the range of scores is in the 10,000 or 100,000s and the number of students is < 1,000...perhaps as much as 10,000." Can we really do better than a standard library sort, optimized by the language's developers, of less than 10,000 elements? (+1 btw) – גלעד ברקן Dec 08 '16 at 00:43
  • @greybeard - R greater than `max(M,N)` is not possible in my problem, since all students are unique. Therefore, read "approximately equal" as "generally close to, but never greater than `max(M,N)`". – BeeOnRope Dec 08 '16 at 19:50
  • About the problem size - yes, there are only a few elements - but this sort, repeated billions of times, dominates runtime. It does change the design space a bit: algorithms with small constants will do well here, and complexity analysis alone won't help (but to be fair, that is already true of sorts even of millions or billions of elements). – BeeOnRope Dec 08 '16 at 19:52
  • @גלעדברקן - I'm _sure_ we can do better than the standard library implementors. For one, such sorts can't directly use the information we know about the result of the merge. For another, the algorithms have to be a compromise across various inputs, but here you _know_ the input size (generally small) which is an advantage, and you _know_ the exact sort for the not-common elements, and you know the approximate order for the common elements. – BeeOnRope Dec 08 '16 at 19:57
5

(Dismissing the idea to first merge and then re-sort,) my first stab would be to declare the sorted input lists (semi-static) priority queues and proceed in two phases. To avoid an ambiguity in the term merge, I will call creating/altering an object to represent the values of "common objects" combine/combination; to reduce clutter, I'll denote a priority queue PQ.

  1. identify objects that appear in both/more than one "input queue"
    (in a way of secondary interest here)
    • combine (probably invalidating the position in either list),
    • put them in another (dynamic) PQ (if necessary)
    • remove from/invalidate in the (input) queue(s) where they no longer belong.
  2. Merge the PQs in the usual way

This should work in linear time in the number n of objects, plus O(c log c) for c "common" objects where the combined object would be out of sequence in place of any object combined. (…given expected constant time to (identify and) combine one (set of common) object(s) (see remark about expected O(1) in the question))
Then, I'm afraid that doesn't properly address the main point:

Is there a way to capitalise on the final key to be a (linear, monotone)
combination of at least one ordered sequence and "other values"?
(With lots of common entries - thinking all.)

If combination decreased priority monotonically (in the example, addition of (positive) score values increases priority), do without a combine phase and combine objects when merging PQs, potentially reducing memory as well as time required.
Otherwise, choose one PQ to take objects from (decreasing in priority), to potentially combine with other objects.
The "worst case" would seem priority of the combined objects showing no correlation: I'm afraid the answer is
generally, no. (see user2570465's answer for an explicit argument)
(as BeeOnRope points out, the (sequence of) objects picked being dominated in combination (disadvantageous choice) may actually turn into a good case if that can be detected and exploited.)
Then again, (linear, monotone) combination can be expected to skew the distribution of keys even without (positive) correlation (assumed in the question): be sure to use a (dynamic) PQ implementation where insertion in order is the best case rather than the worst:
For one, take an implicit heap in an array (children of the element at index i are at 2i and 2i+1, or at 2i+1 and 2i+2 "not wasting element 0" at the cost of a bit more index manipulation):
just append items (with a distribution skewed to decreasing priority) to the end:
the expected number of exchanges with parent is below 1 (would be almost 1 without skew).
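
To make that last point concrete, a minimal sift-up insert into an implicit array max-heap (illustration only): with items arriving in roughly decreasing priority, the loop below almost never iterates, so an append costs about one comparison instead of the worst-case log n swaps.

import java.util.Arrays;

// Implicit binary max-heap in an array; children of index i are at 2*i+1 and
// 2*i+2 (0-based, "not wasting element 0").  Insert-only, to show the append cost.
final class IntMaxHeap {
    private int[] a = new int[16];
    private int size = 0;

    void add(int value) {
        if (size == a.length) a = Arrays.copyOf(a, size * 2);
        int i = size++;
        a[i] = value;
        while (i > 0) {                        // sift up
            int parent = (i - 1) / 2;
            if (a[parent] >= a[i]) break;      // usually hit immediately on near-sorted input
            int t = a[parent]; a[parent] = a[i]; a[i] = t;
            i = parent;
        }
    }
}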

greybeard
  • Note that there is no distinction between "existing values" and "values added", so I'm not quite following the _the "worst case" would seem to be that "the increasing values" are dominated by the values added_. Are you referring to the worst case for your algorithm, or the worst case for any algorithm? – BeeOnRope Dec 06 '16 at 18:51
  • I have been offering thoughts about the _worst case_ for _any_ algorithm - with the premiss of `lots of common entries` (_O(n)_). (`existing values`? referring to the values taken from _one_ sorted input list? The `values added` would be from the other list, where they don't need to be in the same order (else the question would be moot to begin with), but are "pre-existing" just as those in the first list.) – greybeard Dec 06 '16 at 19:02
  • Right - the interesting part of the problem is about the common entries. In fact, I think you can assume almost without loss of generality that _all_ entries are in common, since the not-common entries are trivially already sorted, and could be merged into the output of the algorithm that handles the common entries pretty easily. – BeeOnRope Dec 06 '16 at 20:08
  • ... so then I don't totally follow your worst case argument - you treat one set as existing and the other as "added", and point out that when the added sets dominates the existing set doesn't have much information - but you could just as easily reverse this and call the other set existing and then everything is very easy! Or, more likely, you treat the lists symmetrically and the fact that one list sets the sort order is used implicitly. Or something like that :) – BeeOnRope Dec 06 '16 at 20:11
  • `you treat one set as existing` - I don't (search for `existing` in my answer). I _do_ presume to search for duplicates in other lists (boo: I'm thinking _k_ lists instead of two) for keys from _one_ list. – greybeard Dec 06 '16 at 20:12
  • ... I understood that you made a distinction between "existing" and "added" starting from _are dominated by the values added: I'm afraid the answer is generally, no_ in your comment. Perhaps I misunderstood? Anyway the comments above apply regardless - if one of the two lists dominates the final behavior it would seem to be a _good_ case, since as long as you detect it, sorting is very easy. – BeeOnRope Dec 06 '16 at 20:15
  • Let's continue exchanging thoughts not _directly_ pertaining to answer (or question) [in chat](http://chat.stackoverflow.com/rooms/129946/merge-sorted-lists-with-keys-potentially-changing-on-merge). – greybeard Dec 06 '16 at 21:49
  • I awarded this answer the bounty. Although it doesn't go into the details of how the priority queues would work (in particular, how they can be efficiently implemented for almost-sorted data), it seems to come the closest to trying to use the existing sorted values to make things faster. – BeeOnRope Dec 14 '16 at 21:07
5

It looks like you want an O(n) merge like they do with merge sort. I think I may have some bad news for you. I'm going to (hopefully) prove that you cannot do better than O(nlog(n)) for the generalized problem (so, consequently, you should just use any of the optimal O(nlog(n)) solutions presented by others). First, I'll start with the intuition as to why this is the case, and then I'll write an informal proof.

Intuition

The idea is to turn the problem of sorting a list into your problem and show that if you can solve your problem faster than O(nlog(n)), then I can sort any list faster than O(nlog(n)), which we know to be false. We'll just work with integers to keep things simple.

Suppose you have some strange sequence to be sorted: X = 1, 3, 2, -10, 5, 4, 7, 25. I will now construct two lists Dec and Inc. I start with 1 = 1 + 0 (i.e. x_1 = x_1 + 0). Then after that, if x_{i-1} -> x_i is an increase, I subtract 1 from my value in Dec and compute the necessary value in Inc to sum to x_i. If x_{i-1} -> x_i is a decrease, then I add 1 to my value in Inc and compute the necessary value in Dec to sum to x_i. We apply this algorithm to the sequence in the following table:

idx   x     Dec    Inc      
----------------------
 1 |  1  =  1   +  0
 2 |  3  =  0   +  3
 3 |  2  =  -2  +  4
 4 | -10 =  -15 +  5
 5 |  5  =  -16 +  21
 6 |  4  =  -18 +  22
 7 |  7  =  -19 +  23
 8 |  25 =  -20 +  45

Notice that I can convert from sorting to your problem in O(n) - note: reverse Inc in O(n) time to get two decreasing sequences. We can then input to your problem

A = {(1, 1), (2, 0), (3, -2), (4, -15), (5, -16), (6, -18), (7, -19), (8, -20)}
B = {(8, 45), (7, 23), (6, 22), (5, 21), (4, 5), (3, 4), (2, 3), (1, 0)}

Now if you can combine A and B into sorted order by the sum of their values (second element in the ordered pairs), and get something like

C = {(8, 25), (7, 7), (5, 5), (6, 4), (2, 3), (3, 2), (1, 1), (4, -10)}

then you've essentially done an argsort (sort by index) of the initial sequence x_i. So if you solve your problem faster than O(nlog(n)), then I can sort faster than O(nlog(n)) by solving your problem first and then converting the solution to my problem of sorting a list. In particular, I would be sorting with complexity O(n) + O(complexity to solve your problem)

Statement to be Proven

Let your two key-value lists be

A = [(ka_i, va_i) | i = 1..n]
B = [(kb_i, vb_i) | i = 1..m] 

sorted in decreasing order of value. You cannot find the combined list

C = [(ka_i, va_i + vb_j) | ka_i = kb_j]

in faster than O(nlog(n)) time.

Proof Outline

The only assumption this proof makes is that you cannot sort a list faster than O(nlog(n)) time and this proof will proceed by providing a reduction that runs in O(n) time from sorting any arbitrary list to your problem.

In essence, we'll show that if we solve your problem faster than O(nlog(n)), then we can also sort any arbitrary list faster than O(nlog(n)). And we already know it is impossible to sort a list faster than nlog(n), so your desired solution must also be impossible.

Proof Details

For simplicity, we'll take sorting a list of integers. Let S = x_1, x_2, ..., x_n be any sequence of integers. We will now construct two lists, Dec and Inc.

We have three constraints:

  1. Inc is strictly increasing
  2. Dec is strictly decreasing
  3. On iteration i of the algorithm, Inc[j] + Dec[j] = x_j for all j = 1..i-1

As their names imply, Dec will be strictly decreasing and Inc will be strictly increasing. We will maintain the invariant that x_i = Dec[i] + Inc[i] for i = 1..n

Here is the reduction:

# (Assume 1-indexed lists)
1. Initialize Dec = [x_1] and Inc = [0]
2. For i = 2..n:
    a. if x[i] > x[i-1] then
          Dec.append(Dec[i-1] - 1)
          Inc.append(x_i - Dec[i])
       else   # We must have x[i] <= x[i-1]
          Inc.append(Inc[i-1] + 1)
          Dec.append(x_i - Inc[i])

3. Create list A and B:
    A = [(i, Dec[i]) | i = 1..n]
    B = [(i, Inc[i]) | i = 1..n]
4. B = reverse(B) # Reverse B because B was in increasing order and we
                  # need both lists to be in decreasing order
5. A and B are inputs to your algorithm.
  If your algorithm can combine A and B into sorted order,
  then we have also sorted S (via argsort on the keys).

You're probably also hungry for a proof that my ad hoc method of choosing to increase Inc by 1 or decrease Dec by 1 works. Well here's an informal "proof" (you can formalize it by using induction):

Case x_{i} > x_{i-1}

Recall that in this case, we choose to decrement Dec by 1. We are given that x_{i} > x_{i-1} and we know that Dec_{i-1} + Inc_{i-1} = x_{i-1}. We can also say that (Dec_{i-1} - 1) + (Inc_{i-1} + 1) = x_{i-1}.

Since x_{i} > x_{i-1}, we must have x_{i} >= x_{i-1} + 1. Therefore, x_{i} >= (Dec_{i-1} - 1) + (Inc_{i-1} + 1). Therefore, if we only decrement Dec by 1, we will be forced to add at least 1 to Inc, so Inc remains strictly increasing.

Case x_{i} ≤ x_{i-1}

Recall that in this case, we choose to increment Inc by 1. We are given that x_{i} <= x_{i-1} and we know that Dec_{i-1} + Inc_{i-1} = x_{i-1}. We can also say that (Dec_{i-1} - 1) + (Inc_{i-1} + 1) = x_{i-1}, and since x_{i} <= x_{i-1}, it must be the case that (Dec_{i-1} - 1) + (Inc_{i-1} + 1) >= x_{i}. Therefore, if we add 1 to Inc, we are sure that we must subtract at least 1 from Dec, so Dec remains strictly decreasing.

Conclusion

Your problem cannot be done faster than O(nlog(n)). You are better off just combining into a HashMap and then sorting its elements in O(nlog(n)) because it is impossible to find a faster solution.

Feel free to comment, though, if you find a problem with the reduction or have questions. I'm pretty sure it's correct. Of course, if I'm mistaken about sorting being no faster than O(nlog(n)), this whole proof falls apart, but last I checked, someone already proved that O(nlog(n)) was the fastest complexity for sorting. Comment if you prefer a formal reduction. It's getting late right now for me and I skipped some "formalizations", but I can edit them in when I get a chance.

If you code up the algorithm for creating the reduction, you may gain a better understanding.
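
For example, here is a small sketch of that construction (plain int arrays; pairing index i with Dec[i], and with Inc[i] reversed, then gives the two value-sorted inputs A and B):

// Build Dec (strictly decreasing) and Inc (strictly increasing) with
// Dec[i] + Inc[i] == x[i] for every i, as in the table above.
static int[][] buildReduction(int[] x) {
    int n = x.length;
    int[] dec = new int[n];
    int[] inc = new int[n];
    dec[0] = x[0];                       // x_1 = x_1 + 0, as in the worked example
    inc[0] = 0;
    for (int k = 1; k < n; k++) {
        if (x[k] > x[k - 1]) {           // increase: push Dec down by 1
            dec[k] = dec[k - 1] - 1;
            inc[k] = x[k] - dec[k];
        } else {                         // decrease (or equal): push Inc up by 1
            inc[k] = inc[k - 1] + 1;
            dec[k] = x[k] - inc[k];
        }
    }
    return new int[][] { dec, inc };
}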

Also: see this post if you want an explanation of the O(nlog(n)) bound on sorting: What are the rules for the "Ω(n log n) barrier" for sorting algorithms?

user2570465
  • Strictly speaking it's impossible to merge students in O(n) in worst case (ignore scores!). Students come in random order, but you need to create correlation. Nevertheless, a hashmap with proper size will solve this merge faster than O(n*logN), even not going around. Another part can be done in O(n) for some cases: for small scores (0-100 limits) a bin sort (bins for 0-200) will solve the problem in O(n). So... it's not that hopeless in general. – Alexander Anikin Dec 09 '16 at 08:42
  • I agree that it's impossible to merge randomly ordered students in O(n). However, it *may* be possible to merge them in O(n) given the partial ordering the OP provided. When you have the two lists of students sorted by decreasing order of score, the students are no longer "randomly ordered." And I think OP was hoping that could lead to an O(n) algorithm. Unfortunately, I think this proves it's not possible. – user2570465 Dec 09 '16 at 14:49
  • You can perhaps optimize it to O(n) with smaller scores, but OP says they go up to 10,000,000, so that's not feasible. If they were limited to 0-100, yes a bucket sort would solve this in O(n). Just hash the sum of the scores for the students, and you can order them from 0-100 in O(100n) = O(n). – user2570465 Dec 09 '16 at 14:55
  • It's certainly possible the merge the students in `O(n)` _expected_ time, since you can insert into a hash map in `O(1)` expected time. In _practice_ for my problem, we can also merge all the "students" in worst-case `O(n)` time since there is a simple mapping from student to small unique integer, so we can easily create a table without collisions (except of course when the students are the same). Basically a perfect hash function. – BeeOnRope Dec 09 '16 at 17:11
  • The other key problem is you assumed I was interested in the worst case behavior of my algorithm, when I specifically laid out the _expected_ conditions, such as correlation between student scores from one list to the next. So wouldn't an expected case analysis be better? Of course, I don't actually expect that - it is quite hard (and you already went beyond the call of duty with your awesome proof) - but just because you have a theoretical organization of input lists which are _difficult_ to sort doesn't mean it gives an interesting insight into the expected case. – BeeOnRope Dec 09 '16 at 17:25
  • Finally, be careful with assumptions. You say "... it looks like you want a O(n) merge like they do with merge sort." Absolutely not! I want a sort that is fast, period. As measured in wall-clock time. An algorithm with terrible worst-case performance which performs well in practice would be awesome (you can always put in a failsafe that detects when you are heading down the worst case path and then fall back to an algorithm with better worst-case behavior). So complexity analysis is only a tool that *might* help here, and worst-case complexity analysis may not have any nails to hammer at all. – BeeOnRope Dec 09 '16 at 17:28
  • The above proof is for the **average** case which is what you want. With regards to your comment about merging the students in O(n) expected time, that is **impossible**. If you can do that, I'll be sorting my lists in O(n) expected time as well. Theory lays out very hard boundaries. It's your choice whether you accept them or not. – user2570465 Dec 09 '16 at 17:53
  • Overall, I think you are being very unreasonable by finding so many answers "unsatisfactory". Here is what you said to bsd that made me consider your problem in the first place: "I don't think any answer that gets n*logn is an achievement". I suspected there was no better solution than nlog(n) and I just proved it to you. So I think it would be more appropriate if you give the proof more weight now, as you are contradicting yourself. – user2570465 Dec 09 '16 at 17:54
  • If absolute runtime is so critical to your program, I also don't think Java is the right language for this problem. I highly suggest you use C/C++ for this problem (and add the C/C++ tag), as that will give you a huge improvement to your runtime. – user2570465 Dec 09 '16 at 17:59
  • I think Java is a fine language for this problem (and question), but absolutely the final algorithm may run in C++! Luckily, I expect most solutions to be fairly language agnostic. To be clear, this proof is a very important contribution to this answer, because it at least shows that you aren't going to come up with any guaranteed linear-scan like solution to the general case, and it helps guide the implementations in that way. I will have to digest your proof more carefully to understand that it applies to the average case, but to claim that you have to assume some distribution of the input? – BeeOnRope Dec 09 '16 at 18:05
  • About the merging in `O(n)` perhaps we are misunderstanding each other. By "merging" I mean taking two lists and creating an output list (in no particular order) with common elements combined (i.e., scores added). I'm not talking about the sorting. Do you agree that can be done in `O(n)`, at least for easily hashed inputs like I suppose. I wasn't claiming that the re-sorting can be done in `O(n)` time! – BeeOnRope Dec 09 '16 at 18:07
  • About "abosolute runtime", isn't that the only thing people really care about when talking about efficiently (perhaps in addition to related metrics like power use, memory consumption)? I mean people don't run algorithms with good theoretical complexity because they want to run algorithms with good theoretical complexity - they do it because the good complexity leads to good runtimes! – BeeOnRope Dec 09 '16 at 18:08
  • Thank you for being willing to give the proof more consideration. I believe the proof applies to the average case of sorting because it converts an "average" list to be sorted to your problem. I can't determine whether or not those mappings hit the "average" case in your problem - do you have a definition of your "average" case in your problem? Also, if you have a specific distribution of your two sets, you may be able to get improved results. For example, if at most 5 (or some constant) keys are common between the two lists, the problem becomes O(n). – user2570465 Dec 09 '16 at 18:17
  • Yes, we were misunderstanding about merging in O(n). I agree you can combine them into a HashMap. – user2570465 Dec 09 '16 at 18:23
  • As for "absolute runtime", I agree good complexity can lead to good runtime. But "runtime" is not uniform for different people. If I use a supercomputer that costs $1 billion, of course I'll beat you, even with worse complexity, if you have a cheap $500 laptop. So if all we considered was runtime, a invalid solution to your problem would be "just get a more expensive computer. Build all your memory with SRAM instead of DRAM. etc." – user2570465 Dec 09 '16 at 18:30
  • Also back to the average case, I think the best runtime you can achieve on your program will depend not on how many entries you have or on the size of the scores, but on the number of common keys between the two sets. You can see this is the case because if they shared no common keys, then you could merge them like in merge sort. With 1 common key, you could probably find the common key, do the O(n) sorted merge on the other keys, and insert the common key where it belongs. The proof relies on the fact that you can have N shared keys among the lists. – user2570465 Dec 09 '16 at 18:35
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/130264/discussion-between-beeonrope-and-user2570465). – BeeOnRope Dec 09 '16 at 23:59
  • I wanted to add that this is one of the best answers I've gotten to _any_ question on SO. I sound really negative above - the only real complaint is that it's answering a slightly different question than what I have (what's the lower-bound time complexity of this type of merge & sort, whereas I want to know how to implement it in a short runtime)... but I should have started with the positive: not too many answers come with a proof, especially not one which isn't just a copied or simplified version of an existing one. – BeeOnRope Dec 10 '16 at 00:07
  • Your proof by reduction is great! Before posting this question I had already informally decided that the lower bound (if using comparison) for this was likely to be nlogn, based on the intuition that any possible ordering of the students was possible (at least for inputs where the students were reversed like that) and hence a full sort was needed. Just reducing an arbitrary sort to my problem is a perfectly clean formal way of showing that! – BeeOnRope Dec 10 '16 at 00:29
0
  1. Maintain a map keyed by something unique about each student (the name), mapping to the accumulated score.

    Map<String, Integer> scores = new HashMap<>();

  2. Iterate through both lists and accumulate the scores in the map

    for (Student s : list1) {     // repeat the same loop for list2
        if (scores.containsKey(s.name)) {
            scores.put(s.name, s.score + scores.get(s.name));
        } else {
            scores.put(s.name, s.score);
        }
    }

  3. Sort the entries using Java 8 streams, highest score first

    List<Student> result = scores.entrySet()
      .stream()
      .sorted((e1, e2) -> e2.getValue() - e1.getValue())
      .map(e -> new Student(e.getKey(), e.getValue()))
      .collect(Collectors.toList());


This is still O(N log N).

You cannot sort it using the standard merge algorithm because the lists contain names whose positions are not the same. The standard merge algorithm does not process the same element twice. After finding the duplicate and adding the student's score, you need to re-sort. You are breaking the precondition for merge sort that both lists are sorted at all times by their values.

bsd
  • Just to be clear, I never mentioned "merge sort" so I'm not sure why it is implied that I'm using it. Your approach seems very slow since the map has a random ordering and hence throws away all the information in the two sorted lists. I appreciate the details on how to merge the objects themselves, but that is trivial - I mentioned the `HashMap` based approach in my question. I will try to clarify that the question is about how to use the information from the already sorted lists to efficiently sort the merged list. – BeeOnRope Jun 12 '16 at 01:08
  • It is not very slow. I don't think you can get a O(N) approach. Your lists have incomplete info and positions can change. – bsd Jun 12 '16 at 15:13
  • It should be noted that there is a pretty big gap between O(n) and O(n*log(n)) and that in particular many of the partially sorted list algorithms complete in O(n*log(k)) where k is some parameter defining how sorted a list is. That aside, I'm also not _only_ interested in the complexity of the solution - a solution which is much faster in practice is also desirable. I don't think any answer that gets n*logn is an achievement: _any_ answer that falls back on any Java sort routine in the end gets that complexity. – BeeOnRope Jun 13 '16 at 20:37
  • ... and so within the huge space of answers which all fall into O(n*log(n)) and don't go out of their way to introduce some unnecessary bottleneck, one that effectively randomizes the order of both lists (as your merge step does), completely throwing away all ordering information is among the slowest approaches. – BeeOnRope Jun 13 '16 at 20:41
  • k could be as large as n. I am interested in seeing your final algorithm. – bsd Jun 14 '16 at 01:32
  • Of course, yes. So in the edge case that `k==n` all comparison based algorithms cannot do better than O(NlogN). Just like in the edge case that Bogosort selects the best permutation right off the bat it is O(1). Keep in mind that k (all elements within a -k,+k sized region of their final position) is *one* way of characterizing partially sorted arrays, which may not be ideal here. In particular, for this problem you generally have more range information about the elements at the start and end of the list than those in the middle. – BeeOnRope Jun 14 '16 at 02:10
0

It seems to me that any solution should generally fall in the category of O(n*log(n)) complexity (with n= length(L1)+length(L2), or n=max(length(L1), length(L2))).

My basic algorithm would be as follows

  Let's use two intermediate structures:
  - a TreeSet R, which guarantees ordering by rank,
  - a HashMap M, which guarantees constant-time insertion and retrieval
  Call R's size n

  1 for each student in each list
      1.1 find the student in M by name (O(1))
      1.2 if the student is found
         1.2.1 find the student in R by its rank (O(log(n)))
         1.2.2 remove the student from R (O(log(n)))
         1.2.3 update the student's rank
      1.3 else
        1.3.1 put the student in M (O(1))
      1.4 put the student in R (O(log(n)))
  2 At the end (if needed) transform the TreeSet into a list

The overall O complexity is O(n*log(n)),

Assuming L1 is the longer of the 2 lists, a small optimization would be to skip the lookup while traversing L1; the O complexity is the same, but you'll perform fewer operations in absolute terms. The best case is of course when Len(L1) >> Len(L2).

There may be more complex solutions or better data structures to reduce the number of operations, but I don't think there can be a better O complexity, as basically you have 2 possibilities:

1- maintain the result list ordered: scan the lists, find matches, and recompute the position each time

2- use an intermediate map to lower the match-finding complexity, then sort the result

Both possibilities usually work out to O(n*log(n))
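
A rough sketch of the algorithm above (illustrative only; it assumes an immutable Student with a (name, score) constructor and breaks ties in score by name so that equal scores can coexist in the TreeSet):

import java.util.*;

// Sketch of the HashMap + TreeSet bookkeeping: M gives O(1) lookup by name,
// R keeps the partial result ordered by score (descending) at all times.
static List<Student> treeMerge(List<Student> l1, List<Student> l2) {
    Map<String, Student> m = new HashMap<>();
    TreeSet<Student> r = new TreeSet<>((a, b) -> {
        int c = Integer.compare(b.score, a.score);       // descending by score
        return (c != 0) ? c : a.name.compareTo(b.name);  // tie-break on name
    });
    for (List<Student> list : Arrays.asList(l1, l2)) {
        for (Student s : list) {
            Student prev = m.get(s.name);                                    // O(1)
            Student merged = (prev == null) ? s : new Student(s.name, prev.score + s.score);
            if (prev != null) r.remove(prev);            // O(log n): drop the stale entry
            m.put(s.name, merged);
            r.add(merged);                               // O(log n)
        }
    }
    return new ArrayList<>(r);                           // already in final order
}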

Massimo Petrus
  • `use [a] TreeSet R, [ordered] by rank` & `find the student in R (O(log(n))` - _how_? If `R` compares on rank/score/priority, it doesn't support _find by name_. – greybeard Dec 09 '16 at 07:37
  • You're right. For that you need a TreeSet and a HashMap together. Then you can find the existing item in constant time from the map and get it by rank in O(log(n)) time from the TreeSet. I'll change my answer – Massimo Petrus Dec 09 '16 at 07:43
  • (`find the student in R by its rank (O(log(n))` presuming quasi-unique _rank_. With few values to pick, "finding the rank" gets faster, finding the student among all that have a score of, say, 42 doesn't.) – greybeard Dec 09 '16 at 08:18
  • In this case it should be easy to transform the rank value into a unique one, for example by appending the student name/code or a possible student id to the rank, or even the result of a function based on the student name. For simplicity I assumed there's a way to do it. In this case we may consider the complexity as depending on the name's length too. But I believe this does not change the matter a lot from a 'formal' point of view – Massimo Petrus Dec 09 '16 at 08:42
0

As I see it, the fact that the lists are already sorted by score does not help, since first we need to merge the scores.

Also, while using a hash-map may seem to provide an O(1) seek, as per my understanding the overhead involved in creating the hashmap means that in terms of overall throughput the efficiency will still not be as good (when compared to the approach below).

The approach would be as follows:

  1. Apply inplace-binary-most-significant-bit-radix-sort on List-1 and List-2 combined.
  2. Students whose name appears twice will then be adjacent; merge such entries.
  3. Finally use an inplace-binary-most-significant-bit-radix-sort (as above) on the scores of students in the merged list (such that the score-and-student pair are re-arranged as appropriate).

Update #1 : The sort in step 1 is on the student name.
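
For what it's worth, here is a compact sketch of an in-place binary most-significant-bit radix sort on non-negative int keys (an illustration of the technique only; step 1 would need a string variant keyed on the names, and step 3 would run this on the merged scores, reversing the result or flipping the bit test for descending order):

// In-place binary MSD radix sort ("binary quicksort"), ascending, for
// non-negative keys.  For scores up to ~10,000,000 start with bit = 23.
static void binaryMsdSort(int[] a, int lo, int hi, int bit) {
    if (hi - lo <= 1 || bit < 0) return;
    int i = lo, j = hi - 1;
    while (i <= j) {                              // partition on the current bit
        if (((a[i] >>> bit) & 1) == 0) {
            i++;                                  // 0-bit: belongs in the left half
        } else {
            int t = a[i]; a[i] = a[j]; a[j] = t;  // 1-bit: swap to the right half
            j--;
        }
    }
    binaryMsdSort(a, lo, i, bit - 1);             // recurse on the 0-bit half
    binaryMsdSort(a, i, hi, bit - 1);             // recurse on the 1-bit half
}

// usage: binaryMsdSort(scores, 0, scores.length, 23);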

Ravindra HV
  • Promising, if not building on pre-existing order in the input. – greybeard Dec 09 '16 at 21:17
  • Is the sort (1) on the scores or on on the student names? Sorting on the student names, just to merge them, seems like a bad idea. Radix sort, on the other hand, seems like it might be a good idea. – BeeOnRope Dec 10 '16 at 00:03
  • @BeeOnRope It is on the student names. Have updated the answer. As per understanding, the first requirement is to merge the scores based on names. As such the fact that the scores in the sub-lists are sorted does not hold any relevance as per my understanding. – Ravindra HV Dec 10 '16 at 12:02
  • It's relevant because the sort of the final list should also be by score. Sorting by names is never _necessary_ - although you could use it for merging I suppose. Merging is simpler than sorting however and it probably makes sense to use a straightforward `O(n)` algorithm to do it. – BeeOnRope Dec 10 '16 at 16:16
  • As [user2570465](http://stackoverflow.com/users/2570465/user2570465) is trying to explain, what you are looking for may not be possible (getting the task done in `O(n)`). – Ravindra HV Dec 11 '16 at 19:50
0

Try it:

//Class Student modified.

public class Student {

        String name = "";
        int score = 0;

        public Student(String name, int score) {
            this.name = name;
            this.score = score;
        }

        @Override
        public boolean equals(Object v) {
            if (v instanceof Student) {
                return this.name.equals(((Student) v).name);
            } else if (v instanceof String) {
                return this.name.equals(String.valueOf(v));
            } else {
                return false;
            }
        }

        @Override
        public int hashCode() {
            int hash = 7;
            hash = 67 * hash + Objects.hashCode(this.name);
            return hash;
        }
    }

//Class CustomComparator to sort a list by name or by score

public class CustomComparator implements Comparator<Object> {

        public int orderby = 0;

        @Override
        public int compare(Object o1, Object o2) {
            Student st1 = (Student)o1;
            Student st2 = (Student)o2;
            if (orderby==0){
                //order by name.
                return st1.name.compareTo(st2.name);
            }else{
                //order by score.
                Integer a=st1.score;
                Integer b = st2.score;
                return a.compareTo(b);
            }

        }
    }

//Example

List<Student> A = new ArrayList<Student>();
A.add(new Student("bob", 20));
A.add(new Student("john", 15));
A.add(new Student("mark", 14));

List<Student> B = new ArrayList<Student>();
B.add(new Student("bill", 11));
B.add(new Student("mark", 9));
B.add(new Student("john", 1));

List<Student> merge = new ArrayList<Student>();
merge.addAll(A);
merge.addAll(B);

//Copy.
List<Student> result = new ArrayList<Student>();
for (Student st : merge) {
    if (result.contains(st)) {
        for (Student r : result) {
            if (r.equals(st)) {
                System.out.println(st.score + " > " +r.score);
                //Set the best score
                if (st.score > r.score) {
                    r.score = st.score;
                    break;
                }
            }
        }
    } else {
        result.add(st);
    }
}

//Sort result by name.
CustomComparator comparator = new CustomComparator();
comparator.orderby = 0; // set to 1 to sort by score.
Collections.sort(result, comparator);
for (Student r : result) {
    System.out.println(r.name + " = " + r.score);
}

//The result example:

bill = 11 | bob = 20 | john = 15 | mark = 14

toto
  • `result example: bill = 11 | bob = 20 | john = 15 | mark = 14 ` awesome. Is this output list supposed to be `re-sorted`? – greybeard Dec 10 '16 at 20:55
  • Hello, in this example, the result is sorted by the student's name, so as to show alphabetically in the print. But in itself it is not necessary. – toto Dec 10 '16 at 23:08