7

I have a list of items stored in a remote database which may be unsorted, and I want to sort them. The database accepts commands of them form:

move item1 before item2
move item3 after item2

So, given a list of the form:

[1,3,2,7,6,0,4]

...how can I get the sequence of moves:

move 2 before 3
move 7 after 6
move 0 before 1
move 4 before 6

I assume a modification of the bubblesort algorithm would work, but I'm specifically looking for the most efficient implementation that is still pythonic, and that generates the fewest move commands.

UPDATE: the list is 1000-10000 long, and all items are unique - no repeats. Only a very small number of items - 1-10 - will be in the wrong place at any given time. Time is a concern - it should take seconds, not minutes - but it does not have to be extremely fast.

UPDATE 2: I would also like to move each item only once

Realist
  • 509
  • 2
  • 13
  • How big is your database and how much is computation time a concern? – wnnmaw Jan 28 '14 at 15:35
  • Also, since you want steps to be specified by entry rather than index, are there repeat values? – wnnmaw Jan 28 '14 at 15:37
  • Additionally, is simply rewriting the database in a sorted fashion possible? – wnnmaw Jan 28 '14 at 15:38
  • Updated question with answers to the first two. Rewriting the entire DB sorted is possible but is extremely slow - I want to try shuffling the entries into "sorted" to compare. – Realist Jan 28 '14 at 15:40
  • something confusing - when you say `move 2 before 3`, you are referencing the actual values in the list. does this work on indices, or values? i.e. should the first really be `move 2 before 1` (since 2 is at index 2, and 3 is at index 1)? if indices, then a proper answer needs to keep track of what moving items around will do to the indices. – Corley Brigman Jan 28 '14 at 16:32
  • Sorry I see your point. The actual move command is by value rather than index. – Realist Jan 28 '14 at 16:38
  • If you're only going to be doing at most 10 moves per batch, I'm not sure it's worth coming up with a fancy algorithm to save maybe 1 or 2 moves. Just do an insertion sort and be done with it. With a mostly-sorted list of at most 10,000 elements, the calculations will be done in seconds, no problem. (Well, I will say that it's kind of *fun* to think about how to do better than insertion sort. It's just not necessary from a work or business perspective.) – John Y Jan 28 '14 at 17:56
  • But an insertion sort will move those items many times, won't it? – Realist Jan 28 '14 at 18:18
  • The point of insertion sort is that any item that needs to move will move directly to its final location. The classic inefficiency in the algorithm in old textbooks is that you are given a fixed-size array to work with, and to make room for the moved item, you may need to do a lot of shifting of elements ("swaps"), which is relatively expensive. But presumably your database `move` operation isn't going to work like that. For algorithmic purposes, surely you can think of your database as a linked list, and a `move` as a constant-time operation. – John Y Jan 28 '14 at 18:40
  • In case you missed it, @Abhijit's answer is an insertion sort. – John Y Jan 28 '14 at 21:42
  • @JohnY thanks for the clarification - my sorting algo-fu is weak – Realist Jan 29 '14 at 11:03

3 Answers3

3

As you want to reduce the number of move sequences, the optimal approach I can think of is to use binary search on a sorted list to determine the insertion point of each element. If any of the element is already in its correct position, you need not move it.

This will generate n - d sequence moves where n is the number of elements and d is the number of elements in its correct position.

  • For an already sorted list, number of sequence moves are n - d = n - n = 0
  • For a list where all the elements are in wrong position, number of sequence moves are n - d = n - 0 = n

Implementation

def gen_move(seq):
    from bisect import bisect_left
    out = seq[0:1]
    for elem in seq[1:]:
        index = bisect_left(out, elem)
        if seq[index] != elem:
            if index == 0:
                print "Move {} before {}".format(elem, out[index])
            else:
                print "Move {} after {}".format(elem, out[index - 1])
        out.insert(index, elem)
    print out

Demo

gen_move([1,3,2,7,6,0,4])
Move 2 after 1
Move 6 after 3
Move 0 before 1
Move 4 after 3
[0, 1, 2, 3, 4, 6, 7]

gen_move(range(10)[::-1])
Move 8 before 9
Move 7 before 8
Move 6 before 7
Move 5 before 6
Move 4 before 5
Move 3 before 4
Move 2 before 3
Move 1 before 2
Move 0 before 1
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

gen_move(range(10))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Performance

In [5]: %timeit gen_move(range(10000, 0, -1))
10000 loops, best of 3: 84 us per loop

Time Complexity

sum(1 ln 1 + 2 ln 2 + 3 ln 3 + ..... n ln n) < O(n ln n)

Space Complexity

O(n)
Abhijit
  • 62,056
  • 18
  • 131
  • 204
  • I would be careful with saying things like “optimal approach”; it’s maybe a good idea, but you would have to give a proof that it’s the optimal solution. – poke Jan 28 '14 at 16:16
  • @poke: Please read my second paragraph. Do you need a more elaborate explanation? More over, this is optimal in terms of number of move sequence, but there is a overhead in generating these statements, which was not a constraint for this problem. Moreover as no of insertion is less, and this algorithm directly depends on the number of inversion, this seems to me an ideal approach. I would love to hear your opinion :-) – Abhijit Jan 28 '14 at 16:17
  • 1
    This doesn't produce the optimal move sequence. Consider `[2, 0, 1]`. – user2357112 Jan 28 '14 at 16:34
  • In theory pathological sequences like [6, 4, 2, 0, 1, 3, 5] trip this algorithm up, but as noted in my update, a very small number of items from a large list will be out-of-order at any time. Say 10 of 1000. – Realist Jan 28 '14 at 16:41
  • @Abhijit I was just saying that you should be careful with saying something is *optimal*; it wasn’t meant as an evaluation of your algorithm. Especially with operations that have side effects (moving one element might need to move all others too), “optimal” is difficult to define. – poke Jan 28 '14 at 17:33
  • Minor point, but why do you switch to printing "after" when moving to somewhere other than the beginning? If you had some kind of bidirectional move going, then I'd say there is a minor cosmetic benefit to saying "after" for moves toward the right. But you have a straight-up insertion sort here, with all moves to the left. – John Y Jan 28 '14 at 21:40
  • It doesn't seriously affect the usefulness of your answer, but your time complexity analysis is wrong. Python `list` insert is already O(n), so like any other insertion sort, your time complexity is quadratic. Also, the logarithm for the binary search is base 2, not e, so you shouldn't say `ln`. By convention, Big-O notation always uses `log` (with unspecified base) for any constant base anyway. – John Y Jan 29 '14 at 22:28
2
  1. Get data from remote database
  2. Sort them (just simple sort)
  3. Use difflib SequenceMatcher.get_opcodes to get replace/delete/insert/skip operations that transform original list to sorted list
  4. Transform these operations to "move X after/before Y" operations

I am a little bit worried about time complexity of difflib, so you should benchmark it for the expected data size. Faster alternative could be rolling-checksum algorithm (like librsync).

Messa
  • 24,321
  • 6
  • 68
  • 92
  • `Transform these operations to "move X after/before Y" operations`. Can you please elaborate how this can be achieved ? – Abhijit Jan 28 '14 at 16:15
  • Transforming difflib output to the allowed operations is not guaranteed to be easy, or even possible. – user2357112 Jan 28 '14 at 16:16
  • This does look like an interesting approach, but I did quickly write it up, and tend to agree - in particular the "replace" difflib operation seems to be hard to translate. Thanks for the idea though. – Realist Jan 28 '14 at 16:31
  • This is the seed of a pretty good answer, because `difflib` is likely to find transformations that cost fewer moves than insertion sort, especially as the data gets further away from being already sorted. According to the docs, the time complexity is only quadratic, which is reasonable for how well it works. But to be a really good answer, it does need a bit more guidance on how to extract moves from the opcodes. – John Y Jan 29 '14 at 22:35
  • @Realist: You know you don't have to do anything with "delete" operations, as they are implicit in your database's `move` operation. "insert" operations should be straightforward enough. Think of "replace" as bidirectional "insert". Each "replace" opcode contains one or more elements to move from a to b, and one or more elements to move from b to a. I think that should do it, but I have not tested this myself. – John Y Jan 29 '14 at 22:47
1

This is pretty naive, but seems to work:

xs = [1,3,2,7,6,0,4]
ys = []

for x in xs:
    for n, y in enumerate(ys):
        if x < y:
            print x, 'before', y
            ys.insert(n, x)
            break
    else:
        ys.append(x)

Result:

2 before 3
6 before 7
0 before 1
4 before 6
georg
  • 211,518
  • 52
  • 313
  • 390
  • This won't ever move an item forward. It may produce significantly suboptimal solutions when moving items forward is the most efficient option. – user2357112 Jan 28 '14 at 15:49
  • This is another kind of Python-style insertion sort. Not extremely different from Abhijit's answer, but using sequential search instead of binary search to find where to put the out-of-sequence elements. – John Y Jan 29 '14 at 22:43
  • @JohnY: the real problem with this approach that it doesn't handle `[9,1,2,3]` correctly (nor does the accepted answer). – georg Jan 30 '14 at 09:17
  • 1
    Well, handling it "correctly" can mean different things. Does it always find the optimal set of moves? No. But the algorithm to always find the optimal set of moves is actually quite complicated. Even `difflib` (mentioned in another answer) won't always find the optimal set of moves, and the algorithm it contains (a variation of Ratcliff-Obershelp) is fairly sophisticated. In the real world, handling a problem "correctly" consists of handling it "well enough", and not spending so much time on it that you could have written a master's thesis on the problem. – John Y Jan 30 '14 at 12:33
  • 1
    @JohnY: I reposted this question in a more abstract form: http://stackoverflow.com/q/21452645/989121. If you're interested, feel free to post your solution there. – georg Jan 30 '14 at 12:38
  • Re: the abstractified form of this question. Nicely stated. And the "longest increasing subsequence" thing (first posted answer to that question) seems so obvious once pointed out. It shouldn't be difficult to adapt that to working code. (I don't have a solution yet myself, but if none shows up by the time I get around to it, I'll post something in Python.) – John Y Jan 30 '14 at 14:33