2

I'm looking for a sorting algorithm based on subset inversion. It's like pancake sort, only instead of taking all the pancakes on top of the spatula, you can just invert any subset you want. Length of the subset doesn't matter. Like this: http://www.yourgenome.org/sites/default/files/illustrations/diagram/dna_mutations_inversion_yourgenome.png So we can't simply swap numbers without inverting everything in between.

We're doing this to determine how one subspecies of fruitfly can mutate into the other. Both have the same genes but in a different order. The second subspecies' genome is 'sorted', i.e. the gene numbers are 1-25. The first subspecies genome is unsorted. Hence, we're looking for a sorting algorithm.

This is the "genome" we're looking at (though we should be able to have this work on all lists of numbers):

[23, 1, 2, 11, 24, 22, 19, 6, 10, 7, 25, 20, 5, 8, 18, 12, 13, 14, 15, 16, 17, 21, 3, 4, 9];

We're looking at two separate problems:

1) To sort a list of 25 numbers with the least amount of inversions

2) To sort a list of 25 numbers with the least amount of numbers moved We also want to establish both upper and lower bounds for both.

We've already found a way to sort like this by just going from left to right, searching for the next lowest value and inverting everything in between, but we're absolutely certain we should be able to do this faster. However, we still haven't found any other methods so I'm asking for your help!

UPDATE: the method we currently use is based on the above method 
but instead works both ways. It looks at the next elements needed 
for both ends (e.g. 1 and 25 at the beginning) and then calculates 
which inversion would be cheapest. All values at the ends can be 
ignored for the rest of the algorithm because they get put into the 
correct place immediately. Our first method took 18/19 steps and 148 
genes, and this one does it in 17 steps and 101 genes. For both 
optimalisation tactics (the two mentioned above), this is a better 
method. It is however not cheaper in terms of code and processing.

Right now, we're working in Python because we have most experience with that, but I'd be happy with any pseudocode ideas on how we can more efficiently tackle this. If you think another language might be better suited, please let me know. Pseudocode, ideas, thoughts and actual code are all welcome!

Thanks in advance!

2 Answers2

0

Regarding the first question: Do you know (and care about) which of the two strands the genes are on?

If so, you're in luck: This is called the inversion distance between signed permutations problem, and there is a linear-time algorithm for it: http://www.ncbi.nlm.nih.gov/pubmed/11694179. I haven't looked at the details.

If not, then unfortunately (as described on p. 2 of that paper) the problem is NP-hard, so it's very unlikely that any algorithm exists that is efficient (polynomial-time) in the worst case.


Regarding the second question: Assuming you mean that you want to find the minimum number of swaps needed to sort a list of numbers, you should be able to find solutions to this by searching here on SO and elsewhere. I think this is a clear and concise explanation. You can also use the optimal solution to this problem to get an upper bound for your first question: Any swap of positions i and j can be simulated using the two interval reversals (i, j) and (i+1, j-1). (This upper bound might be very bad, though, and in particular could be worse than your existing greedy algorithm.)

Community
  • 1
  • 1
j_random_hacker
  • 50,331
  • 10
  • 105
  • 169
  • in reality we're not really working with actual fruitflies, they're more used as a "story" to make our heuristics problem look more real-world. In reality, we're just looking to sort a given list of 25 numbers. EDIT: whoops, didn't know the enter key was send! Anyway, we are now able to do the sorting in 18 inversions using our first method, but we have been told by our professor that we should be able to do it in 13 inversions. The second question I'll clarify: if you have 1, 4, 6, 5, 3, 2 and you invert indices 2 to 5, you get 1, 4, 2, 3, 5, 6 and that counts as 4 moved numbers (genes) – Merel van den Hurk Apr 24 '16 at 08:09
0

I think what you're looking for for the second question is the minimum number of swaps of adjacent elements to sort a sequence, which is equal to the number of inversions in the sequence (where a[i] > a[j] and i < j).

The first question seems quite a bit more complicated to me. One potential heuristic might be to think of the subset inversion as similar to the adjacent swap of more than one element. For example, if you've managed to get a sequence to this position,

5,6,1,2,3,4,7,8

we can "adjacent swap" indexes [0,1] with [2,3] (so inverting [0,1,2,3]),

2,1,6,5,3,4,7,8

and then [2,3] with [4,5] (inverting [2,3,4,5]),

2,1,4,3,5,6,7,8

and arrive at a sequence that now has significantly less element inversions, meaning less single adjacent swaps are needed to now complete the sort.

So maybe attempting to quantify inversions (in the sense of a[i] > a[j] and i < j) of sections rather than single elements could help move in the direction of estimating or building a method for the first question.

גלעד ברקן
  • 23,602
  • 3
  • 25
  • 61
  • The first question is simply how many times we invert subsets. If we use our first method as stated above, working from left to right, with the given list of numbers/genes, we can do it in 18 inversions, but our professor has told us we should be able to do it in 13 inversions. The size of the inverted subset is, for the first problem, not important. Should we invert the entire sequence, that counts as just one inversion. For the second problem the size *is* important. Should we then invert the entire sequence, that would count as 25 moved genes. We're basically looking at two optimalisations. – Merel van den Hurk Apr 24 '16 at 08:16
  • @MerelvandenHurk I see what you mean, so my suggestion of adjacent swap would not be an accurate lower bound since a subset inversion can include non-adjacent swaps. When you say 18 and 13, I assume that is a specific permutation example - can you please add that example to the question or comment? – גלעד ברקן Apr 24 '16 at 13:37
  • That's a good observation, that does indeed refer to a specific given list. I've added it to the question. We have in the meantime found a way to do it a little quicker, in 17 steps with only 101 moved genes while our original method was 18 or 19 steps and moved 148 genes. Basically what we do is the same, only we work to *both* ends and first calculate which end value is closest to its desired position, i.e. calculate which inversion would be cheapest. An advantage of that is that sometimes by on of those inversions a value that was first far away comes closer to where it's supposed to be. – Merel van den Hurk Apr 30 '16 at 08:28
  • 17 is, however, still not good enough. I've found a way to reduce it to 16 steps and 100 genes, but I don't know how to translate that to code since it was purely my human anticipation at work. It concerned the last four genes to be sorted: 21 20 18 19 Now, as per the above described method, this would take 3 steps (first doing 18, then 21, then 20-19) and moves 7 genes. However, I, as a human, anticipated that if I just inverted 18-19 first, I could then invert all four at once solving the problem. That is 2 steps and 6 genes. I however have no idea how to make the computer see what I saw. – Merel van den Hurk Apr 30 '16 at 08:29
  • @MerelvandenHurk the example of `21 20 18 19` is exactly what I meant by "adjacent swap of more than one element" - the inversion is like swapping the groups [21,20] and [18,19] (as well as reversing the group elements). You can make a computer see that both 18 and 19 are less than 21 and 20 (and then recurse on the subgroups). That part is not so hard - my problem is with something like, `21 20 18 22 19` - how do you know to swap 19/22 to take advantage of the inversion that follows. By the way, you mentioned the actual data example but I don't see it in the question - can you add it? – גלעד ברקן Apr 30 '16 at 18:05
  • I see something went wrong. I was sure I added it to the question but I don't see it either. – Merel van den Hurk May 03 '16 at 10:56
  • Also, I see exactly what you mean. I have the exact same problem. For those four elements it might be doable as you said (I didn't exactly understand what adjacent swap meant, terminology failed me, that's why I didn't get it) but when you have more elements it's becoming increasingly difficult. We're noing trying to find out if we can get some info from the intersections between elements, both based on how many places they are from their desired position and how many elements will have to "cross over" to get to the right place. – Merel van den Hurk May 03 '16 at 11:12