Minimal cost string alignment with only swaps

Question

I've been working on this problem for my own edification, and I can't seem to crack it:

Given two strings that are permutations of the same superset of characters, find the minimum number of swaps needed to align them. The strings are circular (i.e. you may swap the first and last characters), may be upto 2000 characters long, and swaps may be performed on either string.

I tried several different approaches. One way was bubble-sorting both strings and caching all their intermediate states, then finding the intermediate state common to both that minimized the sum of the number of steps for each string to reach that intermediate state. That didn't yield a correct number (I have a few examples with the correct answer) and it obviously wouldn't have worked for very large strings. I'm kinda stumped now. I think I might need to use a modified version of the Demerau-Levenshtein distance, but how do you chose the minimal cost operation when only one type of operation is allowed?

Can someone point me in the right direction?

Do your swaps swap adjacent characters (circularly), or any pair of characters? — user2566092, Mar 04 '14 at 18:36
"swaps may be performed on either string" is a non-restriction, because if you swap in one string, you can swap just as well the same two characters in the other string at some point. — Niklas B., Mar 04 '14 at 18:58
The interesting part of this problem is the circularity. Without it the problem is simply counting number of inversions in a sequence(which can be done effectively using modified merge sort take a look [here](http://stackoverflow.com/a/20990301/812912)). Still the length restriction is quite low so maybe you can perform something for each cyclic shift? — Ivaylo Strandjev, Mar 04 '14 at 19:57
Also if you are allowed to swap any two elements not just adjacent ones [this post](http://stackoverflow.com/a/21455037/812912) will be relevant. — Ivaylo Strandjev, Mar 04 '14 at 19:59
We still have to make the total number of inversions zero but this time you also need to consider the position at an element needs to be put at! Now you can directly push it in the ordinary way swaps or circular,both of which can be calculated in constant time . Now you can keep a fenwick tree . if it comes cricularly increase the index counter of every letter in right by one else wise decrease the counter in left by 1 . Index + getting value for an index % sizeof string might give us correct index of a character in the string! — Shubham Sharma, Jun 29 '15 at 08:27

ldog · Answer 1 · 2014-03-05T02:49:26.813

Consider a dynamic programming solution based on the A* algorithm. The idea is to model this problem as a graph and find the optimal path to the goal node.

First some basic clarifications. Each node in the graph is a string, and two nodes are neighbours if and only if they differ by one (cyclic) swap of adjacent characters. The start node is the first string, and the goal node is the second string. Finding the shortest path from start to goal clearly gives you the optimal solution.

Now a discussion. The A* implementation is pretty generic, the only thing of interest here is the choice of heuristic function. I've implemented two heuristics, one is known to be admissible and therefore A* is guaranteed to the find the optimal path. The other heuristic I implemented I suspect should be admissible. All empirical attempts I have made to show it is non-admissible have failed which builds my confidence in it being admissible. The first heuristic is very slow, it seems to expand an exponential number of nodes with the size of the strings. The second heuristic seems to expand a much more reasonable number (perhaps polynomial in the size of the strings.)

Nevertheless this is a very interesting problem, and once I have a bit more time I may attempt to show theoretically that the second heuristic is justified.

Even the better_heuristic seems to expand a number of nodes exponential in the size of the string which does not bode well for either of these two heuristics.

Code included below. Please feel free to come up with your own heuristics and try them out. These are the ones that came to mind first for me.

#!/usr/bin/python

import random
import string
import copy
from Queue import PriorityQueue as priority_queue

# SWAP AND COPY UTILITY FUNCTION
def swap(c,i,j):
    if j >= len(c):
        j = j % len(c)
    c = list(c)
    c[i], c[j] = c[j], c[i]
    return ''.join(c)

# GIVEN THE GOAL y AND THE CURRENT STATE x COMPUTE THE HEURISTIC DISTANCE
# AS THE MAXIMUM OVER THE MINIMUM NUMBER OF SWAPS FOR EACH CHARACTER TO GET 
# TO ITS GOAL POSITION.
# NOTE: THIS HEURISTIC IS GUARANTEED TO BE AN ADMISSIBLE HEURISTIC, THEREFORE
# A* WILL ALWAYS FIND THE OPTIMAL SOLUTION USING THIS HEURISITC. IT IS HOWEVER
# NOT A STRONG HEURISTIC
def terrible_heuristic(x, y):
    lut = {}
    for i in range(len(y)):
        c = y[i]
        if lut.has_key(c):
            lut[c].append(i)
        else:
            lut[c] = [i]
    longest_swaps = []
    for i in range(len(x)):
        cpos = lut[x[i]]
        longest_swaps.append(min([ min((i-cpos[j])%len(x),(cpos[j]-i)%len(x)) for j in range(len(cpos)) ]))
    return max(longest_swaps)-1

# GIVEN THE GOAL y AND THE CURRENT STATE x COMPUTE THE HEURISTIC DISTANCE
# AS THE SUM OVER THE MINIMUM NUMBER OF SWAPS FOR EACH CHARACTER TO GET 
# TO ITS GOAL POSITION DIVIDED BY SOME CONSTANT. THE LOWER THE CONSTANT
# THE FASTER A* COMPUTES THE SOLUTION, 
# NOTE: THIS HEURISTIC IS CURRENTLY NOT THEORETICALLY JUSTIFIED. A PROOF
# SHOULD BE FORMULATED AND THEORETICAL WORK SHOULD BE DONE TO DISCOVER
# WHAT IS THE MINIMAL CONSTANT ALLOWED FOR THIS HEURISTIC TO BE ADMISSIBLE.
def better_heuristic(x, y):
    lut = {}
    for i in range(len(y)):
        c = y[i]
        if lut.has_key(c):
            lut[c].append(i)
        else:
            lut[c] = [i]
    longest_swaps = []
    for i in range(len(x)):
        cpos = lut[x[i]]
        longest_swaps.append(min([ min((i-cpos[j])%len(x),(cpos[j]-i)%len(x)) for j in range(len(cpos)) ]))
    d = 0.
    for x in longest_swaps:
        d += x-1
    # THE CONSTANT TO DIVIDE THE SUM OF THE MINIMUM SWAPS, 1.5 SEEMS TO BE THE LOWEST
    # ONE CAN CHOOSE BEFORE A* NO LONGER RETURNS CORRECT SOLUTIONS
    constant = 1.5
    d /= constant
    return d


# GET ALL STRINGS ONE CAN FORM BY SWAPPING TWO CHARACTERS ONLY
def ngbs(x):
    n = set() # WE USE SET FOR THE PATHOLOGICAL CASE OF WHEN len(x) = 2
    for i in xrange(len(x)):
        n.add(swap(x,i,i+1))
    return n

# CONVENIENCE WRAPPER AROUND PYTHON's priority_queue
class sane_priority_queue(priority_queue):
    def __init__(self):
        priority_queue.__init__(self)
        self.counter = 0

    def put(self, item, priority):
        priority_queue.put(self, (priority, self.counter, item))
        self.counter += 1

    def get(self, *args, **kwargs):
        _, _, item = priority_queue.get(self, *args, **kwargs)
        return item

# AN A* IMPLEMENTATION THAT USES EXPANDING DATA-TYPES BECAUSE OUR FULL SEARCH
# SPACE COULD BE MASSIVE. HEURISTIC FUNCTION CAN BE SPECIFIED AT RUNTIME.
def a_star(x0,goal,heuristic_func=terrible_heuristic):
    visited = set()
    frontier_visited = set()
    frontier = sane_priority_queue()
    distances = {}
    predecessors = {}

    predecessors[x0] = x0
    distances[x0] = 0
    frontier.put(x0,heuristic_func(x0,goal))

    while not frontier.empty():
        current = frontier.get()
        if current == goal:
            print "goal found, distance: ", distances[current], ' nodes explored: ', len(visited)
            return predecessors, distances
        visited.add(current)

        for n in ngbs(current):
            if n in visited:
                continue

            tentative_distance = distances[current] + 1
            if not distances.has_key(n) or tentative_distance < distances[n]:
                predecessors[n] = current
                distances[n] = tentative_distance
                heuristic_distance = tentative_distance + heuristic_func(n,goal)
                frontier.put(n,heuristic_distance)



# SIZE OF STRINGS TO WORK WITH
n = 10

# GENERATE RANDOM STRING
str1 = ''.join([random.choice(string.ascii_letters + string.digits) for n in xrange(n)])

# RANDOMLY SHUFFLE
str2 = copy.deepcopy(str1)
l = list(str2)
random.shuffle(l)
str2 = ''.join(l)

# PRINT THE STRING FOR VISUAL DISPLAY
print 'str1', str1
print 'str2', str2

# RUN A* WITH THE TERRIBLE HEURISITIC FOR A KNOWN OPTIMAL SOLUTION
print 'a_star with terrible_heuristic:'
predecessors, distances = a_star(str1,str2,terrible_heuristic)
current = str2
while current != predecessors[current]:
    print current
    current = predecessors[current]
print str1

# RUN A* WITH A BETTER HEURISTIC THAT IS NOT JUSTIFIED THEORETICALLY
# TO BE ADMISSIBLE. THE PURPORSE IS TO COMPARE AGAINST THE KNOWN
# ADMISSIBLE HEURISTIC TO SEE EMPIRICALLY WHAT THE LOWEST WE CAN
# GO IS.
print 'a_star with better_heuristic:'
predecessors, distances = a_star(str1,str2,better_heuristic)
current = str2
while current != predecessors[current]:
    print current
    current = predecessors[current]
print str1

Actually, this can be solved using a different graph-theoretic approach much more efficiently using bipartite matching. A* is unnecessary. — ldog, Mar 05 '14 at 03:47

Minimal cost string alignment with only swaps

1 Answers1