Finding the minimum number of swaps to convert one string to another, where the strings may have repeated characters

Question

I was looking through a programming question, when the following question suddenly seemed related.

How do you convert a string to another string using as few swaps as follows. The strings are guaranteed to be interconvertible (they have the same set of characters, this is given), but the characters can be repeated. I saw web results on the same question, without the characters being repeated though. Any two characters in the string can be swapped.

For instance : "aabbccdd" can be converted to "ddbbccaa" in two swaps, and "abcc" can be converted to "accb" in one swap.

Thanks!

I may be quite wrong here but isn't the minimum number of swaps equal to `[(the amount of character mismatches between both strings) / 2]`? — Geeky Guy, Aug 17 '13 at 18:48
In any case calculating the [Levenshtein Distance](http://en.wikipedia.org/wiki/Levenshtein_distance) between both strings will give you the amount of swaps (because in your particular case, all the operations detected by Levenshtein algorithm are swaps). — Geeky Guy, Aug 17 '13 at 18:49
@P0W two character mismatches, one swap solves it. Levenshtein distance is also two... Oh I see my mistake. Any Levenshtein distance calculated here would actually be double the right answer. — Geeky Guy, Aug 17 '13 at 19:06
@P0W So I'm pretty bad at counting. My point still stands :p — Geeky Guy, Aug 18 '13 at 03:07

score 8 · Accepted Answer · edited May 23 '17 at 12:01

This is an expanded and corrected version of Subhasis's answer.

Formally, the problem is, given a n-letter alphabet V and two m-letter words, x and y, for which there exists a permutation p such that p(x) = y, determine the least number of swaps (permutations that fix all but two elements) whose composition q satisfies q(x) = y. Assuming that n-letter words are maps from the set {1, ..., m} to V and that p and q are permutations on {1, ..., m}, the action p(x) is defined as the composition p followed by x.

The least number of swaps whose composition is p can be expressed in terms of the cycle decomposition of p. When j₁, ..., j_k are pairwise distinct in {1, ..., m}, the cycle (j₁ ... j_k) is a permutation that maps j_i to j_{i + 1} for i in {1, ..., k - 1}, maps j_k to j₁, and maps every other element to itself. The permutation p is the composition of every distinct cycle (j p(j) p(p(j)) ... j'), where j is arbitrary and p(j') = j. The order of composition does not matter, since each element appears in exactly one of the composed cycles. A k-element cycle (j₁ ... j_k) can be written as the product (j₁ j_k) (j₁ j_{k - 1}) ... (j₁ j₂) of k - 1 cycles. In general, every permutation can be written as a composition of m swaps minus the number of cycles comprising its cycle decomposition. A straightforward induction proof shows that this is optimal.

Now we get to the heart of Subhasis's answer. Instances of the asker's problem correspond one-to-one with Eulerian (for every vertex, in-degree equals out-degree) digraphs G with vertices V and m arcs labeled 1, ..., m. For j in {1, ..., n}, the arc labeled j goes from y(j) to x(j). The problem in terms of G is to determine how many parts a partition of the arcs of G into directed cycles can have. (Since G is Eulerian, such a partition always exists.) This is because the permutations q such that q(x) = y are in one-to-one correspondence with the partitions, as follows. For each cycle (j₁ ... j_k) of q, there is a part whose directed cycle is comprised of the arcs labeled j₁, ..., j_k.

The problem with Subhasis's NP-hardness reduction is that arc-disjoint cycle packing on Eulerian digraphs is a special case of arc-disjoint cycle packing on general digraphs, so an NP-hardness result for the latter has no direct implications for the complexity status of the former. In very recent work (see the citation below), however, it has been shown that, indeed, even the Eulerian special case is NP-hard. Thus, by the correspondence above, the asker's problem is as well.

As Subhasis hints, this problem can be solved in polynomial time when n, the size of the alphabet, is fixed (fixed-parameter tractable). Since there are O(n!) distinguishable cycles when the arcs are unlabeled, we can use dynamic programming on a state space of size O(mⁿ), the number of distinguishable subgraphs. In practice, that might be sufficient for (let's say) a binary alphabet, but if I were to try to try to solve this problem exactly on instances with large alphabets, then I likely would try branch and bound, obtaining bounds by using linear programming with column generation to pack cycles fractionally.

@article{DBLP:journals/corr/GutinJSW14,
  author    = {Gregory Gutin and
               Mark Jones and
               Bin Sheng and
               Magnus Wahlstr{\"o}m},
  title     = {Parameterized Directed \$k\$-Chinese Postman Problem and \$k\$
               Arc-Disjoint Cycles Problem on Euler Digraphs},
  journal   = {CoRR},
  volume    = {abs/1402.2137},
  year      = {2014},
  ee        = {http://arxiv.org/abs/1402.2137},
  bibsource = {DBLP, http://dblp.uni-trier.de}
}

Thanks David. I did not read your answer before posting my comment. I will go through the paper. — Subhasis Das, May 16 '14 at 05:49

score 4 · Answer 2 · edited May 23 '17 at 12:08

4

You can construct the "difference" strings S and S', i.e. a string which contains the characters at the differing positions of the two strings, e.g. for acbacb and abcabc it will be cbcb and bcbc. Let us say this contains n characters.

You can now construct a "permutation graph" G which will have n nodes and an edge from i to j if S[i] == S'[j]. In the case of all unique characters, it is easy to see that the required number of swaps will be (n - number of cycles in G), which can be found out in O(n) time.

However, in the case where there are any number of duplicate characters, this reduces to the problem of finding out the largest number of cycles in a directed graph, which, I think, is NP-hard, (e.g. check out: http://www.math.ucsd.edu/~jverstra/dcig.pdf ).

In that paper a few greedy algorithms are pointed out, one of which is particularly simple:

At each step, find the minimum length cycle in the graph (e.g. Find cycle of shortest length in a directed graph with positive weights )
Delete it
Repeat until all vertexes have not been covered.

However, there may be efficient algorithms utilizing the properties of your case (the only one I can think of is that your graphs will be K-partite, where K is the number of unique characters in S). Good luck!

Edit: Please refer to David's answer for a fuller and correct explanation of the problem.

edited May 23 '17 at 12:08

Community

1
1

answered Aug 18 '13 at 01:49

Subhasis Das

1,667
13
13

Does the hardness result for cycle packing hold in the family of graphs where, for each vertex, the in-degree equals the out-degree? – David Eisenstat May 15 '14 at 15:04
I'm confused by "In the case of all unique characters, it is easy to see that the required number of swaps will be (n - number of cycles in G)" If all characters are unique, then every vertex in G is incident on exactly 1 edge, so there will be 0 cycles. Separately, does the O(n*log(n)) time refer to testing whether there are any duplicates? That can be done in linear time. – j_random_hacker May 16 '14 at 05:19
I am sorry for the confusion. Consider S = "bcdef" and S' = "cbfde". In this case, G has an edge between (1->2) (1st character in S = 2nd character in S'), (2->1), (3->4), (4->5), (5->3). So there are two cycles, (1->2->1), and (3->4->5->3). Note that each cycle can be fixed in (#elements_in_cycle - 1) steps. So you can fix the entire string in \sum_cycles{#elements - 1}, which is n-#cycles – Subhasis Das May 16 '14 at 05:37
@DavidEisenstat, I am not sure of that specific case. Let me know if you find out anything. – Subhasis Das May 16 '14 at 05:40
1

@j_random_hacker You are right. The algorithm for the cycle case is essentially equivalent to finding out strongly connected components, which is O(n). Modified my answer for that. – Subhasis Das May 16 '14 at 05:43
1

@SubhasisDas: My confusion was my own fault in the end -- I was somehow thinking that G was undirected. And I'm glad my (incorrect) guess about the time bound led to an unrelated improvement :) – j_random_hacker May 16 '14 at 05:49

score 0 · Answer 3 · answered Aug 17 '13 at 20:39

0

Do an A* search (see http://en.wikipedia.org/wiki/A-star_search_algorithm for an explanation) for the shortest path through the graph of equivalent strings from one string to the other. Use the Levenshtein distance / 2 as your cost heuristic.

answered Aug 17 '13 at 20:39

btilly

43,296
3
59
88

Finding the minimum number of swaps to convert one string to another, where the strings may have repeated characters

3 Answers3

Linked

Related