Combining pairs in a string (Matlab)

Question

I have a string:

 sup_pairs = 'BA CE DF EF AE FC GD DA CG EA AB BG'

How can I combine pairs which have the last character of 1 pair is the first character of the follow pairs into strings? And the new strings must contain all of the character 'A','B','C','D','E','F' , 'G', those characters are appeared in the sup_pairs string. The expected output should be:

 S1 = 'BAEFCGD' % because BA will be followed by AE in sup_pairs string, so we combine BAE, and so on...we continue the rule to generate S1

 S2 = 'DFCEABG'

If I have AB, BC and BD, the generated strings should be both : ABC and ABD . If there is any repeated character in the pairs like : AB BC CA CE . We will skip the second A , and we get ABCE .

What have you tried so far ? Your problem seems to be quite lacking in precisions ... What can you do if there are multiple choices available ? If you have AB, BC and BD, how could you possibly know which one choose ? Can't you simplify your problem somehow ? — Peut22, Aug 13 '15 at 06:21
If I have AB, BC and BD, the generated strings should be both : ABC and ABD . — kgk, Aug 13 '15 at 09:14
But if the string starting as 'ABC' ends up not containing all of the letters while ABD does ? How do you plan to know that ? I'm afraid you won't have many answers if you don't change the way you're presenting your problem. — Peut22, Aug 13 '15 at 11:46
Your question lacks sufficient examples to properly give you an answer. Going by your logic, you would merge `AB` and `BC` together so that you have `AB` then `BC BD` after that point. Please add more examples so that we can derive a general rule. The last statement in your problem suggests that there are exceptions to your initial statement, and that's unfortunately going to make any solution unwieldy. — rayryeng, Aug 13 '15 at 13:59

beaker · Accepted Answer · 2017-01-10T05:16:59.823

This, like all good things in life, is a graph problem. Each letter is a node, and each pair is an edge.

First we must transform your string of pairs into a numeric format so we can use the letters as subscripts. I will use A=2, B=3, ..., G=8:

sup_pairs = 'BA CE DF EF AE FC GD DA CG EA AB BG';
p=strsplit(sup_pairs,' ');
m=cell2mat(p(:));
m=m-'?';
A=sparse(m(:,1),m(:,2),1);

The sparse matrix A is now the adjacency matrix (actually, more like an adjacency list) representing our pairs. If you look at the full matrix of A, it looks like this:

>> full(A)
ans =
   0   0   0   0   0   0   0   0
   0   0   1   0   0   1   0   0
   0   1   0   0   0   0   0   1
   0   0   0   0   0   1   0   1
   0   1   0   0   0   0   1   0
   0   1   0   0   0   0   1   0
   0   0   0   1   0   0   0   0
   0   0   0   0   1   0   0   0

As you can see, the edge BA, which translates to subscript (3,2) is equal to 1.

Now you can use your favorite implementation of Depth-first Search (DFS) to perform a traversal of the graph from your starting node of choice. Each path from the root to a leaf node represents a valid string. You then transform the path back into your letter sequence:

treepath=[3,2,6,7,4,8,5];
S1=char(treepath+'?');

Output:
S1 = BAEFCGD

Here's a recursive implementation of DFS to get you going. Normally in MATLAB you have to worry about not hitting the default limitation on recursion depth, but you're finding Hamiltonian paths here, which is NP-complete. If you ever get anywhere near the recursion limit, the computation time will be so huge that increasing the depth will be the least of your worries.

function full_paths = dft_all(A, current_path)
   % A - adjacency matrix of graph
   % current_path - initially just the start node (root)
   % full_paths - cell array containing all paths from initial root to a leaf

   n = size(A, 1);   % number of nodes in graph 
   full_paths = cell(1,0);   % return cell array 

   unvisited_mask = ones(1, n);
   unvisited_mask(current_path) = 0;   % mask off already visited nodes (path)
   % multiply mask by array of nodes accessible from last node in path
   unvisited_nodes = find(A(current_path(end), :) .* unvisited_mask);

   % add restriction on length of paths to keep (numel == n)
   if isempty(unvisited_nodes) && (numel(current_path) == n)
      full_paths = {current_path};   % we've found a leaf node
      return;
   end

   % otherwise, still more nodes to search
   for node = unvisited_nodes
      new_path = dft_all(A, [current_path node]);   % add new node and search
      if ~isempty(new_path)   % if this produces a new path...
         full_paths = {full_paths{1,:}, new_path{1,:}};   % add it to output 
      end
   end

end

This is a normal Depth-first traversal except for the added condition on the length of the path in line 15:

   if isempty(unvisited_nodes) && (numel(current_path) == n)

The first half of the if condition, isempty(unvisited_nodes) is standard. If you only use this part of the condition you'll get all paths from the start node to a leaf, regardless of path length. (Hence the cell array output.) The second half, (numel(current_path) == n) enforces the length of the path.

I took a shortcut here because n is the number of nodes in the adjacency matrix, which in the sample case is 8 rather than 7, the number of characters in your alphabet. But there are no edges into or out of node 1 because I was apparently planning on using a trick that I never got around to telling you about. Rather than run DFS starting from each of the nodes to get all of the paths, you can make a dummy node (in this case node 1) and create an edge from it to all of the other real nodes. Then you just call DFS once on node 1 and you get all the paths. Here's the updated adjacency matrix:

A =
   0   1   1   1   1   1   1   1
   0   0   1   0   0   1   0   0
   0   1   0   0   0   0   0   1
   0   0   0   0   0   1   0   1
   0   1   0   0   0   0   1   0
   0   1   0   0   0   0   1   0
   0   0   0   1   0   0   0   0
   0   0   0   0   1   0   0   0

If you don't want to use this trick, you can change the condition to n-1, or change the adjacency matrix not to include node 1. Note that if you do leave node 1 in, you need to remove it from the resulting paths.

Here's the output of the function using the updated matrix:

>> dft_all(A, 1)
ans =
{
  [1,1] =

     1   2   3   8   5   7   4   6

  [1,2] =

     1   3   2   6   7   4   8   5

  [1,3] =

     1   3   8   5   2   6   7   4

  [1,4] =

     1   3   8   5   7   4   6   2

  [1,5] =

     1   4   6   2   3   8   5   7

  [1,6] =

     1   5   7   4   6   2   3   8

  [1,7] =

     1   6   2   3   8   5   7   4

  [1,8] =

     1   6   7   4   8   5   2   3

  [1,9] =

     1   7   4   6   2   3   8   5

  [1,10] =

     1   8   5   7   4   6   2   3

}

@kgk You would simply perform a second DFS starting from `D`. There is a way to generate all of the possible strings at once, which is apparently what I had in mind when I started the node numbering from `2` leaving `1` available as a dummy node, but it would require modifying the DFS function to "forget" previously visited nodes as it backtracked up the tree. — beaker, Sep 13 '15 at 13:46
Does Matlab have the DFS function? Or May I try a DFS code into this program? — kgk, Sep 29 '15 at 14:10
@kgk There is [`graphtraverse`](http://www.mathworks.com/help/bioinfo/ref/graphtraverse.html) in the Bioinformatics Toolbox. If you don't have the Bioinformatics Toolbox, there are several versions on Mathworks File Exchange. — beaker, Sep 29 '15 at 15:20
Thanks ! I saw a dfsearch function in R2015b. However, I can not implement it in R2014b. — kgk, Sep 29 '15 at 15:44
I think one of our users implemented DFS for an answer on another post. Let me look. — beaker, Sep 29 '15 at 15:46
@kgk Here it is: http://stackoverflow.com/questions/26332883/how-to-find-all-connected-components-in-a-binary-image-in-matlab/26333226#26333226. I haven't tried the code, but it is certainly worth a try. — beaker, Sep 29 '15 at 15:50
@ That code runs well. However, I don't know how can we get something like treepath=[3,2,6,7,4,8,5] from the paths in the code in that link ? — kgk, Jan 10 '17 at 02:31
@kgk You're right, that implementation is geared more toward connected components than identifying paths. I'm pretty sure I have a dfs implementation lying around, let me have a look. — beaker, Jan 10 '17 at 03:03
Just to be clear, the paths you want are the ones that visit every node exactly once, right? So each path should contain `'A':'G'`. — beaker, Jan 10 '17 at 03:11
Yes, I would like fine all of the paths which contain 'A':'G'. Thanks — kgk, Jan 10 '17 at 03:22
The implementation I had didn't keep all the paths, it just created the depth-first tree, so I implemented a recursive version. That was the easiest way to have it "forget" that it had previously visited nodes in another path. I haven't had time to do extensive testing, but on basic data sets it works properly. Let me know if you have any problems. — beaker, Jan 10 '17 at 05:13

Combining pairs in a string (Matlab)

1 Answers1

Linked