Sorting Algorithm for XML attribute mutual value pairs

Question

Premise:

I am trying to find or rather come up with an algorithm to parse several XML files and extract a launch sequence saved in FROM=XX, and TO=YY node attributes.

There are hundreds of records and can be split into pairs
Each would have FROM and TO value
Each pair's TO value is an indicator that there is a new pair having it as a FROM value
so the pairs would chain and create continuous set of FROM-TO values.

The tricky part is, that the pairing can split, branch, into multiple ones and join at certain point.

The XML:

<FLOW>
    <TASK FROM ="B" TO ="C"/>
    <TASK FROM ="C" TO ="E1"/>
    <TASK FROM ="C" TO ="F1"/>
    <TASK FROM ="A1" TO ="A2"/>
    <TASK FROM ="A2" TO ="A3"/>
    <TASK FROM ="A3" TO ="C"/>
    <TASK FROM ="C" TO ="D1"/>
    <TASK FROM ="D2" TO ="D3"/>
    <TASK FROM ="D1" TO ="D2"/>
    <TASK FROM ="E1" TO ="E2"/>
    <TASK FROM ="Start" TO ="B"/>
    <TASK FROM ="E2" TO ="E3"/>
    <TASK FROM ="F1" TO ="F2"/>
    <TASK FROM ="F2" TO ="F3"/>
    <TASK FROM ="F3" TO ="G"/>
    <TASK FROM ="Start" TO ="A1"/>
    <TASK FROM ="G" TO ="Finish"/>
    <TASK FROM ="E3" TO ="G"/>
    <TASK FROM ="D3" TO ="G"/>
</FLOW>

I can help to visualise that like the following diagram:

    Start
  /        \  
 [A1]       |
  |         |
 [A2]      [B]
  |         |
 [A3]       |
   \       /
      [C]
   /   |    \ 
[D1]  [E1]  [F1]      
 |     |     |
[D2]  [E2]  [F2]
 |     |     |
[D3]  [E3]  [F3]
 \     |    /
      [G]
       |
     Finish

Desired output:

Start, A1, A2, A3, B, C, D1, D2, D3, E1, E2, E3, F1, F2, F3, G, Finish

Problem:

I have this code running but I can't get it to work with correct order and to overcome the splits.

<# 
    INIT Values, starting pair is always considered as combination of tasks where FROM is 'Start'
    All task are loaded in pairs, and the sequence begining is assembled.
#>

$Counter = [int]1
$Pairs = $File.SelectNodes('/FLOW/TASK[@FROM="Start"]')
$Tasks = $File.SelectNodes("/FLOW/TASK")

$Sequence = @()
ForEach ($Pair in $Pairs) {
    $Sequence += $Pair.TOTASK
}

<#
    Scan the tasks and find the matching pair for initial task pair, save it to the output array. 
#>

Do {
    ForEach ($Task in $Tasks) {
        ## Main loop counter, on matching pairs
        If ($Pairs.FROM -eq $Task.FROM) {
               $Counter++ 
        }
        ## Find current pair's TO in task and flag it as next pair 
        If ($Pairs.TO -eq $Task.FROM) {
            $NextTask = $Task.FROM
            $NextPairs = $File.SelectNodes("/FLOW/TASK[@FROM='$NextTask']")
            $Sequence += $Task.TO
        }
    }

    ## Set new pair to be evaluated
    $Pairs = $NextPairs
} 
While ($Counter -le $Tasks.Count)

What decides whether "A1, A2, A3" or "B" comes first in the output? Can the output be also [Start, A1, B, A2, A3, C, ...]? — astraujums, Feb 15 '18 at 17:25
Yes, in fact that are parallel tasks, it can go like that. The rule is that A3 and B has to be before C. The sorting of desired output was merely 'logical' sequence, as if I would first finish one branch, then join another. Although, the loops will interpret that whatever comes first. — JimB, Feb 15 '18 at 18:27

score 1 · Accepted Answer · edited Jun 20 '20 at 09:12

Granted it took me a while to come up with an optimized solution for you. Here goes:

Solution (Graphs!)

Ok, why don't you approach this the same way you would in an OOP language:

# Use a hashtable for O(1) lookup for a node by name
$Script:NodeTracker = @{}
class TaskNode {
    #==PROPS==================================================|
    [System.Collections.ArrayList]$then = @()
    [String] $Val
    [Bool]$Visited = $false
    [Collections.ArrayList]$Parent = @()
    #==CONSTRUCTORS===========================================|
    TaskNode([String]$Val) {$this.constructor($Val, $null)}
    TaskNode([String]$Val, [TaskNode]$Parent) {$this.constructor($Val, $Parent)}
    hidden constructor([String]$Val, [TaskNode]$Parent) {
        $This.Val = $Val
        if ($Parent) {$This.Parents.Add($Parent)}
        $Script:NodeTracker.$Val = $This
    }

    #==METHODS================================================|
    [TaskNode] To([String]$Val) {
        $Node = $Script:NodeTracker.$Val

        # If node doesn't exist, create and track
        if (!$Node) {$Node = New-Object TaskNode $Val}
        $This.then.Add($Node)
        $Node.Parent.Add($This)
        return $Node
    }
    [String] ToString() {return "$($this.val)-$(if($this.Visited){'✓'}else{'✘'})"}
    static [String] ReverseModdedBFS([Collections.Queue]$Queue) {
        $Output = ""
        [Collections.Queue]$NextQueue = @()
        do {
            while ($Queue.Count) {
                $Root = $Queue.Dequeue()
                # Write-Host "Root: $Root | Queue: [$Queue] | Next-Queue [$NextQueue] | Non-Visited [$($Root.then | {!$_.Visited})]"
                if ($Root.Visited) {continue}
                if ($Root.Then.Count -gt 1 -and ($Root.then | {!$_.Visited})) {$NextQueue.Enqueue($Root);continue}
                $Root.Visited = $true
                $Output += ','+$Root.Val
                $Root.Parent | % {
                    # Write-Host "    Enqueuing $_"
                    $Queue.Enqueue($_)
                }
            }
            If ($Queue.Count -eq 1 -and !$Queue.Peek().Parent.count) {break}
            $Queue = $NextQueue
            $NextQueue = @()
        } while($Queue.Count)
        $Out = $Output.Substring(1).split(',')
        [Array]::Reverse($Out)
        return $Out -join ','
    }
    #==STATICS=================================================|
    static [TaskNode] Fetch([String]$Val) {
        $Node = $Script:NodeTracker.$Val
        # If node doesn't exist, create and track
        if (!$Node) {return (New-Object TaskNode $Val)}
        return $Node
    }
    static [TaskNode[]] GetAll() {
        return @($Script:NodeTracker.Values)
    }
    static [TaskNode] GetStart() {
        $Nodes = [TaskNode]::GetAll() | {$_.Parent.count -eq 0}
        if ($Nodes.Count -gt 1) {throw 'There is more than one starting node!'}
        if (!$Nodes.Count) {throw 'There is no starting node!'}
        return @($Nodes)[0]
    }
    static [TaskNode[]] GetEnds() {
        $Nodes = [TaskNode]::GetAll() | {$_.Then.count -eq 0}
        if (!$Nodes.Count) {throw 'There is no ending node!'}
        return @($Nodes)
    }
}

function Parse-Edge($From, $To) {
    # Use the safe static accessor so that it will fetch the singleton instance of the node, or create and return one!
    [TaskNode]::Fetch($From).To($To)
}

function XML-Main([xml]$XML) {
    @($XML.Flow.Task) | % {Parse-Edge $_.From $_.To}
    [TaskNode]::ReverseModdedBFS([TaskNode]::GetEnds())
}

Test for proof!

I tested it like the following:

#Create or Find root node 'O'
$Root = [TaskNode]::Fetch('O')

# Set up Chains! Please draw to test
$root.To('A').To('B').To('C').To('H').To('Z').To('M')
$Root.To('D').To('E').To('C').To('H').To('I').To('M')
$Root.To('F').To('G').To('C').To('H').To('J').To('M')
[TaskNode]::Fetch('C').To('K').To('L').To('M')

# Run BFS!
[TaskNode]::ReverseModdedBFS([TaskNode]::GetEnds())

Output of Test

Root: M-✘ | Queue: [] | Next-Queue [] | Non-Visited []
    Enqueuing Z-✘
    Enqueuing I-✘
    Enqueuing J-✘
    Enqueuing L-✘
Root: Z-✘ | Queue: [I-✘ J-✘ L-✘] | Next-Queue [] | Non-Visited []
    Enqueuing H-✘
Root: I-✘ | Queue: [J-✘ L-✘ H-✘] | Next-Queue [] | Non-Visited []
    Enqueuing H-✘
Root: J-✘ | Queue: [L-✘ H-✘ H-✘] | Next-Queue [] | Non-Visited []
    Enqueuing H-✘
Root: L-✘ | Queue: [H-✘ H-✘ H-✘] | Next-Queue [] | Non-Visited []
    Enqueuing K-✘
Root: H-✘ | Queue: [H-✘ H-✘ K-✘] | Next-Queue [] | Non-Visited []
    Enqueuing C-✘
    Enqueuing C-✘
    Enqueuing C-✘
Root: H-✓ | Queue: [H-✓ K-✘ C-✘ C-✘ C-✘] | Next-Queue [] | Non-Visited []
Root: H-✓ | Queue: [K-✘ C-✘ C-✘ C-✘] | Next-Queue [] | Non-Visited []
Root: K-✘ | Queue: [C-✘ C-✘ C-✘] | Next-Queue [] | Non-Visited []
    Enqueuing C-✘
Root: C-✘ | Queue: [C-✘ C-✘ C-✘] | Next-Queue [] | Non-Visited []
    Enqueuing B-✘
    Enqueuing E-✘
    Enqueuing G-✘
Root: C-✓ | Queue: [C-✓ C-✓ B-✘ E-✘ G-✘] | Next-Queue [] | Non-Visited []
Root: C-✓ | Queue: [C-✓ B-✘ E-✘ G-✘] | Next-Queue [] | Non-Visited []
Root: C-✓ | Queue: [B-✘ E-✘ G-✘] | Next-Queue [] | Non-Visited []
Root: B-✘ | Queue: [E-✘ G-✘] | Next-Queue [] | Non-Visited []
    Enqueuing A-✘
Root: E-✘ | Queue: [G-✘ A-✘] | Next-Queue [] | Non-Visited []
    Enqueuing D-✘
Root: G-✘ | Queue: [A-✘ D-✘] | Next-Queue [] | Non-Visited []
    Enqueuing F-✘
Root: A-✘ | Queue: [D-✘ F-✘] | Next-Queue [] | Non-Visited []
    Enqueuing O-✘
Root: D-✘ | Queue: [F-✘ O-✘] | Next-Queue [] | Non-Visited []
    Enqueuing O-✘
Root: F-✘ | Queue: [O-✘ O-✘] | Next-Queue [] | Non-Visited []
    Enqueuing O-✘
Root: O-✘ | Queue: [O-✘ O-✘] | Next-Queue [] | Non-Visited []
Root: O-✓ | Queue: [O-✓] | Next-Queue [] | Non-Visited []
Root: O-✓ | Queue: [] | Next-Queue [] | Non-Visited []
O,F,D,A,G,E,B,C,K,H,L,J,I,Z,M

Explanation, and Algo

We use a graph to plot all edges to each other using some nifty OOP tricks. Then we traverse the graph in reverse from all sink nodes (ie. nodes that have no children). We keep doing BFS till we hit a node that:

Has more than 1 child
AND, has more than 0 non-visited descendants
If so, add that for the next round of BFS!

Repeat this till your current and future queues are empty, in which case, your output is now complete. Now:

Split by comma
Reverse array (since we did a reverse traversal)
Print output!

That looks interesting, BFS is definitely the type of sorting I was looking for. I will try that out and see how that behaves and I will play with the joining branches at certain point. — JimB, Feb 16 '18 at 11:48
Sures thing! I made the API pretty intuitive, so you can chain the `.To('')`'s to create chains of whatever sizes. As they all internally use the singleton pattern. It should also work for N random forks and varying depths. — AP., Feb 16 '18 at 16:52
@JimB if this worked for you can you please upvote and mark as Answer? A lot of time was put into creating this solution — AP., Feb 19 '18 at 21:04
Sure, I do that, I am trying to adapt this solution to the specifics of my case, but I think I am on a good way! Thank you. — JimB, Feb 21 '18 at 07:41

G42 · Answer 2 · 2018-02-15T23:24:33.787

This script will print out what you want. It's self-contained so you can copy it all and run it. There's room for improvement!

One thing: This assumes one split at a time, like in your sample. If this scenario is possible, more logic is required:

 |     |       |
 |     |       |
[D3]  [E3]   [F3]
 \     | \    /
  \    | [H] /              - spit from C is not resolved, F1, F2 and F3 not handled
   \   | /  /  
      [G]
       |
     Finish

$script:xml = [xml] @"
<FLOW>
    <TASK FROM ="B" TO ="C"/>
    <TASK FROM ="C" TO ="E1"/>
    <TASK FROM ="C" TO ="F1"/>
    <TASK FROM ="A1" TO ="A2"/>
    <TASK FROM ="A2" TO ="A3"/>
    <TASK FROM ="A3" TO ="C"/>
    <TASK FROM ="C" TO ="D1"/>
    <TASK FROM ="D2" TO ="D3"/>
    <TASK FROM ="D1" TO ="D2"/>
    <TASK FROM ="E1" TO ="E2"/>
    <TASK FROM ="Start" TO ="B"/>
    <TASK FROM ="E2" TO ="E3"/>
    <TASK FROM ="F1" TO ="F2"/>
    <TASK FROM ="F2" TO ="F3"/>
    <TASK FROM ="F3" TO ="G"/>
    <TASK FROM ="Start" TO ="A1"/>
    <TASK FROM ="G" TO ="Finish"/>
    <TASK FROM ="E3" TO ="G"/>
    <TASK FROM ="D3" TO ="G"/>
</FLOW>
"@

Function GetNextNode {
    param($node)

    $nextNode = $xml.FLOW.TASK |
                    Where-Object {$_.FROM -eq $node.TO} |
                    Sort-Object TO
    return $nextNode

}

Function GetPrevNode {
    param($node)

    $nextNode = $xml.FLOW.TASK |
                    Where-Object {$_.TO -eq $node.FROM} |
                    Sort-Object TO
    return $nextNode

}

$startNode   = $xml.FLOW.TASK | Where-Object {$_.FROM -eq "Start"} | Sort-Object TO

do{
    # deal with the start node as it's a special case
    if(-not [string]::IsNullOrEmpty($startNode)){

        # start node is the start of the chain
        [array]$mainChain = $startNode[0]

        # if you have two or more nodes, store them for now. They will be referenced later
        if($startNode.Count -gt 1){
            [array]$splitterNode  = $startNode
        }

        # take the first start node and find out which node it leads to
        [array]$nextNode = GetNextNode -node $startNode[0]

        # add one of the nodes. set $oldNode for next iteration
        [array]$mainChain += $nextNode[0]
        [array]$oldNode  = $nextNode[0]

        # this is only for the first node, don't run through again
        $startNode = $null
        continue

    }


    # get the next node AND previus nodes for that next node
    [array]$nextNode   = GetNextNode -node $oldNode[0]
    [array]$prevNode   = GetPrevNode -node $nextNode[0]

    if($prevNode.Count -ne 1){

        # if there are multiple previous nodes, go back and deal with them
        # to do this, go back to the $splitterNode
        if(-not [string]::IsNullOrEmpty($splitterNode)){

            # exclude anything already added
            [array]$remainingNodes = $splitterNode | Where-Object {$_ -notin $mainChain}

            # if that leaves only one node, others have been dealt with
            # there is no longer a split
            if($remainingNodes.Count -eq 1){               
                $splitterNode = $null
            }

            [array]$oldNode = $remainingNodes[0]
            $mainChain += $remainingNodes[0]
            continue

        }else{

            # if there is no $splitterNode, all previous nodes should already be in the chain
            # check
            foreach($node in $prevNode){
                if($node -notin $mainChain){
                    Write-Host "Broken chain"
                    Exit
                }
            }

        }
    }

    # if this node is a splitter, set it so it can be referenced later
    if($nextNode.Count -gt 1){
        $splitterNode = $nextNode
    }

    # add this node to the chain. 
    [array]$mainChain += $nextNode[0]
    [array]$oldNode = $nextNode[0]

}while($oldNode.To -ne "Finish")

($mainChain.FROM | Select-Object -Unique) + "Finish" | Out-Host

Thank you, this is easier for me to read. I will examine the options and how to treat all possibilities. I.e. the ones you suggested, or when the Start chains and splits later on n-th node, rather then on the beginning etc. — JimB, Feb 16 '18 at 12:08

astraujums · Answer 3 · 2018-03-04T18:19:47.897

Your data seem to be subject to partial order and having the greatest and least elements. What you need is a topological sort.

One way to do it (in C++) is below. This is depth-first-search, compared to breadth-first-search in the accepted answer. This might be easier to read.

struct Node
{
    Node(string name)
    {
        this->name = name;
        visited = false;
    }
    string name;
    deque<Node*> next;
    bool visited;
};

void visit(Node* node, deque<Node*>& sorted_nodes)
{
    if (node->visited)
    {
        return;
    }
    for (auto n : node->next)
    {
        visit(n, sorted_nodes);
    }
    node->visited = true;
    sorted_nodes.push_front(node);
}

deque<Node*> serialize_dag(Node* root)
{
    deque<Node*> sorted_nodes;    
    visit(root, sorted_nodes);
    return sorted_nodes;
}