0

I have many string entries (this are namespace/class trees) that look like the following:

appsystem
appsystem.applications
appsystem.applications.APPactivities
appsystem.applications.APPmanager
appsystem.applications.APPmodels
appsystem.applications.MAPmanager
appsystem.applications.MAPmanager.maphub
appsystem.applications.MAPmanager.mapmanager
appsystem.applications.pagealertsmanager
appsystem.authentication
appsystem.authentication.manager
appsystem.authentication.manager.encryptionmanager
appsystem.authentication.manager.sso
appsystem.authentication.manager.tokenmanager

But, I need the final output to be like:

{
    "name": "appsystem",
    "children": [
        {
        "name": "applications",
        "children": [
            {"name": "APPactivities"},
            {"name": "APPmanager"},
            {"name": "APPmodels"},
            {"name": "MAPmanager",
                "children": [
                    {"name": "maphub"},
                    {"name": "mapmanager"}

                ]},
            {"name": "pagealertsmanager"}   
            ]
        },
        {
        "name": "authentication",
        "children": [
            {"name": "manager",
                "children": [
                    {"name": "encryptionmanager"},
                    {"name": "sso"},
                    {"name": "tokenmanager"}
                ]}
            ]
        }
    ]
}

The total nodes can be any number.

I am assuming I am going to need recursion but I am at a loss on where even to begin.

wergeld
  • 14,332
  • 8
  • 51
  • 81
  • I answered, but I'm still downvoting because you've just gone "here's a problem, I need this". – TessellatingHeckler Oct 26 '16 at 18:35
  • @TessellatingHeckler, Yea, I know. I was not able to find examples of anything like this - most JSON object examples are reading a JSON item into PS. Was at my wit's end. – wergeld Oct 26 '16 at 18:42
  • I've seen a very similar question on here, one I tried to answer but couldn't get working at the time - other people did answer, which might be a useful reference - but I can't find it now, too many results for 'powershell' and 'json'. – TessellatingHeckler Oct 26 '16 at 19:13

2 Answers2

3

This builds up nested lists, PowerShell ConvertTo-JSON flattens the outer list.

You can change the $Line in $s to $line in (Get-Content input.txt).

But I think this does it:

$s = @'
appsystem
appsystem.applications
appsystem.applications.APPactivities
appsystem.applications.APPmanager
appsystem.applications.APPmodels
appsystem.applications.MAPmanager
appsystem.applications.MAPmanager.maphub
appsystem.applications.MAPmanager.mapmanager
appsystem.applications.pagealertsmanager
appsystem.authentication
appsystem.authentication.manager
appsystem.authentication.manager.encryptionmanager
appsystem.authentication.manager.sso
appsystem.authentication.manager.tokenmanager
'@ -split "`r`n"

$TreeRoot = New-Object System.Collections.ArrayList

foreach ($Line in $s) {

    $CurrentDepth = $TreeRoot

    $RemainingChunks = $Line.Split('.')
    while ($RemainingChunks)
    {

        # If there is a dictionary at this depth then use it, otherwise create one.
        $Item = $CurrentDepth | Where-Object {$_.name -eq $RemainingChunks[0]}
        if (-not $Item)
        {
            $Item = @{name=$RemainingChunks[0]}
            $null = $CurrentDepth.Add($Item)
        }

        # If there will be child nodes, look for a 'children' node, or create one.
        if ($RemainingChunks.Count -gt 1)
        {
            if (-not $Item.ContainsKey('children'))
            {
                $Item['children'] = New-Object System.Collections.ArrayList
            }

            $CurrentDepth = $Item['children']
        }

        $RemainingChunks = $RemainingChunks[1..$RemainingChunks.Count]
    }
}

$TreeRoot | ConvertTo-Json -Depth 1000

Edit: It's too slow? I tried some random pausing profiling and found (not too surprisingly) that it's the inner nested loop, which searches children arrays for matching child nodes, which is being hit too many times.

This is a redesigned version which still builds the tree, and this time it also builds a TreeMap hashtable of shortcuts into the tree, to all the previously build nodes, so it can jump right too them instead of searching the children lists for them.

I made a testing file, some 20k random lines. Original code processed it in 108 seconds, this one does it in 1.5 seconds and the output matches.

$TreeRoot = New-Object System.Collections.ArrayList
$TreeMap = @{}

foreach ($line in (Get-Content d:\out.txt)) {

    $_ = ".$line"    # easier if the lines start with a dot

    if ($TreeMap.ContainsKey($_))    # Skip duplicate lines
    { 
        continue
    }

    # build a subtree from the right. a.b.c.d.e -> e  then  d->e  then  c->d->e
    # keep going until base 'a.b' reduces to something already in the tree, connect new bit to that.
    $LineSubTree = $null
    $TreeConnectionPoint = $null

    do {
        $lastDotPos = $_.LastIndexOf('.')
        $leaf = $_.Substring($lastDotPos + 1)
        $_ = $_.Substring(0, $lastDotPos)

        # push the leaf on top of the growing subtree
        $LineSubTree = if ($LineSubTree) {
                           @{"name"=$leaf; "children"=([System.Collections.ArrayList]@($LineSubTree))}
                       } else { 
                           @{"name"=$leaf}
                       }

        $TreeMap["$_.$leaf"] = $LineSubTree

    } while (!($TreeConnectionPoint = $TreeMap[$_]) -and $_)


    # Now we have a branch built to connect in to the existing tree
    # but is there somewhere to put it?
    if ($TreeConnectionPoint)
    {
        if ($TreeConnectionPoint.ContainsKey('children'))
        {
            $null = $TreeConnectionPoint['children'].Add($LineSubTree)
        } else {
            $TreeConnectionPoint['children'] = [System.Collections.ArrayList]@($LineSubTree)
        }
    } else
    {           # nowhere to put it, this is a new root level connection
        $null = $TreeRoot.Add($LineSubTree)
    }
}

$TreeRoot | ConvertTo-Json -Depth 100

(@mklement0's code takes 103 seconds and produces a wildly different output - 5.4M characters of JSON instead of 10.1M characters of JSON. [Edit: because my code allows multiple root nodes in a list which my test file has, and their code does not allow that])


Auto-generated PS help links from my codeblock (if available):

Community
  • 1
  • 1
TessellatingHeckler
  • 27,511
  • 4
  • 48
  • 87
  • Okay! This is cool. I have a lot to read up on to digest this. I have not used PS to this extent before. Did initial test with sample file and this appears to have spit out what was needed (ie - no crashes in d3js page). – wergeld Oct 26 '16 at 18:43
  • Nicely done; the `ConvertTo-Json -Depth` value seems to be capped at a `100`, though (at least in PS v5.1), so `1000` won't work. – mklement0 Oct 27 '16 at 04:55
  • 1
    @mklement0 I see the error now in PS v5. It's not there in PSv4, `1000` runs without complaint. (I didn't test if it works to a depth of 1000) – TessellatingHeckler Oct 27 '16 at 05:40
  • Not sure if 1000 is needed for me at least. I added - Compress because I have no need for pretty output as well. – wergeld Oct 28 '16 at 13:43
  • Interesting findings, and great idea to use a hashtable to speed things up (though it undoubtedly adds complexity). I've since amended my solution to handle multiple root nodes as well, and I've applied some optimizations, albeit without introducing auxiliary data structures. I guess the timing will depend a lot on the specifics of the input tree structure (number of sibling nodes on each level). – mklement0 Oct 30 '16 at 06:18
  • 2
    @TessellatingHeckler, your second version is now the reigning champion. It read through 129k endpoint nodes in under 1 minute. More than satisfactory for our purposes. – wergeld Oct 31 '16 at 11:50
2

To complement TessellatingHeckler's great answer with an alternative implementation that uses a recursive function.

The emphasis is on modularity and terseness, not performance.[1]

# Outer function that loops over all paths and builds up a one or more nested
# hashtables reflecting the path hierarchy, which are converted to JSON on output.
# Note that only a single JSON object is output if all paths share the same root
# component; otherwise, a JSON *array* is output.
function convert-PathsToNestedJsonObject([string[]] $paths) {
  $hts = New-Object Collections.ArrayList
  $paths.ForEach({
    $rootName = $_.split('.')[0] 
    $ht = $hts.Where({ $_.name -eq $rootName }, 'First')[0]
    if (-not $ht) { [void] $hts.Add(($ht = @{})) }
    convert-PathToNestedHashtable $ht $_ 
  })
  $hts | ConvertTo-Json -Depth 100
}

# Recursive helper function that takes a path such as "appsystem.applications"
# and converts it into a nested hashtable with keys "name" and "children" to
# reflect the path hierarchy. 
function convert-PathToNestedHashtable([hashtable] $ht, [string] $path) {
  $name, $childName, $rest = $path -split '\.', 3
  $ht.name = $name
  if ($childName) {
    if ($ht.children) { 
      $htChild = $ht.children.Where({ $_.name -eq $childName }, 'First')[0]
    } else {
      $ht.children = New-Object Collections.ArrayList
      $htChild = $null
    }
    if (-not $htChild) {      
      [void] $ht.children.Add(($htChild = @{}))
    }
    convert-PathToNestedHashtable $htChild "$childName.$rest" 
  }
}

# Call the outer function with the input paths (assumed to be stored in $paths).
convert-PathsToNestedJsonObject $paths

[1] One deliberate type of optimization is applied, which, however, still keeps the code terse:

PSv4+ offers the (little-known) array methods .ForEach() and .Where(), which are not only noticeably faster than their cmdlet counterparts ForEach-Object and Where-Object, but also offer additional features.

Specifically:

  • $paths.ForEach({ ... }) is used instead of
    $paths | ForEach-Object { ... }

  • $ht.children.Where({ $_.name -eq $childName }, 'First')[0] is used instead of
    $ht.children | Where-Object { $_.name -eq $childName } | Select-Object -First 1

mklement0
  • 382,024
  • 64
  • 607
  • 775
  • Interesting. Will test this today. TessellatingHeckler's code works but is slow. Took about 1.5 hours to generate (we have a very large enterprise application base). – wergeld Oct 28 '16 at 12:32
  • Okay, this one is much, much faster. It takes about 10 minutes to generate the JSON vs 1.5+ hours. – wergeld Oct 28 '16 at 14:01
  • @wergeld That's a nice surprise, thanks for letting us know. There a definitely ways to make this faster still. – mklement0 Oct 28 '16 at 14:06
  • 1
    @wergeld I've edited my answer with a faster redesign. On a test file it drops from 108 seconds to 1.5 seconds, so I'm curious what it does on your data and if it produces the same output. @mklement0 That is very elegant! Nice. I was trying to think how to do a recursive solution and didn't come up with anything. I'm surprised you aren't hitting the same performance as my first version because `$htChild = $ht.children | ? { $_.name -eq $childName }` was the pattern where I thought mine was really slowing down. (but in my test you are getting similar performance, which is a bit odd). – TessellatingHeckler Oct 28 '16 at 22:57
  • @TessellatingHeckler: Thanks. I've optimized the `? { $_.name -eq $childName }` somewhat, using the `.Where()` extension method with mode `First` (see my update), though in the interest of brevity I didn't introduce a helper date structure (a hashtable to keep track of sibling names) as you did - kudos for that; sounds like it speeds up things drastically. – mklement0 Oct 30 '16 at 06:15