1

During synchronization of data between two systems, the data needs to be processed in a specific order due to dependencies in the target system. Data retrieved from the source system is not ordered in the needed order.

Using Sort-Object to sort alphabetically won`t work in this case:

$a | Sort-Object

Hence, a regular expression which extracts letters and numbers to use for sorting purposes is needed.

Sample data illustrating the data structure:

$a = @('K1 Lorem ipsum','K2-2-2-1 Lorem ipsum','K1-1 Lorem ipsum','K2-7 Lorem ipsum','K1-1-1 Lorem ipsum','K1-4-2-8 Lorem ipsum','K2a Lorem ipsum','K2a-1 Lorem ipsum','K2a-1-1 Lorem ipsum','K2a-1-1-1 Lorem ipsum','K2-2-2-2 Lorem ipsum')

$a | Sort-Object

Required output:

  • K1 Lorem ipsum
  • K1-1 Lorem ipsum
  • K1-1-1 Lorem ipsum
  • K1-4-2-8 Lorem ipsum
  • K2-2-2-1 Lorem ipsum
  • K2-2-2-2 Lorem ipsum
  • K2a Lorem ipsum
  • K2a-1 Lorem ipsum
  • K2a-1-1 Lorem ipsum
  • K2a-1-1-1 Lorem ipsum
  • K2-7 Lorem ipsum

Actual output:

  • K1 Lorem ipsum
  • K1-1 Lorem ipsum
  • K1-1-1 Lorem ipsum
  • K1-4-2-8 Lorem ipsum
  • K2-2-2-1 Lorem ipsum
  • K2-2-2-2 Lorem ipsum
  • K2-7 Lorem ipsum
  • K2a Lorem ipsum
  • K2a-1 Lorem ipsum
  • K2a-1-1 Lorem ipsum
  • K2a-1-1-1 Lorem ipsum

Any pointers to get started with the regular expression needed to create a custom object for sorting purposes would be appreciated.

Update 2019-10-10:

Thanks guys, and sorry for the lack of information.

What I had tried before posting, but did not include as I felt I was going into a wrong direction:

$a = @('K1 Lorem ipsum','K2-2-2-1 Lorem ipsum','K1-1 Lorem ipsum','K2-7 Lorem ipsum','K1-1-1 Lorem ipsum','K1-4-2-8 Lorem ipsum','K2a Lorem ipsum','K2a-1 Lorem ipsum','K2a-1-1 Lorem ipsum','K2a-1-1-1 Lorem ipsum','K2-2-2-2 Lorem ipsum')

Foreach($b in $a){
    $null = $b.Name -match '(\D+)(\d+)'
    [PSCustomObject]@{
        Original = $b.Name
        Letters = $Matches[1]
        Number1 = [int]$Matches[2]
    }
}

I`ll make sure to include it anyway the next time.

Here is a screenshot from the actual source structure (a Sharepoint Term Store).

Required output should be the same as the screenshot:

  • K1 Lorem ipsum
  • K1-1 Lorem ipsum
  • K1-1-1 Lorem ipsum
  • K1-4-2-8 Lorem ipsum
  • K2a Lorem ipsum
  • K2-5 Lorem ipsum
  • K2-5a Lorem ipsum
  • K2-5b Lorem ipsum
  • K2-5c Lorem ipsum
  • K2-5c-1 Lorem ipsum
  • K2-5c-2a Lorem ipsum
  • K2-5c-2b Lorem ipsum
  • K2-5c-2b-1 Lorem ipsum
  • K2-5c-2b-2 Lorem ipsum
  • K2-6 Lorem ipsum
  • K2-7 Lorem ipsum
  • K2b Lorem ipsum
  • K2b-1 Lorem ipsum
  • K2b-1-1 Lorem ipsum
  • K2b-1-2 Lorem ipsum

The same list in PowerShell (randomized)

$a = @(

'K2-5c-2b-1 Lorem ipsum'
'K1-1-1 Lorem ipsum'
'K1-4-2-8 Lorem ipsum'
'K2-5b Lorem ipsum'
'K2-7 Lorem ipsum'
'K2a Lorem ipsum'
'K2-5 Lorem ipsum'
'K2-5c-2b Lorem ipsum'
'K1-1 Lorem ipsum'
'K2-5a Lorem ipsum'
'K1 Lorem ipsum'
'K2-5c Lorem ipsum'
'K2-6 Lorem ipsum'
'K2-5c-1 Lorem ipsum'
'K2b-1-1 Lorem ipsum'
'K2-5c-2a Lorem ipsum'
'K2-5c-2b-2 Lorem ipsum'
'K2b Lorem ipsum'
'K2b-1 Lorem ipsum'
'K2b-1-2 Lorem ipsum'
)

The sorting criterias:

  • Numbers in tokens (such as K2) should be sorted numerically
  • Sort on number followed by letter. For example, K2a should be sorted before K3. Also, K2b should be sorted after K2a, but before K3.
  • Depth matters, for example K2a comes before K2-5.
  • When on the same depth level, numbers without a following letter should come first. For example, K2-5 comes before K2-5a.
  • 1
    [regex-tutorial-a-simple-cheatsheet-by-examples](https://medium.com/factory-mind/regex-tutorial-a-simple-cheatsheet-by-examples-649dc1c3f285) is a good tutorial and [regex101.com](https://regex101.com) is good at testing / experimenting – T-Me Oct 10 '19 at 12:36
  • 2
    I think you should explain what are the criteria for sorting. For example, I have no idea why `K2a-*` is before `K2-7` but after `K2-2-*`. Also, showing what you already tried (meaning you put some effort) usually increases chances for getting better answers. – Robert Dyjas Oct 10 '19 at 12:42
  • 1
    Specifically, please clarify the following: should the numbers in tokens such as `K2` also be sorted numerically? What about `K2a` then? Should sorting account for the number-only tokens (e.g. `1`, `2`, `4`) _individually_, or should something like `4-2-8` be sorted as `428`? With individual sorting, should missing tokens be considered `0`? – mklement0 Oct 10 '19 at 14:55

4 Answers4

0

The exact requirements are unclear as of this writing, but in the spirit of:

Any pointers to get started with the regular expression needed to create a custom object for sorting purposes would be appreciated

the following solution does the following:

  • For sorting purposes, it transforms each input string with number-only components 0-left-padded to a fixed width, so that sorting by the resulting string effectively sorts the embedded number numerically.

    • For instance, input string K1-4-2-8 Lorem ipsum is transformed into K1-0004-0002-0008 Lorem ipsum.

PowerShell Core solution:

$a | Sort-Object { $_ -replace '(?<=-)\d+', { '{0:0000}' -f [int] $_.Value  } }

Adjust the number of 0s for padding based as needed, based on how many digits the numbers can have at most.

If only the first space-separated token is relevant for sorting (e.g., K1-4-2-8, but not Lorem ipsum), you can trim the input string by providing $_ -replace ' .*' instead of just $_ as the LHS of the -replace operation.


In Windows PowerShell, the -replace operator doesn't support a script blocks as the replacement operand, so you have to use the [regex] .NET type directly:

$a | Sort-Object { 
    [regex]::Replace($_, '(?<=-)\d+', { param($m) '{0:0000}' -f [int] $m.Value })
  }
mklement0
  • 382,024
  • 64
  • 607
  • 775
  • 1
    Note that I have now updated the original post with more information, such as a link to a screenshot of the structure in the source data. Thanks for the starting point mklement0. As you can see in my updated post, depth also matters. However, I have a recursive function which loops through the source data and adds depth information. The last part should then be to add support for sorting per depth level, which should be easy to implement in the recursive function. I`ll work some more on this and let you know how it goes. – Jan Egil Ring Oct 11 '19 at 05:13
0

Your Required output does not follow the given rules, in particular the rule

  • Depth matters, for example K2a comes before K2-5

is not compatible with the desired order

K2a Lorem ipsum
K2-5 Lorem ipsum
K2-6 Lorem ipsum
K2-7 Lorem ipsum
K2b Lorem ipsum

The screenshot shows that the data is organized in a tree structure. Obviously the structure has an intrinsic layout that is impossible to reconstruct just by sorting by whatever rules.

Armali
  • 18,255
  • 14
  • 57
  • 171
  • 1
    I just talked to the business owner for the system, and the K2 level is special with regards to how it is designed. It`s the only level with letters (e.g. K2a, K2b) in the first token, and it also contains some items which should be cross-referenced to both K2a and K2b. Hence, you are right in this being impossible to reconstruct. However, for the other levels the structure is properly defined, and the regex provided by James seems to do the trick. It was ok to skip the K2 level in the automated process due to this situation, which makes it possible to reach the desired output. – Jan Egil Ring Oct 11 '19 at 09:35
0

It is painful because you need to break off each of the terms but this

$regex1=[regex]"(?i)^[a-z]+" 
$regex2=[regex]"(?i)(?<=^[a-z]+)\d+"
$regex3=[regex]"(?i)(?<=^[a-z]+\d+)[a-z]+" 
$regex4=[regex]"(?i)(?<=^[a-z]+\d+[a-z]*-)\d+" 
$regex5=[regex]"(?i)(?<=^[a-z]+\d+[a-z]*-\d+)[a-z]+" 
$regex6=[regex]"(?i)(?<=^[a-z]+\d+[a-z]*-\d+[a-z]*-)\d+"
$regex7=[regex]"(?i)(?<=^[a-z]+\d+[a-z]*-\d+[a-z]*-\d+)." 



$a = @(

'K2-5c-2b-1 Lorem ipsum'
'K1-1-1 Lorem ipsum'
'K1-4-2-8 Lorem ipsum'
'K2-5b Lorem ipsum'
'K2-7 Lorem ipsum'
'K2a Lorem ipsum'
'K2-5 Lorem ipsum'
'K2-5c-2b Lorem ipsum'
'K1-1 Lorem ipsum'
'K7-1 Lorem ipsum'
'K2-5a Lorem ipsum'
'K1 Lorem ipsum'
'K2-5c Lorem ipsum'
'K2-6 Lorem ipsum'
'K2-5c-1 Lorem ipsum'
'K2b-1-1 Lorem ipsum'
'K2-5c-2a Lorem ipsum'
'K2-5c-2b-2 Lorem ipsum'
'K2b Lorem ipsum'
'K2b-1 Lorem ipsum'
'K2b-1-2 Lorem ipsum'
) | sort @{e={$regex1.Match($_).Value}}, @{e={$regex2.Match($_).Value -as [int]} },  
         @{e={$regex3.Match($_).Value}}, @{e={$regex4.Match($_).Value -as [int]} }, 
         @{e={$regex5.Match($_).Value}}, @{e={$regex6.Match($_).Value -as [int]}} ,
         @{e={$regex7.Match($_).Value}}

[edit - first version didn't have as int so 2 digit numbers would not sort properly] produces this

K1 Lorem ipsum
K1-1 Lorem ipsum
K1-1-1 Lorem ipsum
K1-4-2-8 Lorem ipsum
K2-5 Lorem ipsum
K2-5a Lorem ipsum
K2-5b Lorem ipsum
K2-5c Lorem ipsum
K2-5c-1 Lorem ipsum
K2-5c-2a Lorem ipsum
K2-5c-2b-2 Lorem ipsum
K2-5c-2b-1 Lorem ipsum
K2-5c-2b Lorem ipsum
K2-6 Lorem ipsum
K2-7 Lorem ipsum
K2a Lorem ipsum
K2b Lorem ipsum
K2b-1 Lorem ipsum
K2b-1-1 Lorem ipsum
K2b-1-2 Lorem ipsum
K7-1 Lorem ipsum
  • This is awesome James, based on some initial testing it looks good. The only thing I noticed is that K2-5c-2b-2 should come after K2-5c-2b-1. – Jan Egil Ring Oct 11 '19 at 08:41
  • Yes, I only got as a far as testing on the first 7 , so K2-5c-2b-2xxxx all look the same, If you change the . at the end of the last one to [a-z]+ and then add an 8 and 9 extending in the same way as the previous ones did then you should be good - but I'd go with the second one I suggested. – James O'Neill Oct 12 '19 at 19:58
0

After posting that long answer I was went for a shave. And thought like this. "They shouldn't use 5a, 5b, it would all sort out if they used 5-a, 5-b" So this will place "-" over the any spot with a number behind and a letter in front.

sort @{e={$_ -replace "(?<=\d)(?=[a-z])","-"}}

It will handle multiple letters, but multiple digits will be handled incorrectly so this

sort @{e={$_ -replace "(?<=\d)(?=[a-z])","-" -replace "(?<=\D)(?=\d\D|\d$)","0"}}

does the same thing and says if you find a place preceded by a non-digit,and followed by either a digit and a non-digit, or a digit at the end of the string Then put a 0 in that space. if you have triple digits you can add another one which looks for double digits

-replace  "(?<=\D)(?=\d\d\D|\d\d$)","0"