0

Problem Statement

Please take a look at this previous question of mine. I'm trying to achieve something similar to it, but this time with more advanced criteria.

Put simply, I need to add a child nodes (XML tags) under their parents <NETTOTAL>. The child node text content consists of 8-digit numbers extracted from the same XML file. Those numbers are being extracted and stored in an array for later processing as you will see in the script below.

About the Script

The existing script works, but I suspect that the loop logic is wrong. I need it to pick and place one XML tag with it's corresponding 8-digit number under each parent, not pick, loop, and place the same exact child.

Original XML File Contents

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<EXPORT>
    <IMPORTMODEL>NEX</IMPORTMODEL>
    <SESSION>1000061</SESSION>
    <CUSTORDERS>
        <RECORD CODE="NX0100103">
            <VATMODE>X</VATMODE>
            <INPUTDATE>26/07/2017</INPUTDATE>
            <NETTOTAL>97.40</NETTOTAL>
            <DOCLINES>
                <LINE>
                    <LINETYPE>M</LINETYPE>
                    <ITEMDESC>Salesperson: firstName1 lastName1 (43700006)</ITEMDESC>
                </LINE>
            </DOCLINES>
        </RECORD>
        <RECORD CODE="NX0100104">
            <VATMODE>X</VATMODE>
            <INPUTDATE>26/07/2017</INPUTDATE>
            <NETTOTAL>38.20</NETTOTAL>
            <DOCLINES>
                <LINE>
                    <LINETYPE>M</LINETYPE>
                    <ITEMDESC>Salesperson: firstName2 lastName2 (43100015)</ITEMDESC>
                </LINE>
            </DOCLINES>
        </RECORD>
        <RECORD CODE="NX0100105">
            <VATMODE>X</VATMODE>
            <INPUTDATE>26/07/2017</INPUTDATE>
            <NETTOTAL>63.00</NETTOTAL>
            <DOCLINES>
                <LINE>
                    <LINETYPE>M</LINETYPE>
                    <ITEMDESC>Salesperson: firstName3 lastName3 (43100014)</ITEMDESC>
                </LINE>
            </DOCLINES>
        </RECORD>
        <RECORD CODE="NX0100106">
            <VATMODE>X</VATMODE>
            <INPUTDATE>26/07/2017</INPUTDATE>
            <NETTOTAL>55.00</NETTOTAL>
            <DOCLINES>
                <LINE>
                    <LINETYPE>M</LINETYPE>
                    <ITEMDESC>Salesperson: firstName2 lastName2 (43100015)</ITEMDESC>
                </LINE>
            </DOCLINES>
        </RECORD>
    </CUSTORDERS>
</EXPORT>

Desired Goal

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<EXPORT>
    <IMPORTMODEL>NEX</IMPORTMODEL>
    <SESSION>1000061</SESSION>
    <CUSTORDERS>
        <RECORD CODE="NX0100103">
            <VATMODE>X</VATMODE>
            <INPUTDATE>26/07/2017</INPUTDATE>
            <NETTOTAL>97.40</NETTOTAL>
            <SALESMAN>43700006</SALESMAN>
            <DOCLINES>
                <LINE>
                    <LINETYPE>M</LINETYPE>
                    <ITEMDESC>Salesperson: firstName1 lastName1 (43700006)</ITEMDESC>
                </LINE>
            </DOCLINES>
        </RECORD>
        <RECORD CODE="NX0100104">
            <VATMODE>X</VATMODE>
            <INPUTDATE>26/07/2017</INPUTDATE>
            <NETTOTAL>38.20</NETTOTAL>
            <SALESMAN>43100015</SALESMAN>
            <DOCLINES>
                <LINE>
                    <LINETYPE>M</LINETYPE>
                    <ITEMDESC>Salesperson: firstName2 lastName2 (43100015)</ITEMDESC>
                </LINE>
            </DOCLINES>
        </RECORD>
        <RECORD CODE="NX0100105">
            <VATMODE>X</VATMODE>
            <INPUTDATE>26/07/2017</INPUTDATE>
            <NETTOTAL>63.00</NETTOTAL>
            <SALESMAN>43100014</SALESMAN>
            <DOCLINES>
                <LINE>
                    <LINETYPE>M</LINETYPE>
                    <ITEMDESC>Salesperson: firstName3 lastName3 (43100014)</ITEMDESC>
                </LINE>
            </DOCLINES>
        </RECORD>
        <RECORD CODE="NX0100106">
            <VATMODE>X</VATMODE>
            <INPUTDATE>26/07/2017</INPUTDATE>
            <NETTOTAL>55.00</NETTOTAL>
            <SALESMAN>43100015</SALESMAN>
            <DOCLINES>
                <LINE>
                    <LINETYPE>M</LINETYPE>
                    <ITEMDESC>Salesperson: firstName2 lastName2 (43100015)</ITEMDESC>
                </LINE>
            </DOCLINES>
        </RECORD>
    </CUSTORDERS>
</EXPORT>

Script

$xmlFilesLocation = "C:\XML_dumping"

cd $xmlFilesLocation

$netTotalRegEx = "(<NETTOTAL>\d{1,30}\.\d{1,2}<\/NETTOTAL>)"
$salesManRegEx = "(<SALESMAN>\d{8}<\/SALESMAN>)"

$beginTag = "`t`t`t<SALESMAN>"
$endTag = "</SALESMAN>"

$files = Get-ChildItem -Path $xmlFilesLocation -Filter *.xml

$numberOfFiles = (Get-ChildItem -Path $xmlFilesLocation -Filter *.xml | Measure-Object).Count

# First, loop through all files separately to check if <SALESMAN>[code]</SALESMAN> exists, and skip if true
for ($i=1; $i -le $numberOfFiles; $i++) {
    $content = (Get-Content $files[$i - 1] -Raw)

    # Skip file if <SALESMAN>[code]</SALESMAN> is detected in it
    if ($content -match $salesManRegEx) { break }
}

# Then, loop through all files (again) separately to check if <SALESMAN>[code]</SALESMAN> is missing, and process if true
for ($j=1; $j -le $numberOfFiles; $j++) {
    $content = (Get-Content $files[$j - 1] -Raw)

    # If <SALESMAN>[code]</SALESMAN> is missing in the file
    if ($content -notmatch $salesManRegEx) {
        $contentArray = @()

        # Hold all the content, but split from the brackets
        $contentArray = $content
        $contentArray = $contentArray.Split("()")
        # Now split by line to extract the salesman codes into an array.
        # Example: [43700006, 43100015, 43100014, 43100015]
        $contentArray = $contentArray.Split("")

        for ($k=1; $k -le $contentArray.Length; $k++) {
            # if the salesman code is found...
            if ($contentArray[$k] -match "^\d{8}$") {
                if ($content -notmatch $salesManRegEx) {
                    # Construct the full tag
                    $fullSalesManTag = $beginTag + $contentArray[$k] + $endTag

                    # ...then replace in $content the regular expression with $fullSalesManTag and insert it directly underneath NETTOTAL line
                    $content= [regex]::Replace($content, $netTotalRegEx, ('$1' + "`n" + "$fullSalesManTag"))

                    $content | Out-File -Encoding UTF8 $files[$j - 1]
                }
            }
        }
    }
}

Current Output

The output is clearly showing that it's only adding the last element in the array index. That's when the loop has ended. I understand why this is happening, but I can't wrap my head around a solution to correct the logic.

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<EXPORT>
    <IMPORTMODEL>NEX</IMPORTMODEL>
    <SESSION>1000061</SESSION>
    <CUSTORDERS>
        <RECORD CODE="NX0100103">
            <VATMODE>X</VATMODE>
            <INPUTDATE>26/07/2017</INPUTDATE>
            <NETTOTAL>97.40</NETTOTAL>
            <SALESMAN>43700006</SALESMAN>
            <DOCLINES>
                <LINE>
                    <LINETYPE>M</LINETYPE>
                    <ITEMDESC>Salesperson: firstName1 lastName1 (43700006)</ITEMDESC>
                </LINE>
            </DOCLINES>
        </RECORD>
        <RECORD CODE="NX0100104">
            <VATMODE>X</VATMODE>
            <INPUTDATE>26/07/2017</INPUTDATE>
            <NETTOTAL>38.20</NETTOTAL>
            <SALESMAN>43700006</SALESMAN>
            <DOCLINES>
                <LINE>
                    <LINETYPE>M</LINETYPE>
                    <ITEMDESC>Salesperson: firstName2 lastName2 (43100015)</ITEMDESC>
                </LINE>
            </DOCLINES>
        </RECORD>
        <RECORD CODE="NX0100105">
            <VATMODE>X</VATMODE>
            <INPUTDATE>26/07/2017</INPUTDATE>
            <NETTOTAL>63.00</NETTOTAL>
            <SALESMAN>43700006</SALESMAN>
            <DOCLINES>
                <LINE>
                    <LINETYPE>M</LINETYPE>
                    <ITEMDESC>Salesperson: firstName3 lastName3 (43100014)</ITEMDESC>
                </LINE>
            </DOCLINES>
        </RECORD>
        <RECORD CODE="NX0100106">
            <VATMODE>X</VATMODE>
            <INPUTDATE>26/07/2017</INPUTDATE>
            <NETTOTAL>55.00</NETTOTAL>
            <SALESMAN>43700006</SALESMAN>
            <DOCLINES>
                <LINE>
                    <LINETYPE>M</LINETYPE>
                    <ITEMDESC>Salesperson: firstName2 lastName2 (43100015)</ITEMDESC>
                </LINE>
            </DOCLINES>
        </RECORD>
    </CUSTORDERS>
</EXPORT>
Ansgar Wiechers
  • 193,178
  • 25
  • 254
  • 328
Ugo
  • 159
  • 1
  • 12
  • @kjhughes The question ["Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms"](https://stackoverflow.com/q/6751105/1630171) is certainly related I don't agree that it's a duplicate. This question is about re-inserting information from an XML file as new nodes, but not necessarily about doing it via regular expressions. – Ansgar Wiechers Aug 24 '17 at 19:50
  • @AnsgarWiechers: Anyone trying to parse XML with regex needs to be told to use a real XML parser (as you've done below +1, btw), and further discussion really ought to be squelched because this topic has been beaten to death. I could close against any number of questions, but the one I chose, although not a specific duplicate, is a good general explanation for the general misconception behind this whole class of misguided questions. – kjhughes Aug 24 '17 at 19:59
  • I certainly agree with educating people about why attempts to parse XML with regular expressions is misguided. I just prefer telling people what to *do* in addition to telling them what *not* to do. The question describes a rather straightforward, answerable problem, with an honest (although misguided) attempt to solve it, so I believe an actual answer is going to be more helpful than a mere discussion (however thorough) of why the original approach was misguided. – Ansgar Wiechers Aug 24 '17 at 20:12
  • @AnsgarWiechers: Closing the question doesn't detract from your admirable and helpful answer, but I'll yield and leave this open as you wish. Cheers. – kjhughes Aug 24 '17 at 20:20

1 Answers1

4

Do not parse XML with regex. Every time you do a rainbow unicorn dies.

But seriously, in most cases regular expressions are the wrong tool for working with XML files. If you're interested, the answers to this question (thanks to kjhughes for the link) discuss the issues with the regex approach in depth.

Use a proper XML parser and a couple XPath expressions to extract the salesperson ID and add it as a new node:

$xmlfile = 'C:\path\to\your.xml'

[xml]$xml = Get-Content $xmlfile

$xml.SelectNodes('//RECORD') | ForEach-Object {
  $id = $_.SelectSingleNode('.//ITEMDESC').'#text' -replace '.*\((\d+)\).*', '$1'

  $sibling = $_.SelectSingleNode('./NETTOTAL')

  $node = $xml.CreateElement('SALESMAN')
  $node.InnerText = $id
  $_.InsertAfter($node, $sibling)
}

$xml.Save($xmlfile)
Ansgar Wiechers
  • 193,178
  • 25
  • 254
  • 328