I wrote a simple script that goes through an XML file, takes random lines containing a specified tag (url), and strips everything that surrounds the link itself:

$importPath = "C:\PATH\feed.xml"

# get links
$link = Select-String '<loc>' $importPath

$count = 20

# randomize
$link = Get-Random -InputObject $link  -Count $count 

#strip
$link1 = $link -replace ".*<loc>" -replace "</loc>"

$rez = $link1 -join("`n") 

Write-Host $rez -ForegroundColor Green

This works. However, I wonder if there is any way to improve this part, so I don't have to manually adjust it for each feed:

$link1 = $link -replace ".*<loc>" -replace "</loc>"

Since the tag can vary in both name and length, I figured I could use the angle brackets (which are constant in every feed) to mark where to start trimming:

$link1 = $link -replace ".*<" -replace "<.*"

This obviously doesn't work, since nothing distinguishes which bracket should count as the first one and which as the second.

For example:

<tagnamethatvaries>https://somesite.com/somepath</tagnamethatvaries>

If I use

$link1 = $link -replace ".*<" -replace "<.*" 

I get

/tagnamethatvaries>

Is there any way to target a specific occurrence of the same character in a string whose length varies?

Nergal

3 Answers

I cannot comment to get further information due to my reputation not being high enough.

If you are trying to trim the end of the string based on the location of a later occurrence of a character, you can do it with Substring and IndexOf. IndexOf("<", 2) starts searching at index 2, so it skips the opening < and finds the one that begins the closing tag:

$link = "<tagnamethatvaries>https://somesite.com/somepath</tagnamethatvaries>"
$link1 = $link.Substring(0, $link.IndexOf("<",2))

This gives the result of:

<tagnamethatvaries>https://somesite.com/somepath

This removes the opening tag:

$link = "<tagnamethatvaries>https://somesite.com/somepath</tagnamethatvaries>"
# With only a start index, Substring returns everything from that index to the end.
$link1 = $link.Substring($link.IndexOf(">") + 1)

Result is

https://somesite.com/somepath</tagnamethatvaries>
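
Combining the two calls should give just the link. This is a minimal sketch, assuming the line holds exactly one opening and one closing tag; `$start`, `$end`, and `$url` are names introduced here for illustration only:

```powershell
$link = "<tagnamethatvaries>https://somesite.com/somepath</tagnamethatvaries>"

# Index just past the first '>' (the end of the opening tag).
$start = $link.IndexOf(">") + 1

# Index of the next '<' at or after that point (the start of the closing tag).
$end = $link.IndexOf("<", $start)

# Substring takes a start index and a length, so the length is $end - $start.
$url = $link.Substring($start, $end - $start)
$url
```

Because the second argument of Substring is a length rather than an end index, subtracting `$start` keeps this working when the line has leading whitespace.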

Hopefully this helps.

CraftyB

It looks like you are trying to get the content between XML tags.
There is a simpler way to do that, using a regex match with capture groups.

Assuming $feed is your feed.xml content, running the following script:

$feed = @(
"<foo>foo-link1</foo>"
"<bar>bar-link2</bar>")

foreach ($link in $feed) { 
    if ($link -match "<.*>(.*)<.*>") { 
        Write-Host $Matches[1] 
    } 
}

Would write to your console:

foo-link1
bar-link2
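
One caveat: the pattern above relies on backtracking through greedy `.*`, so on a line with nested tags such as `<url><loc>...</loc></url>` it can end up capturing an empty string. A possible tightening, offered as a sketch rather than a drop-in fix, is to exclude brackets from the captured text and require the closing tag to repeat the opening name. It assumes tags carry no attributes; the second sample line and the names `$pattern` and `$found` are made up for this example:

```powershell
$feed = @(
    "<foo>foo-link1</foo>"
    "  <url><loc>https://somesite.com/somepath</loc></url>"
)

# Group 1: tag name (no brackets, slashes, or whitespace).
# Group 2: text that contains no brackets at all.
# \1 forces the closing tag to use the same name as the opening tag.
$pattern = '<([^<>/\s]+)>([^<>]+)</\1>'

$found = foreach ($line in $feed) {
    if ($line -match $pattern) {
        $Matches[2]
    }
}
$found
```

With this pattern the innermost tag that directly wraps text is the one that matches, whatever it is called.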

You can also extend this to capture only the tags you are interested in.

$feed = @(
"<foo>foo-link1</foo>"
"<bar>bar-link2</bar>")

$tagsToFind = @(
"foo"
"bar"
)

foreach ($link in $feed) { 
    foreach ($tag in $tagsToFind){
        if ($link -match "<$tag>(.*)</$tag>") { 
            Write-Host $Matches[1] 
        } 
    }
}
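
If the list of tags grows, the inner loop could probably be folded into a single pattern. The following is only a sketch of that idea: the `<baz>` line and the names `$alternation`, `$pattern`, and `$results` are assumptions added for illustration.

```powershell
$feed = @(
    "<foo>foo-link1</foo>"
    "<bar>bar-link2</bar>"
    "<baz>not-wanted</baz>"
)

$tagsToFind = @("foo", "bar")

# Escape each name in case it contains regex metacharacters, then join with '|'.
$alternation = ($tagsToFind | ForEach-Object { [regex]::Escape($_) }) -join '|'

# Group 1 is whichever tag matched; \1 requires the same name in the closing tag.
$pattern = "<($alternation)>(.*?)</\1>"

$results = foreach ($line in $feed) {
    if ($line -match $pattern) {
        $Matches[2]
    }
}
$results
```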
JoeBigToe

In general, it is better to use XML tools to work with XML files.
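
As a rough illustration of the XML route, the sketch below parses a small in-memory document with the `[xml]` type accelerator and selects every leaf element regardless of its tag name. The sample content is hypothetical; a real feed would be read from disk (for example with `Get-Content -Raw`) before the cast, and `$sample`, `$doc`, `$links`, and `$picked` are names chosen for this example.

```powershell
# Hypothetical sample standing in for the contents of feed.xml.
$sample = @'
<urlset>
  <url><loc>https://somesite.com/a</loc></url>
  <url><loc>https://somesite.com/b</loc></url>
  <url><loc>https://somesite.com/c</loc></url>
</urlset>
'@

$doc = [xml]$sample

# '//*[not(*)]' selects elements that have no child elements (leaves),
# so the tag name never has to be spelled out.
$links = $doc.SelectNodes('//*[not(*)]') |
    ForEach-Object { $_.InnerText.Trim() } |
    Where-Object { $_ -like 'http*' }

# Pick a random subset, as in the original script.
$picked = $links | Get-Random -Count 2
$picked
```

Selecting by structure rather than by name sidesteps both the varying tag names and any XML namespace the feed may declare.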

If you nevertheless need to, I'd use a regex with lookarounds and a backreference, so that the closing tag must carry the same name as the opening one (just with the / in front). Passing it to Select-String extracts the bare links directly:

Select-String  "C:\PATH\feed.xml" -Pattern '(?<=<([^>]+>))(http[^<]+)(?=</\1)' | 
    ForEach-Object {$_.Matches.Groups[2].Value} | Get-Random -Count 20

Where:

(?<=<([^>]+>))

is a positive lookbehind (?<= that matches a literal < followed by one or more characters (as many as possible) that are not a >, then the closing >. The part in parentheses, the tag name plus its >, forms the 1st capture group, which is later used as the backreference \1.

(http[^<]+)

captures the link starting with http and ending before the closing tag.

(?=</\1)

is a positive lookahead (?= that requires </ followed by the content of the 1st capture group (the tag name and its >).

The matches returned by Select-String (alias sls) are iterated with ForEach-Object and reduced to the links from the 2nd capture group.
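
To try the pattern without a file, the same pipeline can be fed strings directly. This is a small sketch on made-up input lines; `$lines`, `$pattern`, and `$links` are names introduced for the example.

```powershell
# Hypothetical lines standing in for feed.xml content.
$lines = @(
    "<loc>https://somesite.com/a</loc>"
    "<link>https://somesite.com/b</link>"
    "<loc>not-a-link</loc>"
)

$pattern = '(?<=<([^>]+>))(http[^<]+)(?=</\1)'

# Select-String also accepts strings from the pipeline; lines without a match are dropped.
$links = $lines | Select-String -Pattern $pattern |
    ForEach-Object { $_.Matches[0].Groups[2].Value }
$links
```

The third line is skipped because its content does not start with http, which shows that the pattern filters on the link as well as on the tag structure.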