I am reading a text file that looks like this:

<tr><td>W543562</td><td>OPEN</td><td>003</td><td>4</td></tr>
<tr><td>W543563</td><td>OPEN</td><td>003</td><td>4</td></tr>
<tr><td>W543564</td><td>OPEN</td><td>003</td><td>4</td></tr>
<tr><td>W543565</td><td>OPEN</td><td>003</td><td>4</td></tr>
</tbody></table></div></div></body></html>

I am specifically interested in the W# in the first cell of each row. I want to grab that number and write it back to the text file wrapped in a hyperlink, so the file looks like this:

<tr><td><a href="https://www.website.com/Order=W543562">W543562</a></td><td>OPEN</td><td>003</td><td>4</td></tr>
<tr><td><a href="https://www.website.com/Order=W543563">W543563</a></td><td>OPEN</td><td>003</td><td>4</td></tr>
<tr><td><a href="https://www.website.com/Order=W543564">W543564</a></td><td>OPEN</td><td>003</td><td>4</td></tr>
<tr><td><a href="https://www.website.com/Order=W543565">W543565</a></td><td>OPEN</td><td>003</td><td>4</td></tr>
</tbody></table></div></div></body></html>

What I have so far is:

$text = [IO.File]::ReadAllText("C:\Temp\parse3.txt")
$url = "https://www.website.com/=W"

$Matches = [regex]::matches($text, "<td>W([\s\S]*?)</td>")
foreach ($match in $Matches)
{
    Write-Output $match.Groups[1].Value.Trim();
}

This pulls each W# and prints it on its own line, but I need to store each one in a variable and then use it to rewrite the corresponding line with $url prepended.

Ideally I could cut the code down to something like Select-String "<td>W-</td>" | Add-Content $url+w#. But as far as I can tell, Select-String does not lend itself to selecting the characters between two delimiters and trimming off the beginning and end, much less to finding a variable-length run of characters.

Any ideas?

ZebulaCodes
  • Does this answer your question? [RegEx match open tags except XHTML self-contained tags](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – Mad Physicist Mar 19 '21 at 21:19
  • @MadPhysicist There is a lot to unpack there. I was told in another post to study regex in order to solve my problem. While I am parsing through a .txt file using PowerShell to return text, it has nothing to do with the actual HTML tags. I am parsing plain text in order to do something like ```Add-Content``` back to the file. I guess, no. It does not solve my question. But nice try. – ZebulaCodes Mar 19 '21 at 21:26
  • Don't use regex to parse HTML: https://stackoverflow.com/a/1732454/1936966. Use Html Agility Pack or something like that. – filimonic Mar 22 '21 at 08:49

1 Answer

There are more efficient ways of doing this, but if long-term performance isn't an issue, you can do something like this:

$text = [IO.File]::ReadAllText("C:\Temp\parse3.txt")
$url = "https://www.website.com/Order="

[regex]::Matches($text, "W\d{6}") | % { $text = $text -replace $_.Value, "<a href=`"$url$($_.Value)`">$($_.Value)</a>" }

$text

What's going on here...

[regex]::Matches

...finds all matches

"W\d{6}"

...finds occurrences of W followed by six digits via a regex search

%

...is an alias for ForEach-Object. It pipes each match into the script block that follows, where $_ refers to the current match in the pipeline.

-replace

...is PowerShell's regex replacement operator. It replaces every occurrence of the pattern in the string on its left.

The rest just specifies the replacement value using an interpolated string. String interpolation only happens inside double quotes, so the double quotes around the href value inside the string need to be escaped with a backtick. Interpolated expressions that access a property, like $_.Value, need to be wrapped in a $(...) subexpression inside the string.

There are many other, probably better, ways to do this, but hopefully this helps.
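One such alternative, offered as a minimal sketch rather than a definitive version: a single -replace with a capture group rewrites every cell in one pass, with no loop over the matches. It also sidesteps a quirk of the loop above, where an ID that occurs more than once gets replaced again on later iterations, including inside the href that was already inserted. The sample text stands in for the file contents, and the variable names $replacement and $linked are made up for illustration.

```powershell
# Sample input standing in for the contents of C:\Temp\parse3.txt
$text = @'
<tr><td>W543562</td><td>OPEN</td><td>003</td><td>4</td></tr>
<tr><td>W543563</td><td>OPEN</td><td>003</td><td>4</td></tr>
'@
$url = 'https://www.website.com/Order='

# ${1} refers to the first capture group (the order number).
# Single quotes keep PowerShell from trying to expand ${1} itself.
$replacement = '<td><a href="' + $url + '${1}">${1}</a></td>'

# Matching the surrounding <td>...</td> confines the replacement to whole table cells.
$linked = $text -replace '<td>(W\d{6})</td>', $replacement
$linked
```

Because the pattern includes the enclosing td tags, an order number that appears elsewhere in the text is left untouched.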

Efie
  • This is fantastic. However is there a way to save the updated file with the new values? It prints it perfect in ISE, but does not update the file itself. I tried adding ```"<$url$($_.Value)`">$($_.Value)" } | Out-File C:\Temp\parse4.txt``` to the end but it saved a blank file. – ZebulaCodes Mar 19 '21 at 21:48
  • So the script block changes the value of $text but doesn't output it back into the pipeline. Simply removing ```| Out-File C:\Temp\parse4.txt``` and adding another line ```$text | Out-File C:\Temp\parse4.txt``` should do the trick. – Efie Mar 19 '21 at 21:51
  • Perfect! I ran into a ```System.OutOfMemoryException``` but I will have to work through that somehow. Thank you so much! – ZebulaCodes Mar 19 '21 at 22:01
  • How big is the file you're running this on? – Efie Mar 19 '21 at 22:02
  • It is 773KB large. – ZebulaCodes Mar 19 '21 at 22:03
  • I'm not sure if you're familiar with file streams in powershell but if your file is so big you're running into memory issues you may need to use something like that and replace 'in place' as you go, line by line. – Efie Mar 19 '21 at 22:04
  • I am not but it will be something I can check into. I've also just been made aware that some of the W#'s will look something like WEL11 and W2310OS so this will not work for them. I wish I had known that before making the post. Still, the answer you gave was perfect for the W+6 digit variation – ZebulaCodes Mar 19 '21 at 22:07
  • Unfortunately I have to go make dinner, but you should probably do something like is mentioned here for your situation, where you read chunks at a time, do your replace operations, and then dump the replaced text in memory into a file. https://stackoverflow.com/questions/2783837/find-and-replace-in-a-large-file. Good luck, edit your post if you have trouble. Someone will help you out :) – Efie Mar 19 '21 at 22:07
  • Regex supports 'or' operators so you can use ```"W\d{6}|WEL\d{2}|W\d{4}OS"``` for example to grab those as well. You have to be careful on such large files, though because if there is something in a URL that matches, etc... it will also be replaced. It does happen. You may need to make the regex more specific. Look into regex 'lookarounds' here: https://www.regular-expressions.info/lookaround.html – Efie Mar 19 '21 at 22:12
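The suggestions in these comments (alternation for the extra ID formats, lookarounds to make the match more specific, and working through the file piece by piece) can be combined. The following is only a sketch under stated assumptions: the temp file names are hypothetical stand-ins for C:\Temp\parse3.txt and C:\Temp\parse4.txt, the sample rows are invented to cover the WEL11 and W2310OS shapes mentioned above, and the three alternatives in the pattern are guesses at the real ID formats.

```powershell
# Hypothetical temp files standing in for C:\Temp\parse3.txt and C:\Temp\parse4.txt
$inPath  = Join-Path ([IO.Path]::GetTempPath()) 'parse3-sample.txt'
$outPath = Join-Path ([IO.Path]::GetTempPath()) 'parse4-sample.txt'

# Invented sample rows covering the three ID shapes from the comments
Set-Content -Path $inPath -Value @(
    '<tr><td>W543562</td><td>OPEN</td><td>003</td><td>4</td></tr>'
    '<tr><td>WEL11</td><td>OPEN</td><td>003</td><td>4</td></tr>'
    '<tr><td>W2310OS</td><td>OPEN</td><td>003</td><td>4</td></tr>'
    '</tbody></table></div></div></body></html>'
)

$url = 'https://www.website.com/Order='

# The lookbehind and lookahead require the ID to fill a whole <td> cell,
# so an ID that shows up inside a URL or other text is left alone.
$pattern     = '(?<=<td>)(?:W\d{6}|WEL\d{2}|W\d{4}OS)(?=</td>)'

# $0 is the entire match; single quotes stop PowerShell from expanding it.
$replacement = '<a href="' + $url + '$0">$0</a>'

# Get-Content emits one line at a time; each rewritten line goes to the new file.
Get-Content -Path $inPath |
    ForEach-Object { $_ -replace $pattern, $replacement } |
    Set-Content -Path $outPath

Get-Content -Path $outPath
```

Writing to a separate output file is deliberate here: reading and writing the same file in one pipeline fails because the file is still open for reading.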