
I've just started using PowerShell because I need to write a script that inserts links into all the files (.htm files) in a folder. The links connect the files to one another. I have a list of the files in the folder (the list is called list.txt and contains the file names without the extension).

In each file I want to make the following change:

From:

<tspan x="53" y="54.8">Surveillance_Err_PRG</tspan>

To:

<tspan x="53" y="54.8"><a href="C:/[...path...]/HTMs/Surveillance_Err_PRG.htm">Surveillance_Err_PRG</a></tspan>

After some research, I wrote the following code, but it does nothing (the output just displays my code):

$directory = "C:\Users\jacka\Desktop\Organigramme_PLC_prog_test\"
$list = "$directory" + "list.txt"
$htms = "$directory" + "HTMs"   

$htmFiles = Get-ChildItem $directory *.htm -rec
foreach ($file in $htmFiles)
{
    foreach($line in Get-Content $list)
    {
        if($line -match $regex)
        {
            $fichier = "$htms\"+"$line"+".htm"

            (Get-Content $file.PSPath) |
            Foreach-Object { $_ -replace "$line", "<a href=""$htms\$line"">$line</a>" } |
            Set-Content $file.PSPath
         }
         echo $fichier
    }
}

Before that, I had it like this:

foreach($line in Get-Content $list) {
    if($line -match $regex){
        $fichier = "$htms\"+"$line"+".htm"
        (Get-Content $fichier).replace("$line", "<a href=""$fichier"">$line</a>") | Set-Content $fichier
        echo $fichier
    }
}

It doesn't really work, since it only puts a link on the inner title (each .htm file displays the name of the document at the top).

I know that's a lot of information (I wanted to give as much as I could), and I'm sorry if I wasn't clear, but basically I want to make the code above work for every file in my folder.

Thanks in advance!

Maximilian Burszley
Jack
  • In your TO, there is a path. Is that coming from a variable or are you going to hard-code it? – Ranadip Dutta Apr 03 '18 at 10:29
  • @RanadipDutta In my TO? – Jack Apr 03 '18 at 10:32
  • In your "`To: Surveillance_Err_PRG`" -- What is the path ? Dynamic or static – Ranadip Dutta Apr 03 '18 at 11:01
  • [It's not recommended to use regex to parse html](https://stackoverflow.com/a/1732454/5039142). Instead consider [loading the file as html](https://stackoverflow.com/a/24989452/5039142) and [manipulating the html object](https://msdn.microsoft.com/en-us/library/office/aa219325(v=office.11).aspx) - I suspect you need `CreateElement` – G42 Apr 03 '18 at 11:06
  • @RanadipDutta Oh ok I did that because it is static so yes I hard code this for the moment because I'm still testing the code – Jack Apr 03 '18 at 11:23
  • @gms0ulman Thank you I'll try that but won't I get the same issues since I don't really know how to edit all the files in the folder ? – Jack Apr 03 '18 at 11:24

2 Answers


So I found the solution.

First, I had a problem here:

$htmFiles = Get-ChildItem $directory *.htm -rec
    foreach ($file in $configFiles)

The variable names didn't match. After fixing that, I got this error:

C:\Users\jacka\Desktop\Organigramme_PLC_prog_test\HTMs\Systeme_Filtration_Prg.htm
Get-Content : Cannot find path 'C:\Users\jacka\ChargementProg_PRG.htm' because it does not exist.
At line:22 char:14
+             (Get-Content $file) |
+              ~~~~~~~~~~~~~~~~~
    + CategoryInfo          : ObjectNotFound: (C:\Users\jacka\ChargementProg_PRG.htm:String) [Get-Content], ItemNotFoundException
    + FullyQualifiedErrorId : PathNotFound,Microsoft.PowerShell.Commands.GetContentCommand

I solved this issue by adding .FullName after $file, which stops Get-Content from looking for the file in the current directory:

$htmFiles = Get-ChildItem $directory *.htm -rec
foreach ($file in $htmFiles)
{

    foreach($line in Get-Content $list)
    {
        if($line -match $regex)
        {
            $fichier = "$directory"+"$line"+".htm"
            if ($file.FullName -ne $fichier) #to prevent header to be changed
            {
                (Get-Content $file.FullName) |
                Foreach-Object { $_ -replace "$line", "<a href=""$fichier"">$line</a>" } |
                Set-Content $file.FullName
            }
         }
    }
    echo "$($file.FullName) is done"   # $() is needed to expand a property inside a string
}
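
For anyone who wants to try this without touching real files, here is a minimal sketch of the same loop. It assumes a throwaway folder under the temp directory, and the names A_PRG and B_PRG are made up for the demo. Two things differ from the code above: the name is passed through [regex]::Escape so that -replace matches it literally, and a file's own title is skipped by comparing BaseName instead of full paths.

```powershell
# Hypothetical sandbox: two tiny .htm files plus list.txt in a fresh temp folder.
$directory = Join-Path ([System.IO.Path]::GetTempPath()) ("links_demo_" + [guid]::NewGuid().ToString('N'))
New-Item -ItemType Directory -Path $directory | Out-Null
Set-Content -Path (Join-Path $directory 'A_PRG.htm') -Value '<tspan>A_PRG</tspan><tspan>B_PRG</tspan>'
Set-Content -Path (Join-Path $directory 'B_PRG.htm') -Value '<tspan>B_PRG</tspan><tspan>A_PRG</tspan>'
Set-Content -Path (Join-Path $directory 'list.txt') -Value 'A_PRG', 'B_PRG'

$names = Get-Content -Path (Join-Path $directory 'list.txt')
foreach ($file in Get-ChildItem -Path $directory -Filter *.htm) {
    # FullName is an absolute path, so the current directory does not matter.
    $content = Get-Content -Path $file.FullName
    foreach ($name in $names) {
        if ($file.BaseName -ne $name) {                 # leave the file's own title alone
            $target  = Join-Path $directory "$name.htm"
            $pattern = [regex]::Escape($name)           # match the name literally
            $content = $content -replace $pattern, "<a href=""$target"">$name</a>"
        }
    }
    Set-Content -Path $file.FullName -Value $content
    Write-Output "$($file.FullName) is done"
}
```

Running it a second time on the same folder would wrap the names in links again, so it is meant for a single pass. It also assumes no name in the list is contained in another name.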
Jack

Since you did not include the whole files, I have created a simple source.html file:

<html>
<head>
<title>Website</title>
</head>
<body>
<tspan x="53" y="54.8">Surveillance_Err_PRG</tspan>
</body>
</html>

Next, the issue you have is parsing HTML. As noted in the comments, regular expressions are NOT a good way to parse HTML. In my view, if you have a fairly complex HTML page or website, the best solution is to use Html Agility Pack, which is a .NET library but can be used from PowerShell too.

For your example, you would get the final result like this (note: don't forget to change the path to HtmlAgilityPack.dll):

Add-Type -Path 'C:\prg_sdk\nuget\HtmlAgilityPack.1.7.2\lib\Net40-client\HtmlAgilityPack.dll'

$doc = New-Object HtmlAgilityPack.HtmlDocument
$result = $doc.Load('C:\prg\PowerShell\test\SO\source.html')

$text = $doc.DocumentNode.SelectNodes("//tspan").InnerHTML
write-host $text

$out_text = $doc.DocumentNode.SelectNodes("//tspan").OuterHTML
write-host $out_text

$element = $doc.CreateTextNode("<a href=""c:\<your_path>\HTMs\$text.htm"">$text</a>")
$doc.DocumentNode.SelectSingleNode("//tspan").InnerHTML = $element.InnerText

$changed_text = $doc.DocumentNode.SelectSingleNode("//tspan").OuterHTML
Write-host "Adjusted text:" $changed_text

write-host 'whole HTML:' $doc.DocumentNode.SelectSingleNode("//tspan").OuterHtml

# To overview whole HTML
write-host 'whole HTML:' $doc.DocumentNode.InnerHTML

Write-Host will print the result you want:

<tspan x="53" y="54.8"><a href="c:\<your_path>\HTMs\Surveillance_Err_PRG.htm">Surveillance_Err_PRG</a></tspan>
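
If a file happens to be well-formed XML, the same edit can be sketched with the built-in [xml] type, so no DLL is needed. This is a different technique from Html Agility Pack, and the well-formedness assumption often fails for real-world HTML, in which case the cast throws. The relative href below is only a placeholder for your own path.

```powershell
# Assumption: the markup is well-formed XML; otherwise the [xml] cast throws.
[xml]$doc = '<html><body><tspan x="53" y="54.8">Surveillance_Err_PRG</tspan></body></html>'

$tspan = $doc.SelectSingleNode('//tspan')
$name  = $tspan.InnerText

# Build <a href="...">name</a>; the relative href is a placeholder path.
$link = $doc.CreateElement('a')
$link.SetAttribute('href', "$name.htm")
$link.InnerText = $name

# Swap the text node inside <tspan> for the new link element.
$null = $tspan.ReplaceChild($link, $tspan.FirstChild)

$result = $tspan.OuterXml
Write-Output $result
```

Because the link is built as a node and not as a string, the attributes on tspan are left untouched.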

To find a string in a file you could do something like this (just a snippet):

$html_files= Get-ChildItem . *.htm -rec
foreach ($file in $html_files)
{
    (Get-Content $file.PSPath) |
    Foreach-Object { $_ -replace "$out_text", "$changed_text" } |
    Set-Content $file.PSPath
}
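
One thing to watch in the snippet above: -replace treats its first argument as a regular expression, so the "." in y="54.8" matches any character. A small sketch of the difference, using [regex]::Escape to match the tag literally (the tags are the ones from the question; the shortened href and the look-alike line are made up for the demo):

```powershell
$original_tag = '<tspan x="53" y="54.8">Surveillance_Err_PRG</tspan>'
$changed_tag  = '<tspan x="53" y="54.8"><a href="Surveillance_Err_PRG.htm">Surveillance_Err_PRG</a></tspan>'

# Hypothetical near miss: y="54x8" is not the tag we are looking for.
$lookalike = '<tspan x="53" y="54x8">Surveillance_Err_PRG</tspan>'

# Unescaped, "." acts as a wildcard, so the look-alike is rewritten too.
$loose  = $lookalike -replace $original_tag, $changed_tag

# Escaped, only the exact text matches and the look-alike is left as it was.
$strict = $lookalike -replace [regex]::Escape($original_tag), $changed_tag
$exact  = $original_tag -replace [regex]::Escape($original_tag), $changed_tag
```

The string method .Replace() is another option, since it always matches literally.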

To put it together, you will have to loop over all the .htm files and apply the replacement from the examples above. If you want me to complete it, you would have to give me a whole example file. I have done it on my test file.

All together it looks like this:

Add-Type -Path 'C:\prg_sdk\nuget\HtmlAgilityPack.1.7.2\lib\Net40-client\HtmlAgilityPack.dll'

$doc = New-Object HtmlAgilityPack.HtmlDocument
$result = $doc.Load('C:\prg\PowerShell\test\SO\source.html')

$text = $doc.DocumentNode.SelectNodes("//tspan").InnerHTML

$original_tag = $doc.DocumentNode.SelectNodes("//tspan").OuterHTML

$element = $doc.CreateTextNode("<a href=""c:\<your_path>\HTMs\$text.htm"">$text</a>")
$doc.DocumentNode.SelectSingleNode("//tspan").InnerHTML = $element.InnerText

$changed_tag = $doc.DocumentNode.SelectSingleNode("//tspan").OuterHTML

$html_files= Get-ChildItem . *.htm -rec
foreach ($file in $html_files)
{
    (Get-Content $file.PSPath) |
    Foreach-Object { $_ -replace "$original_tag", "$changed_tag" } |
    Set-Content $file.PSPath
}

I hope the source code is clear; I have tried to make it readable (don't forget to change all the variables).

tukan
  • Thank you ! That's a very complete answer, I don't understand everything at the moment but I'll dig into it to fully understand how this works ! Thank you for the time you took ! – Jack Apr 03 '18 at 13:45
  • @JackA: please take your time digging into it. If you have questions don't hesitate to ask. When you think it answers your question don't forget to accept the answer - https://stackoverflow.com/help/someone-answers – tukan Apr 03 '18 at 13:52