Find a string in a file, run a script on it, and replace the original string with the new string using PowerShell

Question

Goal

Using PowerShell, find a string in a file, run a simple transformation script on the string, and replace the original string with the new string in the same file

Details

The file is a Markdown file with one or more HTML blocks inside.
The goal is to make the entire file Markdown with no HTML.
Pandoc is a command-line document converter that can transform HTML to Markdown.
The transformation script is a single Pandoc command.
Pandoc alone cannot transform a Markdown file that includes HTML to Markdown.
Each HTML block is one long string with no line breaks (see example below).
The HTML is a little rough and sometimes not valid; despite this, Pandoc handles much of the transformation successfully. This may not be relevant.
I cannot change the fact that the file is generated originally as part Markdown/part HTML, that the HTML is sometimes invalid, or that each HTML block is all on one line.
PowerShell is required because that's the scripting language my team supports.

Example file of mixed Markdown/HTML code; most HTML is invalid

# Heading 1
Text

# Heading 2
<h3>Heading 3</h3><p>I am all on one line</h><span><div>I am not always valid HTML</div></span><br><h4>Heading 4<h4><ul><li>Item<br></li><li>Item</li><ul><span></span><img src="url" style="width:85px;">

# Heading 3
Text

# Heading 4
<h2>Heading 1</h2><div>Text</div><h2>Heading 2</h2><div>Text</div>

# Heading 5
<div><ul><li>Item</li><li>Item</li><li>Item</li></ul></div><code><pre><code><div>Code line 1</div><div>Code line 2</div><div>Code line 3</div></code></pre></code>

Text

Code for transformation script

pandoc -f html -t 'markdown_strict-raw_html-native_divs-native_spans-bracketed_spans' --atx-headers

Attempts

I surrounded each HTML block with a <start> and <end> tag so that I could extract the text between those tags with a regex, run the Pandoc script on it, and replace the original text. My plan was to use a foreach loop to iterate through the blocks one by one.

This attempt transforms the HTML to Markdown, but does not return the original Markdown with it:

$file = 'file.md'
$regex = '<start>.*?<end>'
$a = Get-Content $file -Raw
$a | Select-String $regex -AllMatches | ForEach-Object {$_.Matches.Value} | pandoc -f html -t 'markdown_strict-raw_html-native_divs-native_spans-bracketed_spans' --atx-headers

This poor attempt seeks to perform the replace, but only returns the original file with no changes:

$file = 'file.md'
$regex = '<start>.*?<end>'
$content = Get-Content $file -Raw

$a = $content | Select-String $regex -AllMatches
$b = $a | ForEach-Object {$_.Matches } | Foreach-Object {$_.Value} | Select-Object | pandoc -f html -t 'markdown_strict-raw_html-native_divs-native_spans-bracketed_spans' --atx-headers

$content | ForEach-Object {
    $_ -replace $a,$b }

I am struggling to move beyond these attempts. I am new at PowerShell. If this approach is wrong entirely I would be grateful to know. Thank you for any advice.

https://stackoverflow.com/a/1732454/62576 – Ken White, Nov 19 '18 at 02:11

score 1 · Accepted Answer · answered Nov 19 '18 at 03:53

Given the line-oriented nature of your input, you can process your input file line by line and decide for each line whether it needs transformation or not:

$file = 'file.md'
(Get-Content $file | ForEach-Object {
  if ($_ -match '^<') { # Is this an HTML line? - you could make this regex stricter
    $_ | pandoc -f html -t 'markdown_strict-raw_html-native_divs-native_spans-bracketed_spans' --atx-headers
  } else { # A non-HTML line, pass through as-is
    $_
  }
}) | Set-Content -Encoding Utf8 $file # be sure to choose the desired encoding

Note the (...) around the pipeline before Set-Content, which ensures that $file is read into memory in full up front; this is what allows writing back to the same file. This convenient approach does bear a slight risk of data loss if the command is interrupted before writing completes, so always create a backup of the input files first.
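If you would rather keep the <start>/<end> markers from the question than rely on detecting HTML lines, a per-block replacement could be sketched as below. This is only a sketch under stated assumptions: Convert-MarkedBlocks and Convert-HtmlWithPandoc are hypothetical helper names, the markers are assumed to never nest, and pandoc is assumed to be on the PATH. It relies on [regex]::Replace accepting a script block as its match evaluator, so each matched block is replaced by whatever the converter returns for it.

```powershell
# Hypothetical helper: replace every <start>...<end> block in $Text with the
# result of calling $Converter on the text between the markers.
function Convert-MarkedBlocks {
  param(
    [Parameter(Mandatory)] [string] $Text,
    [Parameter(Mandatory)] [scriptblock] $Converter
  )
  # (?s) lets . match line breaks in case a block ever wraps;
  # the lazy .*? keeps each match limited to a single block.
  $pattern = '(?s)<start>(.*?)<end>'
  [regex]::Replace($Text, $pattern, {
    param($m)
    # An external program returns one string per output line, so rejoin them.
    (& $Converter $m.Groups[1].Value) -join "`n"
  })
}

# Hypothetical wrapper around the pandoc command from the question
# (assumes pandoc is installed and on the PATH).
function Convert-HtmlWithPandoc {
  param([string] $Html)
  $Html | pandoc -f html -t 'markdown_strict-raw_html-native_divs-native_spans-bracketed_spans' --atx-headers
}
```

To use it, read the file with Get-Content $file -Raw, pass the text to Convert-MarkedBlocks with -Converter { param($html) Convert-HtmlWithPandoc $html }, and write the result back with Set-Content. The same caution applies as above: copy the input file to a backup (for example with Copy-Item) before overwriting it.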