Goal
Using PowerShell, find a string in a file, run a simple transformation script on that string, and replace the original string with the transformed string in the same file.
Details
- The file is a Markdown file with one or more HTML blocks inside.
- The goal is to make the entire file Markdown with no HTML.
- Pandoc is a command-line document converter that can transform HTML to Markdown.
- The transformation script is a Pandoc script.
- Pandoc alone cannot transform a Markdown file that includes HTML to Markdown.
- Each HTML block is one long string with no line breaks (see example below).
- The HTML is a little rough and sometimes not valid; despite this, Pandoc handles much of the transformation successfully. This may not be relevant.
- I cannot change the fact that the file is generated originally as part Markdown/part HTML, that the HTML is sometimes invalid, or that each HTML block is all on one line.
- PowerShell is required because that's the scripting language my team supports.
Example file of mixed Markdown/HTML code; most HTML is invalid
# Heading 1
Text
# Heading 2
<h3>Heading 3</h3><p>I am all on one line</h><span><div>I am not always valid HTML</div></span><br><h4>Heading 4<h4><ul><li>Item<br></li><li>Item</li><ul><span></span><img src="url" style="width:85px;">
# Heading 3
Text
# Heading 4
<h2>Heading 1</h2><div>Text</div><h2>Heading 2</h2><div>Text</div>
# Heading 5
<div><ul><li>Item</li><li>Item</li><li>Item</li></ul></div><code><pre><code><div>Code line 1</div><div>Code line 2</div><div>Code line 3</div></code></pre></code>
Text
Code for transformation script
pandoc -f html -t 'markdown_strict-raw_html-native_divs-native_spans-bracketed_spans' --atx-headers
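To apply this command to a single HTML string from PowerShell, the string can be piped to pandoc's standard input. The following is a minimal sketch, assuming pandoc is on the PATH. `Convert-HtmlToMarkdown` is a hypothetical helper name, not part of Pandoc or PowerShell, and the encoding lines assume the content may contain non-ASCII characters. Newer Pandoc releases deprecate `--atx-headers` in favor of `--markdown-headings=atx`, so the flag may need adjusting for the installed version.

```powershell
# Sketch only: Convert-HtmlToMarkdown is a hypothetical helper name.
# Assumes pandoc is on PATH and accepts the options used in this question.
function Convert-HtmlToMarkdown {
    param(
        [Parameter(Mandatory)]
        [string] $Html
    )

    # Send and read UTF-8 so non-ASCII characters survive the round trip.
    $utf8 = [System.Text.UTF8Encoding]::new($false)
    $OutputEncoding = $utf8                # encoding used when piping to pandoc
    $previous = [Console]::OutputEncoding
    [Console]::OutputEncoding = $utf8      # encoding used when reading pandoc's output
    try {
        # pandoc reads from standard input when no input file is given.
        $lines = $Html | pandoc -f html -t 'markdown_strict-raw_html-native_divs-native_spans-bracketed_spans' --atx-headers
        if ($LASTEXITCODE -ne 0) {
            throw "pandoc exited with code $LASTEXITCODE"
        }
        # PowerShell returns external output as one string per line; rejoin it.
        $lines -join "`n"
    }
    finally {
        # Restore the console encoding that was active before the call.
        [Console]::OutputEncoding = $previous
    }
}
```

Called as `Convert-HtmlToMarkdown -Html '<h2>Heading 1</h2><div>Text</div>'`, this should return the Markdown for that one block as a single string.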
Attempts
I surrounded each HTML block with <start> and <end> tags, with the goal of extracting the text between those tags with a regex, running the Pandoc script on it, and replacing the original text. My plan was to run a foreach loop to iterate through each block one by one.
This attempt transforms the HTML to Markdown, but does not return the original Markdown with it:
$file = 'file.md'
$regex = '<start>.*?<end>'
$a = Get-Content $file -Raw
$a | Select-String $regex -AllMatches | ForEach-Object {$_.Matches.Value} | pandoc -f html -t 'markdown_strict-raw_html-native_divs-native_spans-bracketed_spans' --atx-headers
This poor attempt seeks to perform the replace, but only returns the original file with no changes:
$file = 'file.md'
$regex = '<start>.*?<end>'
$content = Get-Content $file -Raw
$a = $content | Select-String $regex -AllMatches
$b = $a | ForEach-Object {$_.Matches } | Foreach-Object {$_.Value} | Select-Object | pandoc -f html -t 'markdown_strict-raw_html-native_divs-native_spans-bracketed_spans' --atx-headers
$content | ForEach-Object {
$_ -replace $a,$b }
I am struggling to move beyond these attempts. I am new to PowerShell. If this approach is entirely wrong, I would be grateful to know. Thank you for any advice.
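For reference, here is a minimal sketch of the extract, convert, and replace plan described under Attempts, assuming the <start> and <end> markers are already in place. `Convert-HtmlBlocks` and its `-Converter` parameter are hypothetical names. The converter is passed in as a script block so the pandoc call from this question can be plugged in without the sketch depending on it. Instead of `-replace`, which interprets the search text as a regular expression, it uses the position of each match to swap the text.

```powershell
# Sketch only: Convert-HtmlBlocks and -Converter are hypothetical names.
function Convert-HtmlBlocks {
    param(
        [Parameter(Mandatory)]
        [string] $Text,

        # Script block that takes one HTML string and returns Markdown text.
        [Parameter(Mandatory)]
        [scriptblock] $Converter
    )

    # (?s) lets . cross line breaks; group 1 is the HTML between the markers.
    $found = [regex]::Matches($Text, '(?s)<start>(.*?)<end>')

    $result = $Text
    # Work from the last match to the first so earlier match positions stay valid.
    for ($i = $found.Count - 1; $i -ge 0; $i--) {
        $m = $found[$i]
        $html = $m.Groups[1].Value
        # The converter may return several lines; join them into one string.
        $markdown = (& $Converter $html) -join "`n"
        # Swap the whole match, markers included, for the converted text.
        $result = $result.Remove($m.Index, $m.Length).Insert($m.Index, $markdown)
    }
    $result
}

# Usage sketch with the file name and pandoc options from the question:
# $content   = Get-Content 'file.md' -Raw
# $converted = Convert-HtmlBlocks -Text $content -Converter {
#     param($html)
#     $html | pandoc -f html -t 'markdown_strict-raw_html-native_divs-native_spans-bracketed_spans' --atx-headers
# }
# Set-Content 'file.md' -Value $converted -NoNewline
```

Because the surrounding Markdown is never sent to pandoc, only the text between the markers changes. Whether the converted output is acceptable still depends on how pandoc copes with the invalid HTML.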